[Enhancement] Improve GitHub Actions permissions check and refine performance regression testing #1519
Conversation
…formance regression testing
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run …. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
📝 Walkthrough

Comprehensive refactor of regression testing workflow and tooling: the GitHub Actions workflow now supports dynamic matrix-based runners with environment isolation and CUDA configuration, examples are updated with simplified benchmarking signatures and kernel parameters, and test infrastructure is enhanced with retry logic and Matplotlib-based result visualization.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant GH as GitHub Actions
participant Auth as Permission Check
participant Setup as Environment Setup
participant PR as PR Environment
participant Main as Main/Baseline Environment
participant Test as Regression Tests
participant Report as Result Reporter
GH->>Auth: Verify user authorization
alt Unauthorized
Auth-->>GH: Reject workflow
else Authorized
Auth-->>GH: Permit continuation
par Self-Hosted Setup
Setup->>Setup: Mask secrets
Setup->>Setup: Configure cache dirs<br/>(XDG, PIP, UV, etc.)
Setup->>Setup: Disable ccache
and GitHub-Hosted Setup
Setup->>Setup: Setup ccache
Setup->>Setup: Enable CUDA config
Setup->>Setup: Configure core dumps
end
Setup->>PR: Create PR environment<br/>(uv venv)
Setup->>Main: Create main environment<br/>(uv venv)
PR->>Test: Activate PR env
Test->>Test: Run regression tests
Main->>Test: Activate main env
Test->>Test: Run baseline tests
Test->>Report: Collect latency<br/>(with retry logic)
alt Latency ≤ 0 after retries
Report->>Report: Emit warning
else Valid latency
Report->>Report: Record result
end
Report->>Report: Generate visualization
Report->>GH: Upload artifacts
Report->>GH: Post PR comment
end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/pr-regression-test-bot.yml (1)
8-11: Add `actions: write` permission to enable cache saving.

The `permissions` block only sets `contents: read`, which implicitly sets unspecified permissions (like `actions`) to `none`. This prevents `actions/cache` and `setup-uv`'s caching features from saving caches—they will restore but not save. Based on learnings, the `actions: write` permission is required for cache operations in GitHub Actions.

Proposed fix

```diff
 permissions:
   contents: read
+  actions: write
   issues: write
   pull-requests: write
```
🧹 Nitpick comments (5)
examples/flash_attention/example_mha_fwd_varlen.py (1)
258-259: Consider aligning comment with code or explaining the discrepancy.

The comment recommends `(128, 128, 2 or 3, 256)` for Hopper, but the code uses `(64, 64, 1, 128)`. If the `main` function intentionally uses more conservative parameters for broader GPU compatibility, consider updating the comment to clarify this distinction from the Hopper-optimized parameters now used in `run_regression_perf`.

tilelang/testing/perf_regression.py (1)
64-81: Consider adjusting `stacklevel` in `warnings.warn`.

The `stacklevel=1` points to the `warnings.warn` call itself. To point to the caller of `process_func` (more useful for debugging), use `stacklevel=2`.

Proposed fix

```diff
 if latency <= 0.0:
-    warnings.warn(f"{result_name} has latency {latency} <= 0. Please verify the profiling results.", RuntimeWarning, 1)
+    warnings.warn(f"{result_name} has latency {latency} <= 0. Please verify the profiling results.", RuntimeWarning, stacklevel=2)
     return
```

maint/scripts/test_perf_regression.py (2)
149-150: Consider using `.loc[]` instead of `.iloc[]` for clarity.

After `reset_index(drop=True)`, `idxmax()`/`idxmin()` return index labels (which happen to be positional integers 0..n-1). Using `.iloc[]` works here because the labels equal positions, but `.loc[]` is semantically correct when working with index labels.

Proposed fix

```diff
-best = df.iloc[df["Speedup"].idxmax()]
-worst = df.iloc[df["Speedup"].idxmin()]
+best = df.loc[df["Speedup"].idxmax()]
+worst = df.loc[df["Speedup"].idxmin()]
```
73-82: Consider scoping `rcParams` changes to avoid global side effects.

Modifying `plt.rcParams` globally affects all subsequent plots in the same process. Consider using a context manager.

Proposed fix using context manager

```diff
-    plt.rcParams.update(
-        {
-            "figure.dpi": 120,
-            "savefig.dpi": 300,
-            "axes.titlesize": 16,
-            "axes.labelsize": 12,
-            "xtick.labelsize": 10,
-            "ytick.labelsize": 10,
-        }
-    )
+    with plt.rc_context({
+        "figure.dpi": 120,
+        "savefig.dpi": 300,
+        "axes.titlesize": 16,
+        "axes.labelsize": 12,
+        "xtick.labelsize": 10,
+        "ytick.labelsize": 10,
+    }):
+        # ... rest of the function body indented here
```

Note: This would require restructuring the function to place remaining code inside the `with` block.

examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py (1)
345-350: Optional: `dO.requires_grad_()` is unnecessary.

Line 350 sets `requires_grad_()` on `dO`, but upstream gradients passed to `.backward()` don't need gradients themselves. This is harmless but adds unnecessary overhead.

🔎 Proposed simplification

```diff
-    dO = torch.empty(BATCH, N_CTX, H, D_HEAD_V, dtype=torch.half, device="cuda").normal_().requires_grad_()
+    dO = torch.empty(BATCH, N_CTX, H, D_HEAD_V, dtype=torch.half, device="cuda").normal_()
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- .github/workflows/pr-regression-test-bot.yml
- examples/dynamic_shape/example_dynamic.py
- examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py
- examples/flash_attention/example_gqa_fwd_bshd.py
- examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py
- examples/flash_attention/example_mha_fwd_varlen.py
- examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py
- examples/gemm_streamk/regression_example_tilelang_gemm_splitk.py
- maint/scripts/test_perf_regression.py
- tilelang/testing/perf_regression.py
💤 Files with no reviewable changes (2)
- examples/flash_attention/example_gqa_fwd_bshd.py
- examples/gemm_streamk/regression_example_tilelang_gemm_splitk.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cache v4 and setup-python's cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
examples/dynamic_shape/example_dynamic.py
🧬 Code graph analysis (3)
examples/flash_attention/example_mha_fwd_varlen.py (2)
examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py (1)
flashattn (32-170)

examples/flash_attention/example_mha_fwd_bshd_wgmma_pipelined.py (1)
flashattn (23-149)
examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py (1)
tilelang/profiler/__init__.py (1)
do_bench(209-269)
examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py (5)
examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py (1)
run_regression_perf (233-244)

examples/flash_attention/example_mha_bwd_bshd_wgmma_pipelined.py (3)

run_regression_perf (323-352), run1 (312-313), backward (235-256)

examples/flash_attention/example_mha_bwd_bshd.py (3)

run_regression_perf (346-374), run1 (333-334), backward (256-278)

examples/flash_attention/example_gqa_bwd_tma_reduce_varlen.py (2)

run1 (681-682), backward (531-602)

examples/flash_attention/example_gqa_bwd_tma_reduce.py (2)

run1 (497-498), backward (379-426)
🪛 Ruff (0.14.10)
examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py
351-351: Ambiguous variable name: O
(E741)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (16)
examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py (1)
230-230: LGTM! Correct migration to type constants.

The changes properly replace string literals with type constants (`T.float8_e4m3fn` and `T.float8_e5m2`), which aligns with the `tl_matmul` function signature and its dtype assertions at lines 41-46.

Also applies to: 234-234
examples/dynamic_shape/example_dynamic.py (1)
7-7: Decorator simplification is safe—configurations removed are not used elsewhere in the codebase.

The two pass configurations (`tl.disable_dynamic_tail_split` and `tl.dynamic_alignment`) removed from the decorator were unique to this file and do not appear in any other example, test, or kernel in the repository. The file retains its correctness validation (`torch.testing.assert_close` at line 90) and performance measurement capability (`run_regression_perf` at lines 108–128). The simplification aligns with the PR's goal of refining performance regression testing.

examples/flash_attention/example_mha_fwd_varlen.py (1)
338-338: Add runtime verification that the regression test GPU is Hopper architecture.

The parameters `block_M=128, block_N=128, num_stages=2, threads=256` are documented as Hopper-recommended (line 259), but the regression test applies them unconditionally without verifying the GPU architecture. Other examples in this codebase (e.g., `example_convolution.py`, `example_conv_analyze.py`) implement `check_hopper()` functions before using architecture-specific parameters. Ensure a similar check is added here to guarantee valid performance baselines.
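For illustration only, a minimal sketch of such a guard; the exact `check_hopper()` helper, its capability check, and the fallback parameters below are assumptions rather than the repository's actual implementation:

```python
import torch

def check_hopper() -> bool:
    # Hypothetical helper: Hopper GPUs report CUDA compute capability 9.x.
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major == 9

def run_regression_perf():
    if check_hopper():
        # Hopper-recommended configuration referenced by the comment at line 259.
        block_M, block_N, num_stages, threads = 128, 128, 2, 256
    else:
        # Conservative fallback matching the parameters used in main().
        block_M, block_N, num_stages, threads = 64, 64, 1, 128
    # ... build and benchmark the kernel with these parameters ...
```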
tilelang/testing/perf_regression.py (2)

12-12: LGTM!

Adding the `warnings` import to support structured warning emission is appropriate.
35-35: LGTM!

A module-level constant for the retry limit is a clean approach.
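As a rough illustration of the pattern being referenced (the constant name, the helper, and its signature below are assumptions based on the walkthrough, not verified against the module):

```python
import warnings

MAX_RETRY_TIMES = 3  # assumed module-level retry limit

def collect_latency(result_name, bench_fn):
    """Retry the benchmark a bounded number of times; warn if it never yields a valid latency."""
    latency = -1.0
    for _ in range(MAX_RETRY_TIMES):
        latency = bench_fn()
        if latency > 0.0:
            return latency
    warnings.warn(
        f"{result_name} has latency {latency} <= 0. Please verify the profiling results.",
        RuntimeWarning,
        stacklevel=2,
    )
    return None
```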
maint/scripts/test_perf_regression.py (2)
8-9: LGTM!

Adding the `numpy` and `textwrap` imports for enhanced visualization logic is appropriate.
56-57: LGTM!

Good defensive check for `None` or empty DataFrame before proceeding with visualization.

.github/workflows/pr-regression-test-bot.yml (7)
53-67: Good security practice with permission gating.

Checking collaborator permissions before running expensive operations prevents abuse of the `@regression-perf` trigger by unauthorized users.
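For readers unfamiliar with the pattern, a minimal sketch of such a gate (step name, required permission levels, and error text are assumptions; the workflow's actual implementation may differ):

```yaml
- name: Check that the commenter may trigger regression runs
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    level=$(gh api \
      "repos/${{ github.repository }}/collaborators/${{ github.actor }}/permission" \
      --jq .permission)
    if [ "$level" != "admin" ] && [ "$level" != "write" ]; then
      echo "::error::${{ github.actor }} is not authorized to trigger regression tests."
      exit 1
    fi
```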
76-94: LGTM!

Good approach to use `runner.tool_cache` for self-hosted runners to share caches between jobs and avoid repeated downloads. Secret masking adds a security layer.
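As a generic illustration of the technique (directory layout and secret name below are hypothetical, not taken from the workflow):

```yaml
- name: Point caches at the shared tool cache (self-hosted)
  run: |
    echo "XDG_CACHE_HOME=${{ runner.tool_cache }}/.cache" >> "$GITHUB_ENV"
    echo "PIP_CACHE_DIR=${{ runner.tool_cache }}/pip" >> "$GITHUB_ENV"
    echo "UV_CACHE_DIR=${{ runner.tool_cache }}/uv" >> "$GITHUB_ENV"

- name: Mask secrets from logs
  run: echo "::add-mask::$EXAMPLE_TOKEN"
  env:
    EXAMPLE_TOKEN: ${{ secrets.EXAMPLE_TOKEN }}
```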
112-146: LGTM!

CUDA environment setup with dynamic version parsing and fallback PATH configuration is well-structured. The nvcc presence check with warning helps identify misconfigured runners.
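A rough sketch of that kind of check, assuming a standard /usr/local/cuda layout (not copied from the workflow):

```bash
if command -v nvcc >/dev/null 2>&1; then
  # Turn "release 12.4, V12.4.131" style output into "12.4".
  cuda_version=$(nvcc --version | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')
  echo "CUDA_VERSION=${cuda_version}" >> "$GITHUB_ENV"
else
  echo "::warning::nvcc not found on PATH; falling back to /usr/local/cuda/bin"
  echo "/usr/local/cuda/bin" >> "$GITHUB_PATH"
fi
```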
148-164: LGTM!

The uv setup properly differentiates between self-hosted and GitHub-hosted runners for caching behavior. The cache suffix includes all relevant matrix dimensions.
166-187: LGTM!

Clean separation between PR and baseline environments. The `git clean` step correctly preserves the PR installation while preparing for baseline checkout.
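For context, such a setup might be sketched as follows (venv names and the excluded path are assumptions, not the workflow's actual values):

```bash
# Build and install the PR revision into its own environment.
uv venv .venv-pr
source .venv-pr/bin/activate
uv pip install -e .
deactivate

# Reset the working tree for the baseline checkout while keeping the PR venv.
git clean -xfd -e .venv-pr
git checkout main
uv venv .venv-main
```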
200-214: LGTM!

Core dump generation is appropriately limited to GitHub-hosted runners where you control the environment. Including version/toolkit in the pattern helps identify which configuration produced a crash.
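A generic example of enabling core dumps with an identifying pattern (the placeholder below omits the version/toolkit metadata the workflow reportedly embeds):

```bash
ulimit -c unlimited
# %e = executable name, %p = PID; extend the pattern with identifying metadata as needed.
sudo sysctl -w kernel.core_pattern="/tmp/core.%e.%p"
```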
236-264: LGTM!

Reading the markdown file directly (line 243) avoids escaping issues. The comment body is well-structured with attribution, workflow link, and artifact instructions.
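One simple way to post a markdown file verbatim as a PR comment, shown only as an illustration (file name and PR number variable are hypothetical):

```bash
gh pr comment "$PR_NUMBER" --body-file regression_report.md
```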
examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py (1)
233-240: LGTM: Simplified function signature.

Removing the unused `tune` parameter from `run_regression_perf()` simplifies the API. Verified that no callers pass this parameter to the function—the regression test suite calls it without `tune`, and no other imports were found using it.

examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py (1)
342-344: LGTM: Clear signature with explicit type hints.

The updated signature with typed parameters and sensible defaults improves clarity for regression testing.
Diff excerpt under review:

```diff
     O = attention(Q, K, V, causal, groups)

     def run1():
         O.backward(dO, retain_graph=True)

-    def run_kernel_only():
-        kernel(Q, K, V, dO, lse, Delta, dQ, dK, dV)
     from tilelang.profiler import do_bench

-    return do_bench(run_kernel_only, warmup=10, rep=100, backend="cupti")
+    return do_bench(run1, warmup=500, backend="cupti")
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's find the do_bench implementation
echo "=== Searching for do_bench implementation ==="
find . -name "*.py" -type f | xargs grep -l "def do_bench" | head -5
# Search specifically in tilelang/profiler
find . -path "*tilelang/profiler*" -name "*.py" -type f
# Show the do_bench implementation
echo -e "\n=== do_bench implementation ==="
rg -A 20 "def do_bench" --type py

Repository: tile-ai/tilelang
Length of output: 9418
🏁 Script executed:
#!/bin/bash
# Check the actual file at the specified lines
echo "=== Checking example_gqa_bwd_wgmma_pipelined.py context ==="
wc -l examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py

Repository: tile-ai/tilelang
Length of output: 183
🏁 Script executed:
#!/bin/bash
# Get the context around lines 334 and 358 in the target file
echo "=== Context around line 334 (main function) ==="
sed -n '330,340p' examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py
echo -e "\n=== Context around line 358 (do_bench call) ==="
sed -n '350,360p' examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py
# Also check the full return statement and benchmark section
echo -e "\n=== Full benchmarking section context ==="
sed -n '345,360p' examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py

Repository: tile-ai/tilelang
Length of output: 1594
🏁 Script executed:
#!/bin/bash
# Search for backend parameter usage patterns in flash_attention examples
echo "=== Backend usage in flash_attention examples ==="
rg -n -B 2 -A 2 'backend\s*=' examples/flash_attention/ --type py

Repository: tile-ai/tilelang
Length of output: 6228
🏁 Script executed:
#!/bin/bash
# Get the full do_bench implementation from bench.py
echo "=== Full do_bench implementation ==="
sed -n '1,150p' tilelang/profiler/bench.py | head -100
# Also check if there's any gradient-clearing logic
echo -e "\n=== Checking for grad clearing logic in bench.py ==="
rg -n "grad|zero" tilelang/profiler/bench.pyRepository: tile-ai/tilelang
Length of output: 3891
🏁 Script executed:
#!/bin/bash
# Let's see the entire bench.py file to understand gradient handling
echo "=== bench.py file info ==="
wc -l tilelang/profiler/bench.py
echo -e "\n=== Full bench.py content ==="
cat -n tilelang/profiler/bench.py

Repository: tile-ai/tilelang
Length of output: 8776
🏁 Script executed:
#!/bin/bash
# Check the actual run function definition and how do_bench uses it
echo "=== Context around lines 333-345 in the target file ==="
sed -n '325,345p' examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py

Repository: tile-ai/tilelang
Length of output: 873
Clear gradients between backward iterations for consistent timing.
The run1() closure calls O.backward(dO, retain_graph=True) repeatedly without clearing gradients. Since do_bench() does not zero gradients between iterations, Q.grad, K.grad, and V.grad will accumulate values across warmup and benchmark iterations. This causes timing variations between the first iteration (gradient buffer allocation) and subsequent iterations, contaminating the measurement.
Add Q.grad = None; K.grad = None; V.grad = None before the backward call in run1(), or use torch.no_grad() if gradient tracking isn't needed for the benchmark.
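For example, a minimal sketch of the closure with gradient clearing (a suggestion only, reusing the variable names from the excerpt above):

```python
def run1():
    # Reset accumulated gradients so every timed iteration starts from the same state.
    Q.grad = None
    K.grad = None
    V.grad = None
    O.backward(dO, retain_graph=True)
```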
The backend="cupti" choice is intentional and consistent with other backward examples in the codebase.
🧰 Tools
🪛 Ruff (0.14.10)
351-351: Ambiguous variable name: O
(E741)
🤖 Prompt for AI Agents
In examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py around lines 351
to 358 the benchmark closure repeatedly calls O.backward(dO, retain_graph=True)
without clearing gradients, causing Q.grad/K.grad/V.grad to accumulate and skew
timing; before the backward call, clear gradients by setting Q.grad = None;
K.grad = None; V.grad = None (or wrap the benchmark in torch.no_grad() if
appropriate) so each iteration starts with fresh gradient buffers and produces
consistent timing measurements.
Diff excerpt under review:

```diff
     profiler = kernel.get_profiler(tensor_supply_type=tilelang.TensorSupplyType.Normal)
-    return profiler.do_bench(warmup=500, backend="cupti")
+    return profiler.do_bench(backend="cupti")
```
🧩 Analysis chain
🏁 Script executed:
# First, let's read the file to see the code around lines 216-244
cat -n examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py | sed -n '210,250p'

Repository: tile-ai/tilelang
Length of output: 2204
🏁 Script executed:
# Find the do_bench method definition
rg -n "def do_bench" --type=py -B 2 -A 10Repository: tile-ai/tilelang
Length of output: 5613
🏁 Script executed:
# Search for CUPTI references in GitHub Actions workflows and CI configuration
fd -e yml -e yaml . .github/workflows/ 2>/dev/null --exec rg -l 'cupti|CUPTI' \;
# Also check for any CI configuration files
fd -e yml -e yaml -e toml -e cfg . -maxdepth 3 2>/dev/null --exec rg -l 'cupti|CUPTI' \;

Repository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
# Let's check what CI workflows exist
fd -e yml -e yaml . .github/workflows/ 2>/dev/null
# Also search for any CUPTI references in the entire codebase
rg -i 'cupti' --type=py --type=yml

Repository: tile-ai/tilelang
Length of output: 274
🏁 Script executed:
# Search for CUPTI references in CI workflows without type restriction
rg -i 'cupti' .github/workflows/
# Also check the profiler backend configuration to understand CUPTI usage
rg -i 'cupti' examples/flash_attention/
# Check if there are any requirements or setup for CUPTI
rg -i 'cupti\|profiler' .github/workflows/pr-regression-test-bot.yml

Repository: tile-ai/tilelang
Length of output: 1570
🏁 Script executed:
# Check the pr-regression-test-bot.yml workflow to see what environment is used
cat .github/workflows/pr-regression-test-bot.yml
# Also check ci.yml for any CUPTI setup
cat .github/workflows/ci.yml

Repository: tile-ai/tilelang
Length of output: 30049
🏁 Script executed:
# Let's check the regression test script to understand how it's used
cat maint/scripts/test_perf_regression.py | head -100
# Also check for similar patterns in other flash attention files
rg -n "warmup=" examples/flash_attention/ -A 1 -B 1 | head -60Repository: tile-ai/tilelang
Length of output: 8040
🏁 Script executed:
# Check if there are environment considerations or failures related to CUPTI in tests
rg -n "CUPTI|cupti|backend.*cupti" testing/ --type py -B 2 -A 2
# Let's also check similar regression tests to understand the pattern better
grep -r "def run_regression_perf" examples/flash_attention/ -A 10Repository: tile-ai/tilelang
Length of output: 14240
Add warmup parameter to match main() profiling for consistent regression testing.
The run_regression_perf() invocation at line 244 omits the warmup parameter, which defaults to 25ms, while the main() function uses warmup=500 (lines 216, 219). This inconsistency degrades measurement reliability; flash attention kernels require adequate warmup to reach steady state and exclude JIT compilation overhead. Using insufficient warmup in regression testing produces unstable performance numbers, making reliable performance comparisons impossible.
Similar functions in the codebase (e.g., example_mha_fwd_bshd_wgmma_pipelined.py) correctly specify warmup=500 in their regression tests.
```diff
-    return profiler.do_bench(backend="cupti")
+    return profiler.do_bench(warmup=500, backend="cupti")
```

🤖 Prompt for AI Agents
In examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py around line
244, the run_regression_perf() call omits the warmup parameter causing
inconsistent and unstable profiling; update the call to pass warmup=500 (match
main() lines ~216–219) so the regression benchmark uses the same warmup duration
as main(), ensuring sufficient warmup for flash-attention kernels and consistent
measurements.
Diff excerpt under review:

```python
def run_regression_perf():
    M, N, K = 128, 128, 128
    out_dtype, accum_dtype = "float32", "float32"
```
Critical: Inconsistent dtype specification causes assertion failure.
Line 229 uses string literals "float32" for out_dtype and accum_dtype, but tl_matmul expects type constants. When these strings are passed to tl_matmul at lines 231 and 235, they will fail the assertion at lines 47-51, which checks out_dtype in [T.float16, T.float32, T.int32]. The string "float32" is not equal to the type constant T.float32.
🔎 Proposed fix
```diff
-    out_dtype, accum_dtype = "float32", "float32"
+    out_dtype, accum_dtype = T.float32, T.float32
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-    out_dtype, accum_dtype = "float32", "float32"
+    out_dtype, accum_dtype = T.float32, T.float32
```
🤖 Prompt for AI Agents
In examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py around line 229 (and
the subsequent tl_matmul calls at ~231 and ~235), out_dtype and accum_dtype are
set to the string "float32" but tl_matmul expects type constants (e.g.,
T.float32). Replace the string literals with the proper type constants (for
example: out_dtype = T.float32, accum_dtype = T.float32) so the values passed
into tl_matmul match the assertion checks; ensure any other places that used
those variables expect the same T.* constants.
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Refactoring