[Enhancement] Improve GitHub Actions permissions check and refine performance regression testing #1519
Conversation
…formance regression testing
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run …. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
📝 Walkthrough

Comprehensive refactor of regression testing workflow and tooling: the GitHub Actions workflow now supports dynamic matrix-based runners with environment isolation and CUDA configuration, examples are updated with simplified benchmarking signatures and kernel parameters, and test infrastructure is enhanced with retry logic and Matplotlib-based result visualization.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant GH as GitHub Actions
participant Auth as Permission Check
participant Setup as Environment Setup
participant PR as PR Environment
participant Main as Main/Baseline Environment
participant Test as Regression Tests
participant Report as Result Reporter
GH->>Auth: Verify user authorization
alt Unauthorized
Auth-->>GH: Reject workflow
else Authorized
Auth-->>GH: Permit continuation
par Self-Hosted Setup
Setup->>Setup: Mask secrets
Setup->>Setup: Configure cache dirs<br/>(XDG, PIP, UV, etc.)
Setup->>Setup: Disable ccache
and GitHub-Hosted Setup
Setup->>Setup: Setup ccache
Setup->>Setup: Enable CUDA config
Setup->>Setup: Configure core dumps
end
Setup->>PR: Create PR environment<br/>(uv venv)
Setup->>Main: Create main environment<br/>(uv venv)
PR->>Test: Activate PR env
Test->>Test: Run regression tests
Main->>Test: Activate main env
Test->>Test: Run baseline tests
Test->>Report: Collect latency<br/>(with retry logic)
alt Latency ≤ 0 after retries
Report->>Report: Emit warning
else Valid latency
Report->>Report: Record result
end
Report->>Report: Generate visualization
Report->>GH: Upload artifacts
Report->>GH: Post PR comment
end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/pr-regression-test-bot.yml (1)
8-11: Add `actions: write` permission to enable cache saving.

The `permissions` block only sets `contents: read`, which implicitly sets unspecified permissions (like `actions`) to `none`. This prevents `actions/cache` and `setup-uv`'s caching features from saving caches—they will restore but not save. Based on learnings, the `actions: write` permission is required for cache operations in GitHub Actions.

Proposed fix

```diff
 permissions:
   contents: read
+  actions: write
   issues: write
   pull-requests: write
```
🧹 Nitpick comments (5)
examples/flash_attention/example_mha_fwd_varlen.py (1)
258-259: Consider aligning comment with code or explaining the discrepancy.

The comment recommends `(128, 128, 2 or 3, 256)` for Hopper, but the code uses `(64, 64, 1, 128)`. If the `main` function intentionally uses more conservative parameters for broader GPU compatibility, consider updating the comment to clarify this distinction from the Hopper-optimized parameters now used in `run_regression_perf`.

tilelang/testing/perf_regression.py (1)
64-81: Consider adjusting `stacklevel` in `warnings.warn`.

The `stacklevel=1` points to the `warnings.warn` call itself. To point to the caller of `process_func` (more useful for debugging), use `stacklevel=2`.

Proposed fix

```diff
 if latency <= 0.0:
-    warnings.warn(f"{result_name} has latency {latency} <= 0. Please verify the profiling results.", RuntimeWarning, 1)
+    warnings.warn(f"{result_name} has latency {latency} <= 0. Please verify the profiling results.", RuntimeWarning, stacklevel=2)
     return
```

maint/scripts/test_perf_regression.py (2)
149-150: Consider using `.loc[]` instead of `.iloc[]` for clarity.

After `reset_index(drop=True)`, `idxmax()`/`idxmin()` return index labels (which happen to be positional integers 0..n-1). Using `.iloc[]` works here because the labels equal positions, but `.loc[]` is semantically correct when working with index labels.

Proposed fix

```diff
-best = df.iloc[df["Speedup"].idxmax()]
-worst = df.iloc[df["Speedup"].idxmin()]
+best = df.loc[df["Speedup"].idxmax()]
+worst = df.loc[df["Speedup"].idxmin()]
```
73-82: Consider scoping `rcParams` changes to avoid global side effects.

Modifying `plt.rcParams` globally affects all subsequent plots in the same process. Consider using a context manager.

Proposed fix using context manager

```diff
-    plt.rcParams.update(
-        {
-            "figure.dpi": 120,
-            "savefig.dpi": 300,
-            "axes.titlesize": 16,
-            "axes.labelsize": 12,
-            "xtick.labelsize": 10,
-            "ytick.labelsize": 10,
-        }
-    )
+    with plt.rc_context({
+        "figure.dpi": 120,
+        "savefig.dpi": 300,
+        "axes.titlesize": 16,
+        "axes.labelsize": 12,
+        "xtick.labelsize": 10,
+        "ytick.labelsize": 10,
+    }):
+        # ... rest of the function body indented here
```

Note: This would require restructuring the function to place remaining code inside the `with` block.

examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py (1)
345-350: Optional: `dO.requires_grad_()` is unnecessary.

Line 350 sets `requires_grad_()` on `dO`, but upstream gradients passed to `.backward()` don't need gradients themselves. This is harmless but adds unnecessary overhead.

🔎 Proposed simplification

```diff
-    dO = torch.empty(BATCH, N_CTX, H, D_HEAD_V, dtype=torch.half, device="cuda").normal_().requires_grad_()
+    dO = torch.empty(BATCH, N_CTX, H, D_HEAD_V, dtype=torch.half, device="cuda").normal_()
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- .github/workflows/pr-regression-test-bot.yml
- examples/dynamic_shape/example_dynamic.py
- examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py
- examples/flash_attention/example_gqa_fwd_bshd.py
- examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py
- examples/flash_attention/example_mha_fwd_varlen.py
- examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py
- examples/gemm_streamk/regression_example_tilelang_gemm_splitk.py
- maint/scripts/test_perf_regression.py
- tilelang/testing/perf_regression.py
💤 Files with no reviewable changes (2)
- examples/flash_attention/example_gqa_fwd_bshd.py
- examples/gemm_streamk/regression_example_tilelang_gemm_splitk.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cache v4 and setup-python's cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
examples/dynamic_shape/example_dynamic.py
🧬 Code graph analysis (3)
examples/flash_attention/example_mha_fwd_varlen.py (2)
examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py (1)
flashattn (32-170)

examples/flash_attention/example_mha_fwd_bshd_wgmma_pipelined.py (1)
flashattn (23-149)
examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py (1)
tilelang/profiler/__init__.py (1)
do_bench(209-269)
examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py (5)
examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py (1)
run_regression_perf (233-244)

examples/flash_attention/example_mha_bwd_bshd_wgmma_pipelined.py (3)

run_regression_perf (323-352), run1 (312-313), backward (235-256)

examples/flash_attention/example_mha_bwd_bshd.py (3)

run_regression_perf (346-374), run1 (333-334), backward (256-278)

examples/flash_attention/example_gqa_bwd_tma_reduce_varlen.py (2)

run1 (681-682), backward (531-602)

examples/flash_attention/example_gqa_bwd_tma_reduce.py (2)

run1 (497-498), backward (379-426)
🪛 Ruff (0.14.10)
examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py
351-351: Ambiguous variable name: O
(E741)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (16)
examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py (1)
230-230: LGTM! Correct migration to type constants.

The changes properly replace string literals with type constants (`T.float8_e4m3fn` and `T.float8_e5m2`), which aligns with the `tl_matmul` function signature and its dtype assertions at lines 41-46.

Also applies to: 234-234
examples/dynamic_shape/example_dynamic.py (1)
7-7: Decorator simplification is safe—configurations removed are not used elsewhere in the codebase.

The two pass configurations (`tl.disable_dynamic_tail_split` and `tl.dynamic_alignment`) removed from the decorator were unique to this file and do not appear in any other example, test, or kernel in the repository. The file retains its correctness validation (`torch.testing.assert_close` at line 90) and performance measurement capability (`run_regression_perf` at lines 108–128). The simplification aligns with the PR's goal of refining performance regression testing.

examples/flash_attention/example_mha_fwd_varlen.py (1)
338-338: Add runtime verification that the regression test GPU is Hopper architecture.

The parameters `block_M=128, block_N=128, num_stages=2, threads=256` are documented as Hopper-recommended (line 259), but the regression test applies them unconditionally without verifying the GPU architecture. Other examples in this codebase (e.g., `example_convolution.py`, `example_conv_analyze.py`) implement `check_hopper()` functions before using architecture-specific parameters. Ensure a similar check is added here to guarantee valid performance baselines.
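For illustration only, a minimal sketch of such a guard; the exact `check_hopper()` helper, its capability check, and the fallback parameters below are assumptions rather than the repository's actual implementation:

```python
import torch

def check_hopper() -> bool:
    # Hypothetical helper: Hopper GPUs report CUDA compute capability 9.x.
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major == 9

def run_regression_perf():
    if check_hopper():
        # Hopper-recommended configuration referenced by the comment at line 259.
        block_M, block_N, num_stages, threads = 128, 128, 2, 256
    else:
        # Conservative fallback matching the parameters used in main().
        block_M, block_N, num_stages, threads = 64, 64, 1, 128
    # ... build and benchmark the kernel with these parameters ...
```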
tilelang/testing/perf_regression.py (2)

12-12: LGTM!

Adding the `warnings` import to support structured warning emission is appropriate.
35-35: LGTM!

A module-level constant for the retry limit is a clean approach.
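As a rough illustration of the pattern being referenced (the constant name, the helper, and its signature below are assumptions based on the walkthrough, not verified against the module):

```python
import warnings

MAX_RETRY_TIMES = 3  # assumed module-level retry limit

def collect_latency(result_name, bench_fn):
    """Retry the benchmark a bounded number of times; warn if it never yields a valid latency."""
    latency = -1.0
    for _ in range(MAX_RETRY_TIMES):
        latency = bench_fn()
        if latency > 0.0:
            return latency
    warnings.warn(
        f"{result_name} has latency {latency} <= 0. Please verify the profiling results.",
        RuntimeWarning,
        stacklevel=2,
    )
    return None
```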
maint/scripts/test_perf_regression.py (2)
8-9: LGTM!

Adding the `numpy` and `textwrap` imports for enhanced visualization logic is appropriate.
56-57: LGTM!

Good defensive check for `None` or empty DataFrame before proceeding with visualization.

.github/workflows/pr-regression-test-bot.yml (7)
53-67: Good security practice with permission gating.

Checking collaborator permissions before running expensive operations prevents abuse of the `@regression-perf` trigger by unauthorized users.
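For readers unfamiliar with the pattern, a minimal sketch of such a gate (step name, required permission levels, and error text are assumptions; the workflow's actual implementation may differ):

```yaml
- name: Check that the commenter may trigger regression runs
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    level=$(gh api \
      "repos/${{ github.repository }}/collaborators/${{ github.actor }}/permission" \
      --jq .permission)
    if [ "$level" != "admin" ] && [ "$level" != "write" ]; then
      echo "::error::${{ github.actor }} is not authorized to trigger regression tests."
      exit 1
    fi
```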
76-94: LGTM!

Good approach to use `runner.tool_cache` for self-hosted runners to share caches between jobs and avoid repeated downloads. Secret masking adds a security layer.
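As a generic illustration of the technique (directory layout and secret name below are hypothetical, not taken from the workflow):

```yaml
- name: Point caches at the shared tool cache (self-hosted)
  run: |
    echo "XDG_CACHE_HOME=${{ runner.tool_cache }}/.cache" >> "$GITHUB_ENV"
    echo "PIP_CACHE_DIR=${{ runner.tool_cache }}/pip" >> "$GITHUB_ENV"
    echo "UV_CACHE_DIR=${{ runner.tool_cache }}/uv" >> "$GITHUB_ENV"

- name: Mask secrets from logs
  run: echo "::add-mask::$EXAMPLE_TOKEN"
  env:
    EXAMPLE_TOKEN: ${{ secrets.EXAMPLE_TOKEN }}
```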
112-146: LGTM!

CUDA environment setup with dynamic version parsing and fallback PATH configuration is well-structured. The nvcc presence check with warning helps identify misconfigured runners.
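A rough sketch of that kind of check, assuming a standard /usr/local/cuda layout (not copied from the workflow):

```bash
if command -v nvcc >/dev/null 2>&1; then
  # Turn "release 12.4, V12.4.131" style output into "12.4".
  cuda_version=$(nvcc --version | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')
  echo "CUDA_VERSION=${cuda_version}" >> "$GITHUB_ENV"
else
  echo "::warning::nvcc not found on PATH; falling back to /usr/local/cuda/bin"
  echo "/usr/local/cuda/bin" >> "$GITHUB_PATH"
fi
```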
148-164: LGTM!

The uv setup properly differentiates between self-hosted and GitHub-hosted runners for caching behavior. The cache suffix includes all relevant matrix dimensions.
166-187: LGTM!

Clean separation between PR and baseline environments. The `git clean` step correctly preserves the PR installation while preparing for baseline checkout.
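For context, such a setup might be sketched as follows (venv names and the excluded path are assumptions, not the workflow's actual values):

```bash
# Build and install the PR revision into its own environment.
uv venv .venv-pr
source .venv-pr/bin/activate
uv pip install -e .
deactivate

# Reset the working tree for the baseline checkout while keeping the PR venv.
git clean -xfd -e .venv-pr
git checkout main
uv venv .venv-main
```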
200-214: LGTM!

Core dump generation is appropriately limited to GitHub-hosted runners where you control the environment. Including version/toolkit in the pattern helps identify which configuration produced a crash.
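A generic example of enabling core dumps with an identifying pattern (the placeholder below omits the version/toolkit metadata the workflow reportedly embeds):

```bash
ulimit -c unlimited
# %e = executable name, %p = PID; extend the pattern with identifying metadata as needed.
sudo sysctl -w kernel.core_pattern="/tmp/core.%e.%p"
```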
236-264: LGTM!

Reading the markdown file directly (line 243) avoids escaping issues. The comment body is well-structured with attribution, workflow link, and artifact instructions.
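One simple way to post a markdown file verbatim as a PR comment, shown only as an illustration (file name and PR number variable are hypothetical):

```bash
gh pr comment "$PR_NUMBER" --body-file regression_report.md
```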
examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py (1)
233-240: LGTM: Simplified function signature.

Removing the unused `tune` parameter from `run_regression_perf()` simplifies the API. Verified that no callers pass this parameter to the function—the regression test suite calls it without `tune`, and no other imports were found using it.

examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py (1)
342-344: LGTM: Clear signature with explicit type hints.

The updated signature with typed parameters and sensible defaults improves clarity for regression testing.
Diff excerpt under review:

```diff
     O = attention(Q, K, V, causal, groups)

     def run1():
         O.backward(dO, retain_graph=True)

-    def run_kernel_only():
-        kernel(Q, K, V, dO, lse, Delta, dQ, dK, dV)
     from tilelang.profiler import do_bench

-    return do_bench(run_kernel_only, warmup=10, rep=100, backend="cupti")
+    return do_bench(run1, warmup=500, backend="cupti")
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's find the do_bench implementation
echo "=== Searching for do_bench implementation ==="
find . -name "*.py" -type f | xargs grep -l "def do_bench" | head -5
# Search specifically in tilelang/profiler
find . -path "*tilelang/profiler*" -name "*.py" -type f
# Show the do_bench implementation
echo -e "\n=== do_bench implementation ==="
rg -A 20 "def do_bench" --type py

Repository: tile-ai/tilelang
Length of output: 9418
🏁 Script executed:
#!/bin/bash
# Check the actual file at the specified lines
echo "=== Checking example_gqa_bwd_wgmma_pipelined.py context ==="
wc -l examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py

Repository: tile-ai/tilelang
Length of output: 183
🏁 Script executed:
#!/bin/bash
# Get the context around lines 334 and 358 in the target file
echo "=== Context around line 334 (main function) ==="
sed -n '330,340p' examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py
echo -e "\n=== Context around line 358 (do_bench call) ==="
sed -n '350,360p' examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py
# Also check the full return statement and benchmark section
echo -e "\n=== Full benchmarking section context ==="
sed -n '345,360p' examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py

Repository: tile-ai/tilelang
Length of output: 1594
🏁 Script executed:
#!/bin/bash
# Search for backend parameter usage patterns in flash_attention examples
echo "=== Backend usage in flash_attention examples ==="
rg -n -B 2 -A 2 'backend\s*=' examples/flash_attention/ --type py

Repository: tile-ai/tilelang
Length of output: 6228
🏁 Script executed:
#!/bin/bash
# Get the full do_bench implementation from bench.py
echo "=== Full do_bench implementation ==="
sed -n '1,150p' tilelang/profiler/bench.py | head -100
# Also check if there's any gradient-clearing logic
echo -e "\n=== Checking for grad clearing logic in bench.py ==="
rg -n "grad|zero" tilelang/profiler/bench.pyRepository: tile-ai/tilelang
Length of output: 3891
🏁 Script executed:
#!/bin/bash
# Let's see the entire bench.py file to understand gradient handling
echo "=== bench.py file info ==="
wc -l tilelang/profiler/bench.py
echo -e "\n=== Full bench.py content ==="
cat -n tilelang/profiler/bench.py

Repository: tile-ai/tilelang
Length of output: 8776
🏁 Script executed:
#!/bin/bash
# Check the actual run function definition and how do_bench uses it
echo "=== Context around lines 333-345 in the target file ==="
sed -n '325,345p' examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py

Repository: tile-ai/tilelang
Length of output: 873
Clear gradients between backward iterations for consistent timing.
The run1() closure calls O.backward(dO, retain_graph=True) repeatedly without clearing gradients. Since do_bench() does not zero gradients between iterations, Q.grad, K.grad, and V.grad will accumulate values across warmup and benchmark iterations. This causes timing variations between the first iteration (gradient buffer allocation) and subsequent iterations, contaminating the measurement.
Add Q.grad = None; K.grad = None; V.grad = None before the backward call in run1(), or use torch.no_grad() if gradient tracking isn't needed for the benchmark.
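For example, a minimal sketch of the closure with gradient clearing (a suggestion only, reusing the variable names from the excerpt above):

```python
def run1():
    # Reset accumulated gradients so every timed iteration starts from the same state.
    Q.grad = None
    K.grad = None
    V.grad = None
    O.backward(dO, retain_graph=True)
```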
The backend="cupti" choice is intentional and consistent with other backward examples in the codebase.
🧰 Tools
🪛 Ruff (0.14.10)
351-351: Ambiguous variable name: O
(E741)
🤖 Prompt for AI Agents
In examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py around lines 351
to 358 the benchmark closure repeatedly calls O.backward(dO, retain_graph=True)
without clearing gradients, causing Q.grad/K.grad/V.grad to accumulate and skew
timing; before the backward call, clear gradients by setting Q.grad = None;
K.grad = None; V.grad = None (or wrap the benchmark in torch.no_grad() if
appropriate) so each iteration starts with fresh gradient buffers and produces
consistent timing measurements.
Diff excerpt under review:

```diff
     profiler = kernel.get_profiler(tensor_supply_type=tilelang.TensorSupplyType.Normal)
-    return profiler.do_bench(warmup=500, backend="cupti")
+    return profiler.do_bench(backend="cupti")
```
🧩 Analysis chain
🏁 Script executed:
# First, let's read the file to see the code around lines 216-244
cat -n examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py | sed -n '210,250p'

Repository: tile-ai/tilelang
Length of output: 2204
🏁 Script executed:
# Find the do_bench method definition
rg -n "def do_bench" --type=py -B 2 -A 10Repository: tile-ai/tilelang
Length of output: 5613
🏁 Script executed:
# Search for CUPTI references in GitHub Actions workflows and CI configuration
fd -e yml -e yaml . .github/workflows/ 2>/dev/null --exec rg -l 'cupti|CUPTI' \;
# Also check for any CI configuration files
fd -e yml -e yaml -e toml -e cfg . -maxdepth 3 2>/dev/null --exec rg -l 'cupti|CUPTI' \;

Repository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
# Let's check what CI workflows exist
fd -e yml -e yaml . .github/workflows/ 2>/dev/null
# Also search for any CUPTI references in the entire codebase
rg -i 'cupti' --type=py --type=yml

Repository: tile-ai/tilelang
Length of output: 274
🏁 Script executed:
# Search for CUPTI references in CI workflows without type restriction
rg -i 'cupti' .github/workflows/
# Also check the profiler backend configuration to understand CUPTI usage
rg -i 'cupti' examples/flash_attention/
# Check if there are any requirements or setup for CUPTI
rg -i 'cupti\|profiler' .github/workflows/pr-regression-test-bot.yml

Repository: tile-ai/tilelang
Length of output: 1570
🏁 Script executed:
# Check the pr-regression-test-bot.yml workflow to see what environment is used
cat .github/workflows/pr-regression-test-bot.yml
# Also check ci.yml for any CUPTI setup
cat .github/workflows/ci.yml

Repository: tile-ai/tilelang
Length of output: 30049
🏁 Script executed:
# Let's check the regression test script to understand how it's used
cat maint/scripts/test_perf_regression.py | head -100
# Also check for similar patterns in other flash attention files
rg -n "warmup=" examples/flash_attention/ -A 1 -B 1 | head -60Repository: tile-ai/tilelang
Length of output: 8040
🏁 Script executed:
# Check if there are environment considerations or failures related to CUPTI in tests
rg -n "CUPTI|cupti|backend.*cupti" testing/ --type py -B 2 -A 2
# Let's also check similar regression tests to understand the pattern better
grep -r "def run_regression_perf" examples/flash_attention/ -A 10Repository: tile-ai/tilelang
Length of output: 14240
Add warmup parameter to match main() profiling for consistent regression testing.
The run_regression_perf() invocation at line 244 omits the warmup parameter, which defaults to 25ms, while the main() function uses warmup=500 (lines 216, 219). This inconsistency degrades measurement reliability; flash attention kernels require adequate warmup to reach steady state and exclude JIT compilation overhead. Using insufficient warmup in regression testing produces unstable performance numbers, making reliable performance comparisons impossible.
Similar functions in the codebase (e.g., example_mha_fwd_bshd_wgmma_pipelined.py) correctly specify warmup=500 in their regression tests.
```diff
-    return profiler.do_bench(backend="cupti")
+    return profiler.do_bench(warmup=500, backend="cupti")
```

🤖 Prompt for AI Agents
In examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py around line
244, the run_regression_perf() call omits the warmup parameter causing
inconsistent and unstable profiling; update the call to pass warmup=500 (match
main() lines ~216–219) so the regression benchmark uses the same warmup duration
as main(), ensuring sufficient warmup for flash-attention kernels and consistent
measurements.
Diff excerpt under review:

```python
def run_regression_perf():
    M, N, K = 128, 128, 128
    out_dtype, accum_dtype = "float32", "float32"
```
Critical: Inconsistent dtype specification causes assertion failure.
Line 229 uses string literals "float32" for out_dtype and accum_dtype, but tl_matmul expects type constants. When these strings are passed to tl_matmul at lines 231 and 235, they will fail the assertion at lines 47-51, which checks out_dtype in [T.float16, T.float32, T.int32]. The string "float32" is not equal to the type constant T.float32.
🔎 Proposed fix
```diff
-    out_dtype, accum_dtype = "float32", "float32"
+    out_dtype, accum_dtype = T.float32, T.float32
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-    out_dtype, accum_dtype = "float32", "float32"
+    out_dtype, accum_dtype = T.float32, T.float32
```
🤖 Prompt for AI Agents
In examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py around line 229 (and
the subsequent tl_matmul calls at ~231 and ~235), out_dtype and accum_dtype are
set to the string "float32" but tl_matmul expects type constants (e.g.,
T.float32). Replace the string literals with the proper type constants (for
example: out_dtype = T.float32, accum_dtype = T.float32) so the values passed
into tl_matmul match the assertion checks; ensure any other places that used
those variables expect the same T.* constants.
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Refactoring