
Conversation

@xwhzz
Contributor

@xwhzz xwhzz commented Dec 24, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced PR regression testing workflow with improved multi-runner support and dynamic caching configuration.
    • Improved performance regression analysis visualization with detailed results reporting.
  • Bug Fixes

    • Enhanced error handling for performance metric collection with automatic retry mechanism.
  • Refactoring

    • Updated example benchmarks and test configurations for improved performance analysis.


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Dec 24, 2025

📝 Walkthrough

Walkthrough

Comprehensive refactor of regression testing workflow and tooling: GitHub Actions workflow now supports dynamic matrix-based runners with environment isolation and CUDA configuration, examples updated with simplified benchmarking signatures and kernel parameters, and test infrastructure enhanced with retry logic and Matplotlib-based result visualization.

Changes

Cohort / File(s): Summary

  • CI/CD Workflow: .github/workflows/pr-regression-test-bot.yml
    Refactored with a matrix-based multi-runner strategy, permission checks, self-hosted and GitHub-hosted environment setup (cache dirs, ccache, CUDA), unified uv Python setup, isolated Python environments, baseline/cleanup logic, core-dump configuration, and extended post-test steps with artifact upload and PR commenting.
  • Flash Attention Examples: examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py, examples/flash_attention/example_gqa_fwd_bshd.py, examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py, examples/flash_attention/example_mha_fwd_varlen.py
    Updated run_regression_perf signatures with explicit typed parameters, simplified benchmarking flows (a single autograd-based path in bwd), removed kernel-only benchmark paths, and changed kernel configuration parameters (block_M/N, stages, threads for varlen).
  • GEMM Examples: examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py, examples/gemm_streamk/regression_example_tilelang_gemm_splitk.py
    The FP8 example now uses tilelang tensor type constants instead of string literals; the streamk regression example was removed entirely.
  • Dynamic Shape Example: examples/dynamic_shape/example_dynamic.py
    Removed JIT pass configurations; now uses plain @tilelang.jit with defaults.
  • Test Infrastructure: tilelang/testing/perf_regression.py
    Added a retry mechanism (_MAX_RETRY_NUM = 5) for non-positive latency collection, introduced structured warnings via the warnings module, and changed the process_func return type from float to None (see the sketch after this list).
  • Performance Regression Visualization: maint/scripts/test_perf_regression.py
    Enhanced the draw() function with type annotations, input validation, and comprehensive Matplotlib-based visualization (horizontal bar charts, color coding, annotations, summary box); replaced seaborn with pure Matplotlib styling.
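As a rough illustration of the retry behavior described for tilelang/testing/perf_regression.py, here is a minimal sketch. The helper collect_latency is a hypothetical stand-in for the real profiling call; only _MAX_RETRY_NUM, the warning style, and the None return type come from the summary above.

import warnings

_MAX_RETRY_NUM = 5

def process_func(result_name: str, collect_latency) -> None:
    # Re-profile up to _MAX_RETRY_NUM times when the measured latency is non-positive.
    latency = collect_latency()
    retries = 0
    while latency <= 0.0 and retries < _MAX_RETRY_NUM:
        latency = collect_latency()
        retries += 1
    if latency <= 0.0:
        # Emit a structured warning instead of silently recording a bogus number.
        warnings.warn(
            f"{result_name} has latency {latency} <= 0. Please verify the profiling results.",
            RuntimeWarning,
            stacklevel=2,
        )
        return
    print(f"{result_name}: {latency:.3f} ms")  # record the valid measurement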

Sequence Diagram(s)

sequenceDiagram
    participant GH as GitHub Actions
    participant Auth as Permission Check
    participant Setup as Environment Setup
    participant PR as PR Environment
    participant Main as Main/Baseline Environment
    participant Test as Regression Tests
    participant Report as Result Reporter
    
    GH->>Auth: Verify user authorization
    alt Unauthorized
        Auth-->>GH: Reject workflow
    else Authorized
        Auth-->>GH: Permit continuation
        par Self-Hosted Setup
            Setup->>Setup: Mask secrets
            Setup->>Setup: Configure cache dirs<br/>(XDG, PIP, UV, etc.)
            Setup->>Setup: Disable ccache
        and GitHub-Hosted Setup
            Setup->>Setup: Setup ccache
            Setup->>Setup: Enable CUDA config
            Setup->>Setup: Configure core dumps
        end
        
        Setup->>PR: Create PR environment<br/>(uv venv)
        Setup->>Main: Create main environment<br/>(uv venv)
        
        PR->>Test: Activate PR env
        Test->>Test: Run regression tests
        
        Main->>Test: Activate main env
        Test->>Test: Run baseline tests
        
        Test->>Report: Collect latency<br/>(with retry logic)
        alt Latency ≤ 0 after retries
            Report->>Report: Emit warning
        else Valid latency
            Report->>Report: Record result
        end
        
        Report->>Report: Generate visualization
        Report->>GH: Upload artifacts
        Report->>GH: Post PR comment
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • LeiWang1999

🐰 The workflows now dance with runners galore,
Self-hosted and GitHub, each one with its store,
Environments birthed, like carrots so bright,
Retry the tests when latency's not quite right,
With charts painted fresh, the results take their flight! 🎨✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 8.33%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately reflects the main changes: GitHub Actions workflow improvements (permissions check) and performance regression testing refinements across multiple example files and test utilities.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/pr-regression-test-bot.yml (1)

8-11: Add actions: write permission to enable cache saving.

The permissions block only sets contents: read, which implicitly sets unspecified permissions (like actions) to none. This prevents actions/cache and setup-uv's caching features from saving caches—they will restore but not save.

Based on learnings, the actions: write permission is required for cache operations in GitHub Actions.

Proposed fix
 permissions:
   contents: read
+  actions: write
   issues: write
   pull-requests: write
🧹 Nitpick comments (5)
examples/flash_attention/example_mha_fwd_varlen.py (1)

258-259: Consider aligning comment with code or explaining the discrepancy.

The comment recommends (128, 128, 2 or 3, 256) for Hopper, but the code uses (64, 64, 1, 128). If the main function intentionally uses more conservative parameters for broader GPU compatibility, consider updating the comment to distinguish them from the Hopper-optimized parameters now used in run_regression_perf.

tilelang/testing/perf_regression.py (1)

64-81: Consider adjusting stacklevel in warnings.warn.

The stacklevel=1 points to the warnings.warn call itself. To point to the caller of process_func (more useful for debugging), use stacklevel=2.

Proposed fix
     if latency <= 0.0:
-        warnings.warn(f"{result_name} has latency {latency} <= 0. Please verify the profiling results.", RuntimeWarning, 1)
+        warnings.warn(f"{result_name} has latency {latency} <= 0. Please verify the profiling results.", RuntimeWarning, stacklevel=2)
         return
maint/scripts/test_perf_regression.py (2)

149-150: Consider using .loc[] instead of .iloc[] for clarity.

After reset_index(drop=True), idxmax()/idxmin() return index labels (which happen to be positional integers 0..n-1). Using .iloc[] works here because the labels equal positions, but .loc[] is semantically correct when working with index labels.

Proposed fix
-    best = df.iloc[df["Speedup"].idxmax()]
-    worst = df.iloc[df["Speedup"].idxmin()]
+    best = df.loc[df["Speedup"].idxmax()]
+    worst = df.loc[df["Speedup"].idxmin()]

73-82: Consider scoping rcParams changes to avoid global side effects.

Modifying plt.rcParams globally affects all subsequent plots in the same process. Consider using a context manager.

Proposed fix using context manager
-    plt.rcParams.update(
-        {
-            "figure.dpi": 120,
-            "savefig.dpi": 300,
-            "axes.titlesize": 16,
-            "axes.labelsize": 12,
-            "xtick.labelsize": 10,
-            "ytick.labelsize": 10,
-        }
-    )
+    with plt.rc_context({
+        "figure.dpi": 120,
+        "savefig.dpi": 300,
+        "axes.titlesize": 16,
+        "axes.labelsize": 12,
+        "xtick.labelsize": 10,
+        "ytick.labelsize": 10,
+    }):
+        # ... rest of the function body indented here

Note: This would require restructuring the function to place remaining code inside the with block.

examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py (1)

345-350: Optional: dO.requires_grad_() is unnecessary.

Line 350 sets requires_grad_() on dO, but upstream gradients passed to .backward() don't need gradients themselves. This is harmless but adds unnecessary overhead.

🔎 Proposed simplification
-    dO = torch.empty(BATCH, N_CTX, H, D_HEAD_V, dtype=torch.half, device="cuda").normal_().requires_grad_()
+    dO = torch.empty(BATCH, N_CTX, H, D_HEAD_V, dtype=torch.half, device="cuda").normal_()
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between feb106b and b8dbcf8.

📒 Files selected for processing (10)
  • .github/workflows/pr-regression-test-bot.yml
  • examples/dynamic_shape/example_dynamic.py
  • examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py
  • examples/flash_attention/example_gqa_fwd_bshd.py
  • examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py
  • examples/flash_attention/example_mha_fwd_varlen.py
  • examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py
  • examples/gemm_streamk/regression_example_tilelang_gemm_splitk.py
  • maint/scripts/test_perf_regression.py
  • tilelang/testing/perf_regression.py
💤 Files with no reviewable changes (2)
  • examples/flash_attention/example_gqa_fwd_bshd.py
  • examples/gemm_streamk/regression_example_tilelang_gemm_splitk.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cache@v4 and setup-python's cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, the unspecified actions permission becomes none, so caches will restore but not save.
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • examples/dynamic_shape/example_dynamic.py
🧬 Code graph analysis (3)
examples/flash_attention/example_mha_fwd_varlen.py (2)
examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py (1)
  • flashattn (32-170)
examples/flash_attention/example_mha_fwd_bshd_wgmma_pipelined.py (1)
  • flashattn (23-149)
examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py (1)
tilelang/profiler/__init__.py (1)
  • do_bench (209-269)
examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py (5)
examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py (1)
  • run_regression_perf (233-244)
examples/flash_attention/example_mha_bwd_bshd_wgmma_pipelined.py (3)
  • run_regression_perf (323-352)
  • run1 (312-313)
  • backward (235-256)
examples/flash_attention/example_mha_bwd_bshd.py (3)
  • run_regression_perf (346-374)
  • run1 (333-334)
  • backward (256-278)
examples/flash_attention/example_gqa_bwd_tma_reduce_varlen.py (2)
  • run1 (681-682)
  • backward (531-602)
examples/flash_attention/example_gqa_bwd_tma_reduce.py (2)
  • run1 (497-498)
  • backward (379-426)
🪛 Ruff (0.14.10)
examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py

351-351: Ambiguous variable name: O

(E741)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (16)
examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py (1)

230-230: LGTM! Correct migration to type constants.

The changes properly replace string literals with type constants (T.float8_e4m3fn and T.float8_e5m2), which aligns with the tl_matmul function signature and its dtype assertions at lines 41-46.

Also applies to: 234-234

examples/dynamic_shape/example_dynamic.py (1)

7-7: Decorator simplification is safe—configurations removed are not used elsewhere in the codebase.

The two pass configurations (tl.disable_dynamic_tail_split and tl.dynamic_alignment) removed from the decorator were unique to this file and do not appear in any other example, test, or kernel in the repository. The file retains its correctness validation (torch.testing.assert_close at line 90) and performance measurement capability (run_regression_perf at lines 108–128). The simplification aligns with the PR's goal of refining performance regression testing.

examples/flash_attention/example_mha_fwd_varlen.py (1)

338-338: Add runtime verification that the regression-test GPU is a Hopper architecture.

The parameters block_M=128, block_N=128, num_stages=2, threads=256 are documented as Hopper-recommended (line 259), but the regression test applies them unconditionally without verifying the GPU architecture. Other examples in this codebase (e.g., example_convolution.py, example_conv_analyze.py) implement check_hopper() functions before using architecture-specific parameters. Ensure a similar check is added here to guarantee valid performance baselines.
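A guard along the lines the review suggests might look like the sketch below; this check_hopper is a hypothetical compute-capability test, not necessarily how example_convolution.py implements its helper.

import torch

def check_hopper() -> bool:
    # Hopper GPUs (e.g., H100) report compute capability 9.x.
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major == 9

run_regression_perf could then skip, or fall back to the conservative (64, 64, 1, 128) configuration, when check_hopper() returns False.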

tilelang/testing/perf_regression.py (2)

12-12: LGTM!

Adding warnings import to support structured warning emission is appropriate.


35-35: LGTM!

Module-level constant for retry limit is a clean approach.

maint/scripts/test_perf_regression.py (2)

8-9: LGTM!

Adding numpy and textwrap imports for enhanced visualization logic is appropriate.


56-57: LGTM!

Good defensive check for None or empty DataFrame before proceeding with visualization.
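For illustration only, a minimal sketch of this defensive pattern: the Name column and all plotting details are assumptions here, with only the Speedup column and the empty-DataFrame guard grounded in the script and review above.

import pandas as pd
import matplotlib.pyplot as plt

def draw(df, output_path: str) -> None:
    # Bail out early when there is nothing to visualize.
    if df is None or df.empty:
        return
    # Horizontal bars, color-coded by whether the change is a speedup.
    colors = ["tab:green" if s >= 1.0 else "tab:red" for s in df["Speedup"]]
    fig, ax = plt.subplots(figsize=(8, 0.5 * len(df) + 2))
    ax.barh(df["Name"], df["Speedup"], color=colors)
    ax.axvline(1.0, color="gray", linestyle="--", linewidth=1)
    ax.set_xlabel("Speedup (PR / baseline)")
    fig.tight_layout()
    fig.savefig(output_path)

draw(pd.DataFrame({"Name": ["gemm_fp8"], "Speedup": [1.12]}), "regression.png")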

.github/workflows/pr-regression-test-bot.yml (7)

53-67: Good security practice with permission gating.

Checking collaborator permissions before running expensive operations prevents abuse of the @regression-perf trigger by unauthorized users.


76-94: LGTM!

Good approach to use runner.tool_cache for self-hosted runners to share caches between jobs and avoid repeated downloads. Secret masking adds a security layer.


112-146: LGTM!

CUDA environment setup with dynamic version parsing and fallback PATH configuration is well-structured. The nvcc presence check with warning helps identify misconfigured runners.


148-164: LGTM!

The uv setup properly differentiates between self-hosted and GitHub-hosted runners for caching behavior. Cache suffix includes all relevant matrix dimensions.


166-187: LGTM!

Clean separation between PR and baseline environments. The git clean step correctly preserves the PR installation while preparing for baseline checkout.


200-214: LGTM!

Core dump generation is appropriately limited to GitHub-hosted runners where you control the environment. Including version/toolkit in the pattern helps identify which configuration produced a crash.


236-264: LGTM!

Reading the markdown file directly (line 243) avoids escaping issues. The comment body is well-structured with attribution, workflow link, and artifact instructions.

examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py (1)

233-240: LGTM: Simplified function signature.

Removing the unused tune parameter from run_regression_perf() simplifies the API. Verified that no callers pass this parameter to the function—the regression test suite calls it without tune, and no other imports were found using it.

examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py (1)

342-344: LGTM: Clear signature with explicit type hints.

The updated signature with typed parameters and sensible defaults improves clarity for regression testing.

Comment on lines +351 to +358
     O = attention(Q, K, V, causal, groups)

     def run1():
         O.backward(dO, retain_graph=True)

-    def run_kernel_only():
-        kernel(Q, K, V, dO, lse, Delta, dQ, dK, dV)
     from tilelang.profiler import do_bench

-    return do_bench(run_kernel_only, warmup=10, rep=100, backend="cupti")
     return do_bench(run1, warmup=500, backend="cupti")

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# First, let's find the do_bench implementation
echo "=== Searching for do_bench implementation ==="
find . -name "*.py" -type f | xargs grep -l "def do_bench" | head -5

# Search specifically in tilelang/profiler
find . -path "*tilelang/profiler*" -name "*.py" -type f

# Show the do_bench implementation
echo -e "\n=== do_bench implementation ==="
rg -A 20 "def do_bench" --type py

Repository: tile-ai/tilelang

Length of output: 9418


🏁 Script executed:

#!/bin/bash

# Check the actual file at the specified lines
echo "=== Checking example_gqa_bwd_wgmma_pipelined.py context ==="
wc -l examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py

Repository: tile-ai/tilelang

Length of output: 183


🏁 Script executed:

#!/bin/bash

# Get the context around lines 334 and 358 in the target file
echo "=== Context around line 334 (main function) ==="
sed -n '330,340p' examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py

echo -e "\n=== Context around line 358 (do_bench call) ==="
sed -n '350,360p' examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py

# Also check the full return statement and benchmark section
echo -e "\n=== Full benchmarking section context ==="
sed -n '345,360p' examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py

Repository: tile-ai/tilelang

Length of output: 1594


🏁 Script executed:

#!/bin/bash

# Search for backend parameter usage patterns in flash_attention examples
echo "=== Backend usage in flash_attention examples ==="
rg -n -B 2 -A 2 'backend\s*=' examples/flash_attention/ --type py

Repository: tile-ai/tilelang

Length of output: 6228


🏁 Script executed:

#!/bin/bash

# Get the full do_bench implementation from bench.py
echo "=== Full do_bench implementation ==="
sed -n '1,150p' tilelang/profiler/bench.py | head -100

# Also check if there's any gradient-clearing logic
echo -e "\n=== Checking for grad clearing logic in bench.py ==="
rg -n "grad|zero" tilelang/profiler/bench.py

Repository: tile-ai/tilelang

Length of output: 3891


🏁 Script executed:

#!/bin/bash

# Let's see the entire bench.py file to understand gradient handling
echo "=== bench.py file info ==="
wc -l tilelang/profiler/bench.py

echo -e "\n=== Full bench.py content ==="
cat -n tilelang/profiler/bench.py

Repository: tile-ai/tilelang

Length of output: 8776


🏁 Script executed:

#!/bin/bash

# Check the actual run function definition and how do_bench uses it
echo "=== Context around lines 333-345 in the target file ==="
sed -n '325,345p' examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py

Repository: tile-ai/tilelang

Length of output: 873


Clear gradients between backward iterations for consistent timing.

The run1() closure calls O.backward(dO, retain_graph=True) repeatedly without clearing gradients. Since do_bench() does not zero gradients between iterations, Q.grad, K.grad, and V.grad will accumulate values across warmup and benchmark iterations. This causes timing variations between the first iteration (gradient buffer allocation) and subsequent iterations, contaminating the measurement.

Add Q.grad = None; K.grad = None; V.grad = None before the backward call in run1(), or use torch.no_grad() if gradient tracking isn't needed for the benchmark.

The backend="cupti" choice is intentional and consistent with other backward examples in the codebase.

🧰 Tools
🪛 Ruff (0.14.10)

351-351: Ambiguous variable name: O

(E741)

🤖 Prompt for AI Agents
In examples/flash_attention/example_gqa_bwd_wgmma_pipelined.py around lines 351
to 358 the benchmark closure repeatedly calls O.backward(dO, retain_graph=True)
without clearing gradients, causing Q.grad/K.grad/V.grad to accumulate and skew
timing; before the backward call, clear gradients by setting Q.grad = None;
K.grad = None; V.grad = None (or wrap the benchmark in torch.no_grad() if
appropriate) so each iteration starts with fresh gradient buffers and produces
consistent timing measurements.


     profiler = kernel.get_profiler(tensor_supply_type=tilelang.TensorSupplyType.Normal)
-    return profiler.do_bench(warmup=500, backend="cupti")
+    return profiler.do_bench(backend="cupti")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's read the file to see the code around lines 216-244
cat -n examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py | sed -n '210,250p'

Repository: tile-ai/tilelang

Length of output: 2204


🏁 Script executed:

# Find the do_bench method definition
rg -n "def do_bench" --type=py -B 2 -A 10

Repository: tile-ai/tilelang

Length of output: 5613


🏁 Script executed:

# Search for CUPTI references in GitHub Actions workflows and CI configuration
fd -e yml -e yaml . .github/workflows/ 2>/dev/null --exec rg -l 'cupti|CUPTI' \;

# Also check for any CI configuration files
fd -e yml -e yaml -e toml -e cfg . -maxdepth 3 2>/dev/null --exec rg -l 'cupti|CUPTI' \;

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Let's check what CI workflows exist
fd -e yml -e yaml . .github/workflows/ 2>/dev/null

# Also search for any CUPTI references in the entire codebase
rg -i 'cupti' --type=py --type=yml

Repository: tile-ai/tilelang

Length of output: 274


🏁 Script executed:

# Search for CUPTI references in CI workflows without type restriction
rg -i 'cupti' .github/workflows/

# Also check the profiler backend configuration to understand CUPTI usage
rg -i 'cupti' examples/flash_attention/

# Check if there are any requirements or setup for CUPTI
rg -i 'cupti\|profiler' .github/workflows/pr-regression-test-bot.yml

Repository: tile-ai/tilelang

Length of output: 1570


🏁 Script executed:

# Check the pr-regression-test-bot.yml workflow to see what environment is used
cat .github/workflows/pr-regression-test-bot.yml

# Also check ci.yml for any CUPTI setup
cat .github/workflows/ci.yml

Repository: tile-ai/tilelang

Length of output: 30049


🏁 Script executed:

# Let's check the regression test script to understand how it's used
cat maint/scripts/test_perf_regression.py | head -100

# Also check for similar patterns in other flash attention files
rg -n "warmup=" examples/flash_attention/ -A 1 -B 1 | head -60

Repository: tile-ai/tilelang

Length of output: 8040


🏁 Script executed:

# Check if there are environment considerations or failures related to CUPTI in tests
rg -n "CUPTI|cupti|backend.*cupti" testing/ --type py -B 2 -A 2

# Let's also check similar regression tests to understand the pattern better
grep -r "def run_regression_perf" examples/flash_attention/ -A 10

Repository: tile-ai/tilelang

Length of output: 14240


Add warmup parameter to match main() profiling for consistent regression testing.

The run_regression_perf() invocation at line 244 omits the warmup parameter, which defaults to 25ms, while the main() function uses warmup=500 (lines 216, 219). This inconsistency degrades measurement reliability; flash attention kernels require adequate warmup to reach steady state and exclude JIT compilation overhead. Using insufficient warmup in regression testing produces unstable performance numbers, making reliable performance comparisons impossible.

Similar functions in the codebase (e.g., example_mha_fwd_bshd_wgmma_pipelined.py) correctly specify warmup=500 in their regression tests.

-    return profiler.do_bench(backend="cupti")
+    return profiler.do_bench(warmup=500, backend="cupti")
🤖 Prompt for AI Agents
In examples/flash_attention/example_gqa_fwd_bshd_wgmma_pipelined.py around line
244, the run_regression_perf() call omits the warmup parameter causing
inconsistent and unstable profiling; update the call to pass warmup=500 (match
main() lines ~216–219) so the regression benchmark uses the same warmup duration
as main(), ensuring sufficient warmup for flash-attention kernels and consistent
measurements.


 def run_regression_perf():
     M, N, K = 128, 128, 128
     out_dtype, accum_dtype = "float32", "float32"

⚠️ Potential issue | 🔴 Critical

Critical: Inconsistent dtype specification causes assertion failure.

Line 229 uses string literals "float32" for out_dtype and accum_dtype, but tl_matmul expects type constants. When these strings are passed to tl_matmul at lines 231 and 235, they will fail the assertion at lines 47-51, which checks out_dtype in [T.float16, T.float32, T.int32]. The string "float32" is not equal to the type constant T.float32.

🔎 Proposed fix
-    out_dtype, accum_dtype = "float32", "float32"
+    out_dtype, accum_dtype = T.float32, T.float32
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
out_dtype, accum_dtype = "float32", "float32"
out_dtype, accum_dtype = T.float32, T.float32
🤖 Prompt for AI Agents
In examples/gemm_fp8/example_tilelang_gemm_fp8_intrinsic.py around line 229 (and
the subsequent tl_matmul calls at ~231 and ~235), out_dtype and accum_dtype are
set to the string "float32" but tl_matmul expects type constants (e.g.,
T.float32). Replace the string literals with the proper type constants (for
example: out_dtype = T.float32, accum_dtype = T.float32) so the values passed
into tl_matmul match the assertion checks; ensure any other places that used
those variables expect the same T.* constants.

@LeiWang1999 LeiWang1999 merged commit d7e264f into tile-ai:main Dec 24, 2025
6 of 7 checks passed
