
Conversation

@LeiWang1999 (Member) commented Dec 24, 2025

as title.

Summary by CodeRabbit

  • Refactor

    • Improved buffer allocation tracking and management across pipeline operations, including nested allocations and explicit local allocation tracking.
    • Pipeline processing now skips unused buffers and applies remappings back to outer scopes for safer, more efficient memory handling.
    • Added validation to prevent invalid multi-version buffer conflicts across pipelines.
  • Tests

    • Updated example/test configuration to use 0 stages (was 1), affecting compilation-stage behavior in test kernels.


* Added BufferUsageCollector class to identify and collect buffers used in pipeline loop bodies, enabling proper multi-versioning for software pipelining.
* Updated PipelineRewriter to handle local and outer block buffer allocations more effectively, ensuring that only necessary buffers are included in the pipeline.
* Enhanced buffer remapping logic to prevent conflicts when buffers from outer blocks are used in multiple pipeline loops.

This update improves the efficiency and correctness of buffer management during software pipelining.
* Adjusted formatting of buffer allocation declarations for improved readability.
* Ensured consistent style in the codebase by aligning variable declarations.

This change enhances code clarity without altering functionality.
@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai bot (Contributor) commented Dec 24, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Introduces BufferUsageCollector to identify buffers actually used in a statement and extends PipelineRewriter to accept local_allocs, enabling correct remapping/allocation of buffers across pipeline and outer blocks; adds conflict detection and propagation of remappings to outer block alloc_buffers.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Pipeline injection & buffer tracking (src/transform/inject_pipeline.cc) | Adds BufferUsageCollector (collects buffers used in a Stmt, respecting nested blocks and allocation guards). Extends the PipelineRewriter constructor to accept local_allocs and exposes GetBufferRemap(). PipelineRewriter now skips non-accessed buffers when computing versions. PipelineInjector now collects pipeline_allocs, tracks local vs. nested allocations, checks for multi-pipeline buffer conflicts, and applies pending remappings into outer block alloc_buffers during block visits. Also updates internal storage for buffers_used_in_pipeline and allocation tracking. |
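For orientation, here is a minimal sketch of what such a collector might look like, assuming TVM's StmtExprVisitor API. The class name follows the walkthrough; the method bodies and the handling of nested allocations are assumptions, not the actual source.

```cpp
#include <tvm/tir/stmt_functor.h>

#include <unordered_set>

using namespace tvm;
using namespace tvm::tir;

// Sketch only: collect every buffer that a statement reads or writes,
// plus buffers allocated by nested blocks.
class BufferUsageCollector : public StmtExprVisitor {
 public:
  // Entry point: returns the set of buffers used anywhere in `stmt`.
  static std::unordered_set<Buffer, ObjectPtrHash, ObjectPtrEqual> Collect(const Stmt& stmt) {
    BufferUsageCollector collector;
    collector.VisitStmt(stmt);
    return std::move(collector.used_buffers_);
  }

 private:
  void VisitExpr_(const BufferLoadNode* op) final {
    used_buffers_.insert(op->buffer);  // buffer is read
    StmtExprVisitor::VisitExpr_(op);
  }

  void VisitStmt_(const BufferStoreNode* op) final {
    used_buffers_.insert(op->buffer);  // buffer is written
    StmtExprVisitor::VisitStmt_(op);
  }

  void VisitStmt_(const BlockNode* op) final {
    // Buffers allocated by nested blocks are recorded as well.
    for (const Buffer& buffer : op->alloc_buffers) {
      used_buffers_.insert(buffer);
    }
    StmtExprVisitor::VisitStmt_(op);
  }

  std::unordered_set<Buffer, ObjectPtrHash, ObjectPtrEqual> used_buffers_;
};
```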

Sequence Diagram

```mermaid
sequenceDiagram
    participant PI as PipelineInjector
    participant BUC as BufferUsageCollector
    participant PR as PipelineRewriter
    participant OB as OuterBlock (alloc_buffers)

    PI->>BUC: Collect(stmt)                     %% gather used buffers
    BUC-->>PI: pipeline_allocs

    PI->>PI: Build local_allocs (track nested vs outer)
    PI->>PI: Check conflicts across pipelines

    PI->>PR: Construct(pipeline_allocs, local_allocs, ...)
    PR->>PR: Compute versions (skip non-accessed buffers)
    PR-->>PI: buffer_remap

    PI->>OB: Apply buffer_remap to alloc_buffers
    OB->>OB: Update alloc_buffers entries
    OB->>OB: Erase old mappings
    OB-->>PI: Remappings applied
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hop through code with careful paws,
I count the buffers, follow laws,
Local nests and outer beds,
I patch their names and mend their threads,
A tiny rabbit, fixing flaws.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29%, below the required threshold of 80.00%. | Run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: refactoring buffer allocation in the Inject Pipeline Pass, which aligns with the detailed changes shown in the summary involving BufferUsageCollector, PipelineRewriter enhancements, and buffer allocation logic. |

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3d4c33 and d38c8d8.

📒 Files selected for processing (1)
  • examples/gdn/test_example_gdn_compilation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (1)
examples/gdn/test_example_gdn_compilation.py (1)

23-23: This file was newly created with num_stages = 0, not changed from an existing value.

The GDN test file is new and uses num_stages = 0 by design, consistent with all other GDN example kernels (wy_fast, chunk_o, chunk_delta_h, etc.). This appears intentional rather than a problematic configuration change. The systematic use of num_stages = 0 across the GDN suite may relate to known pipelining constraints (see test_tilelang_issue_96.py which notes "changing num_stages to 0 gives correct results").

Likely an incorrect or invalid review comment.



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/transform/inject_pipeline.cc (2)

79-85: Inconsistent buffer addition for nested block allocations.

In VisitStmt_(BlockNode*), buffers from op->alloc_buffers are added directly to used_buffers_ without checking allocated_buffers_, unlike AddBuffer() which filters through allocated_buffers_. This appears intentional since nested block buffers need multi-versioning regardless, but please verify this asymmetry is correct.

If this is intentional, consider adding a brief comment explaining why nested block allocations bypass the allocated_buffers_ check.
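To make the asymmetry concrete, here is a hypothetical reconstruction of the two paths, including the kind of comment being requested. The method and member names are taken from the review text; the bodies are guesses, not the file's actual code.

```cpp
// Hypothetical fragment of the collector, not the actual source.
void AddBuffer(const Buffer& buffer) {
  // Filter through allocated_buffers_ so function I/O buffers are skipped.
  if (allocated_buffers_.count(buffer)) {
    used_buffers_.insert(buffer);
  }
}

void VisitStmt_(const BlockNode* op) final {
  // Intentional asymmetry (the suggested comment): buffers allocated by
  // nested blocks need multi-versioning regardless, so they bypass the
  // allocated_buffers_ filter and are recorded directly.
  for (const Buffer& buffer : op->alloc_buffers) {
    used_buffers_.insert(buffer);
  }
  StmtExprVisitor::VisitStmt_(op);
}
```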


1148-1157: Consider extracting nested block iteration into a helper.

The pattern of iterating over pipeline_body_seq->seq to find nested blocks with SeqStmt bodies appears three times in this function (lines 1122-1133, 1148-1157, and 1216-1226). Consider extracting this into a helper function or consolidating the operations into a single pass.
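One possible shape for such a helper, using a hypothetical name and assuming the nested elements are BlockRealize nodes, as is typical for pipeline bodies:

```cpp
// Hypothetical helper: apply `fn` to every nested block in the pipeline
// body whose own body is a SeqStmt.
template <typename FVisit>
void ForEachNestedSeqBlock(const SeqStmt& pipeline_body_seq, FVisit fn) {
  for (const Stmt& stmt : pipeline_body_seq->seq) {
    if (const auto* realize = stmt.as<BlockRealizeNode>()) {
      if (realize->block->body->IsInstance<SeqStmtNode>()) {
        fn(realize->block);
      }
    }
  }
}
```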

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfccd63 and b3d4c33.

📒 Files selected for processing (1)
  • src/transform/inject_pipeline.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/inject_pipeline.cc
🧬 Code graph analysis (1)
src/transform/inject_pipeline.cc (2)
src/transform/multi_version_buffer_rewriter.cc (2)
  • buffer (255-265)
  • buffer (255-255)
src/transform/inject_tma_barrier.cc (3)
  • collector (416-416)
  • collector (477-477)
  • rewriter (96-96)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (5)
src/transform/inject_pipeline.cc (5)

293-315: LGTM!

The constructor documentation clearly explains the distinction between pipeline_allocs (all buffers needing multi-versioning) and local_allocs (buffers for re-allocation in the rewritten block). This separation properly handles outer-block buffers that should not be re-allocated locally.
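Schematically, the split might look like the following hypothetical signature; the parameter names come from the review, everything else is assumed.

```cpp
// Hypothetical sketch: pipeline_allocs = every buffer that needs
// multi-versioning; local_allocs = the subset to re-allocate inside the
// rewritten block. Outer-block buffers appear only in pipeline_allocs,
// so they are remapped but never re-allocated locally.
PipelineRewriter(Array<Buffer> pipeline_allocs, Array<Buffer> local_allocs)
    : pipeline_allocs_(std::move(pipeline_allocs)),
      local_allocs_(std::move(local_allocs)) {}
```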


410-414: LGTM!

Clean accessor method that properly exposes the buffer remapping for updating outer block allocations.


1159-1170: Good addition of conflict detection for multi-versioned buffers.

This check correctly prevents undefined behavior when the same external buffer would be multi-versioned across multiple pipeline loops. The error message is clear and actionable.
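A rough sketch of what such a guard could look like; the member name follows the PR description, and the diagnostic text is illustrative only.

```cpp
// Hypothetical conflict check: an outer-block buffer may be
// multi-versioned by at most one pipeline loop.
for (const Buffer& buffer : pipeline_allocs) {
  ICHECK(!buffers_used_in_pipeline_.count(buffer))
      << "Buffer " << buffer->name
      << " would be multi-versioned by more than one pipeline loop, "
      << "which is not supported.";
  buffers_used_in_pipeline_.insert(buffer);
}
```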


1277-1296: LGTM!

The pending buffer remap logic correctly applies remappings to outer block alloc_buffers after visiting children. Erasing from pending_buffer_remap_ after applying prevents duplicate remapping. The traversal order (children first via StmtExprMutator::VisitStmt_, then remap application) ensures pipeline rewriting in ForNode visitor happens before this remap is applied.
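As a sketch, the remap application in the BlockNode visitor could look like the following; names mirror the review text and are not guaranteed to match the source.

```cpp
// Hypothetical BlockNode visitor fragment: after children have been
// rewritten, swap any alloc_buffers that have a pending remapping, then
// drop the applied entries so they are never remapped twice.
Stmt VisitStmt_(const BlockNode* op) final {
  Block block = Downcast<Block>(StmtExprMutator::VisitStmt_(op));
  Array<Buffer> new_allocs;
  bool changed = false;
  for (const Buffer& buffer : block->alloc_buffers) {
    auto it = pending_buffer_remap_.find(buffer);
    if (it != pending_buffer_remap_.end()) {
      new_allocs.push_back(it->second);  // apply the remapping
      pending_buffer_remap_.erase(it);   // prevent duplicate remapping
      changed = true;
    } else {
      new_allocs.push_back(buffer);
    }
  }
  if (changed) {
    block.CopyOnWrite()->alloc_buffers = std::move(new_allocs);
  }
  return block;
}
```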


1325-1331: LGTM!

The new member variables are appropriately typed and serve clear purposes:

  • allocated_buffers_: enables the collector to filter function I/O buffers
  • pending_buffer_remap_: defers remap application to outer block processing
  • buffers_used_in_pipeline_: tracks external buffers for conflict detection
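Putting the three roles above into a rough declaration sketch (the container types are assumptions; the actual types in inject_pipeline.cc may differ):

```cpp
// Hypothetical member declarations matching the review's description.
std::unordered_set<Buffer, ObjectPtrHash, ObjectPtrEqual> allocated_buffers_;
std::unordered_map<Buffer, Buffer, ObjectPtrHash, ObjectPtrEqual> pending_buffer_remap_;
std::unordered_set<Buffer, ObjectPtrHash, ObjectPtrEqual> buffers_used_in_pipeline_;
```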

@LeiWang1999 (Member, Author) commented

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@LeiWang1999 (Member, Author) commented

@regression-perf

@github-actions

Performance Benchmark Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/20487148153

Results

| File | Original Latency | Current Latency | Speedup |
| --- | --- | --- | --- |
| sparse_mla_fwd | 0.382253 | 0.393077 | 0.972461 |
| example_warp_specialize_gemm_softpipe_stage2 | 0.038849 | 0.039361 | 0.986992 |
| example_tilelang_gemm_fp8_intrinsic | 0.002656 | 0.0026875 | 0.988279 |
| example_warp_specialize_gemm_copy_1_gemm_0 | 0.038176 | 0.03856 | 0.990041 |
| example_convolution_autotune | 1.00044 | 0.994001 | 1.00647 |
| example_per_token_cast_to_fp8 | 0.00752166 | 0.0074651 | 1.00758 |
| example_warp_specialize_gemm_copy_0_gemm_1 | 0.03968 | 0.039361 | 1.0081 |
| example_mha_sink_bwd_bhsd | 0.0565265 | 0.0559693 | 1.00996 |
| example_dynamic | 0.028449 | 0.028001 | 1.016 |
| example_tilelang_nsa_fwd | 0.00715433 | 0.00703598 | 1.01682 |
| sparse_mla_fwd_pipelined | 0.122002 | 0.119984 | 1.01682 |
| example_tilelang_nsa_decode | 0.00684381 | 0.00672851 | 1.01714 |
| example_dequant_gemv_fp16xint4 | 0.00382928 | 0.00376196 | 1.0179 |
| example_mha_sink_fwd_bhsd_sliding_window | 0.013088 | 0.0128436 | 1.01903 |
| example_gqa_decode | 0.049634 | 0.048673 | 1.01974 |
| example_blocksparse_gemm | 0.0230624 | 0.0226143 | 1.01981 |
| example_gemm_schedule | 0.0330568 | 0.0324072 | 1.02005 |
| example_mha_sink_fwd_bhsd_wgmma_pipelined | 0.0157559 | 0.0154393 | 1.02051 |
| example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window | 0.014888 | 0.0145873 | 1.02062 |
| example_group_per_split_token_cast_to_fp8 | 0.0105471 | 0.0103309 | 1.02093 |
| example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window | 0.0158281 | 0.015501 | 1.0211 |
| fp8_lighting_indexer | 0.0366457 | 0.0358875 | 1.02113 |
| example_tilelang_gemm_splitk_vectorize_atomicadd | 0.0503977 | 0.0493025 | 1.02221 |
| example_tilelang_gemm_splitk | 0.0503619 | 0.0492637 | 1.02229 |
| example_gqa_sink_fwd_bhsd_wgmma_pipelined | 0.0147931 | 0.0144454 | 1.02407 |
| example_dequant_gemm_w4a8 | 0.005375 | 0.005248 | 1.0242 |
| block_sparse_attn_tilelang | 0.0104998 | 0.0102512 | 1.02425 |
| example_linear_attn_fwd | 0.0372223 | 0.0363358 | 1.0244 |
| sparse_mla_bwd | 0.392935 | 0.383489 | 1.02463 |
| example_mha_inference | 0.0818252 | 0.079835 | 1.02493 |
| example_mha_fwd_varlen | 0.0456897 | 0.0445441 | 1.02572 |
| example_elementwise_add | 0.0228358 | 0.0222348 | 1.02703 |
| tilelang_example_sparse_tensorcore | 0.0154533 | 0.0150447 | 1.02716 |
| example_tilelang_sparse_gqa_decode_varlen_indice | 0.0175062 | 0.0170367 | 1.02756 |
| example_tilelang_block_sparse_attn | 0.00801354 | 0.00779587 | 1.02792 |
| example_gemm | 0.023296 | 0.022656 | 1.02825 |
| example_tilelang_sparse_gqa_decode_varlen_mask | 0.0239587 | 0.0232995 | 1.02829 |
| example_mha_bwd_bshd_wgmma_pipelined | 0.0266238 | 0.025887 | 1.02846 |
| example_linear_attn_bwd | 0.155903 | 0.151562 | 1.02864 |
| example_mha_sink_fwd_bhsd | 0.0128616 | 0.0125035 | 1.02865 |
| example_dequant_gemm_fp4_hopper | 0.012608 | 0.012256 | 1.02872 |
| example_gqa_bwd_wgmma_pipelined | 0.0759523 | 0.0737451 | 1.02993 |
| example_mha_sink_bwd_bhsd_sliding_window | 0.0337344 | 0.0327255 | 1.03083 |
| example_mha_bwd_bhsd | 0.041422 | 0.0401725 | 1.0311 |
| example_mha_bwd_bshd | 0.042358 | 0.0410281 | 1.03241 |
| example_fusedmoe_tilelang | 0.15021 | 0.145483 | 1.03249 |
| example_tilelang_gemm_fp8 | 0.015616 | 0.01512 | 1.0328 |
| example_dequant_groupedgemm_bf16_mxfp4_hopper | 0.0200763 | 0.0194382 | 1.03283 |
| example_gemm_autotune | 0.023136 | 0.022368 | 1.03433 |
| example_dequant_gemm_bf16_fp4_hopper | 0.015296 | 0.014784 | 1.03463 |
| topk_selector | 0.0555474 | 0.0536876 | 1.03464 |
| example_gqa_bwd | 0.0513727 | 0.0496329 | 1.03505 |
| example_gemm_intrinsics | 0.036384 | 0.035136 | 1.03552 |
| example_gqa_bwd_tma_reduce_varlen | 0.0658616 | 0.0635316 | 1.03668 |
| example_gqa_sink_bwd_bhsd_sliding_window | 0.0264742 | 0.0255244 | 1.03721 |
| example_gqa_sink_bwd_bhsd | 0.0432679 | 0.0416996 | 1.03761 |
| example_convolution | 1.3688 | 1.31753 | 1.03892 |
| example_dequant_gemm_bf16_mxfp4_hopper | 0.012736 | 0.012256 | 1.03916 |
| example_vertical_slash_sparse_attn | 0.247504 | 0.238039 | 1.03976 |
| example_mla_decode | 0.473995 | 0.45495 | 1.04186 |
| example_warp_specialize_gemm_barrierpipe_stage2 | 0.039585 | 0.037889 | 1.04476 |
| example_gemv | 0.0685612 | 0.065532 | 1.04622 |
| example_topk | 0.011488 | 0.010976 | 1.04665 |
| example_tilelang_gemm_fp8_2xAcc | 0.197026 | 0.185505 | 1.0621 |

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@LeiWang1999 LeiWang1999 merged commit d140415 into tile-ai:main Dec 24, 2025
5 of 6 checks passed