
Conversation

@LeiWang1999 (Member) commented Dec 24, 2025

as title.

Summary by CodeRabbit

  • Refactor

    • Improved buffer allocation tracking and management across pipeline operations, including nested allocations and explicit local allocation tracking.
    • Pipeline processing now skips unused buffers and applies remappings back to outer scopes for safer, more efficient memory handling.
    • Added validation to prevent invalid multi-version buffer conflicts across pipelines.
  • Tests

    • Updated example/test configuration to use 0 stages (was 1), affecting compilation-stage behavior in test kernels.


* Added BufferUsageCollector class to identify and collect buffers used in pipeline loop bodies, enabling proper multi-versioning for software pipelining.
* Updated PipelineRewriter to handle local and outer block buffer allocations more effectively, ensuring that only necessary buffers are included in the pipeline.
* Enhanced buffer remapping logic to prevent conflicts when buffers from outer blocks are used in multiple pipeline loops.

This update improves the efficiency and correctness of buffer management during software pipelining.
* Adjusted formatting of buffer allocation declarations for improved readability.
* Ensured consistent style in the codebase by aligning variable declarations.

This change enhances code clarity without altering functionality.
@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai bot (Contributor) commented Dec 24, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Introduces BufferUsageCollector to identify buffers actually used in a statement and extends PipelineRewriter to accept local_allocs, enabling correct remapping/allocation of buffers across pipeline and outer blocks; adds conflict detection and propagation of remappings to outer block alloc_buffers.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Pipeline injection & buffer tracking (src/transform/inject_pipeline.cc) | Adds BufferUsageCollector (collects buffers used in a Stmt, respecting nested blocks and allocation guards). Extends the PipelineRewriter constructor to accept local_allocs and exposes GetBufferRemap(). PipelineRewriter now skips non-accessed buffers when computing versions. PipelineInjector now collects pipeline_allocs, tracks local vs. nested allocations, checks for multi-pipeline buffer conflicts, and applies pending remappings into outer block alloc_buffers during block visits. Also updates internal storage for buffers_used_in_pipeline and allocation tracking. |
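For orientation, here is a minimal sketch of what such a collector might look like, assuming TVM's StmtExprVisitor API. The class name follows the walkthrough; the method bodies and the handling of nested allocations are assumptions, not the actual source.

```cpp
#include <tvm/tir/stmt_functor.h>

#include <unordered_set>

using namespace tvm;
using namespace tvm::tir;

// Sketch only: collect every buffer that a statement reads or writes,
// plus buffers allocated by nested blocks.
class BufferUsageCollector : public StmtExprVisitor {
 public:
  // Entry point: returns the set of buffers used anywhere in `stmt`.
  static std::unordered_set<Buffer, ObjectPtrHash, ObjectPtrEqual> Collect(const Stmt& stmt) {
    BufferUsageCollector collector;
    collector.VisitStmt(stmt);
    return std::move(collector.used_buffers_);
  }

 private:
  void VisitExpr_(const BufferLoadNode* op) final {
    used_buffers_.insert(op->buffer);  // buffer is read
    StmtExprVisitor::VisitExpr_(op);
  }

  void VisitStmt_(const BufferStoreNode* op) final {
    used_buffers_.insert(op->buffer);  // buffer is written
    StmtExprVisitor::VisitStmt_(op);
  }

  void VisitStmt_(const BlockNode* op) final {
    // Buffers allocated by nested blocks are recorded as well.
    for (const Buffer& buffer : op->alloc_buffers) {
      used_buffers_.insert(buffer);
    }
    StmtExprVisitor::VisitStmt_(op);
  }

  std::unordered_set<Buffer, ObjectPtrHash, ObjectPtrEqual> used_buffers_;
};
```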

Sequence Diagram

```mermaid
sequenceDiagram
    participant PI as PipelineInjector
    participant BUC as BufferUsageCollector
    participant PR as PipelineRewriter
    participant OB as OuterBlock (alloc_buffers)

    PI->>BUC: Collect(stmt)                     %% gather used buffers
    BUC-->>PI: pipeline_allocs

    PI->>PI: Build local_allocs (track nested vs outer)
    PI->>PI: Check conflicts across pipelines

    PI->>PR: Construct(pipeline_allocs, local_allocs, ...)
    PR->>PR: Compute versions (skip non-accessed buffers)
    PR-->>PI: buffer_remap

    PI->>OB: Apply buffer_remap to alloc_buffers
    OB->>OB: Update alloc_buffers entries
    OB->>OB: Erase old mappings
    OB-->>PI: Remappings applied
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hop through code with careful paws,
I count the buffers, follow laws,
Local nests and outer beds,
I patch their names and mend their threads,
A tiny rabbit, fixing flaws.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29%, below the required threshold of 80.00%. | Run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: refactoring buffer allocation in the Inject Pipeline Pass, which aligns with the detailed changes shown in the summary involving BufferUsageCollector, PipelineRewriter enhancements, and buffer allocation logic. |

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3d4c33 and d38c8d8.

📒 Files selected for processing (1)
  • examples/gdn/test_example_gdn_compilation.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (1)
examples/gdn/test_example_gdn_compilation.py (1)

23-23: This file was newly created with num_stages = 0, not changed from an existing value.

The GDN test file is new and uses num_stages = 0 by design, consistent with all other GDN example kernels (wy_fast, chunk_o, chunk_delta_h, etc.). This appears intentional rather than a problematic configuration change. The systematic use of num_stages = 0 across the GDN suite may relate to known pipelining constraints (see test_tilelang_issue_96.py which notes "changing num_stages to 0 gives correct results").

Likely an incorrect or invalid review comment.



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/transform/inject_pipeline.cc (2)

79-85: Inconsistent buffer addition for nested block allocations.

In VisitStmt_(BlockNode*), buffers from op->alloc_buffers are added directly to used_buffers_ without checking allocated_buffers_, unlike AddBuffer() which filters through allocated_buffers_. This appears intentional since nested block buffers need multi-versioning regardless, but please verify this asymmetry is correct.

If this is intentional, consider adding a brief comment explaining why nested block allocations bypass the allocated_buffers_ check.
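To make the asymmetry concrete, here is a hypothetical reconstruction of the two paths, including the kind of comment being requested. The method and member names are taken from the review text; the bodies are guesses, not the file's actual code.

```cpp
// Hypothetical fragment of the collector, not the actual source.
void AddBuffer(const Buffer& buffer) {
  // Filter through allocated_buffers_ so function I/O buffers are skipped.
  if (allocated_buffers_.count(buffer)) {
    used_buffers_.insert(buffer);
  }
}

void VisitStmt_(const BlockNode* op) final {
  // Intentional asymmetry (the suggested comment): buffers allocated by
  // nested blocks need multi-versioning regardless, so they bypass the
  // allocated_buffers_ filter and are recorded directly.
  for (const Buffer& buffer : op->alloc_buffers) {
    used_buffers_.insert(buffer);
  }
  StmtExprVisitor::VisitStmt_(op);
}
```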


1148-1157: Consider extracting nested block iteration into a helper.

The pattern of iterating over pipeline_body_seq->seq to find nested blocks with SeqStmt bodies appears three times in this function (lines 1122-1133, 1148-1157, and 1216-1226). Consider extracting this into a helper function or consolidating the operations into a single pass.
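One possible shape for such a helper, using a hypothetical name and assuming the nested elements are BlockRealize nodes, as is typical for pipeline bodies:

```cpp
// Hypothetical helper: apply `fn` to every nested block in the pipeline
// body whose own body is a SeqStmt.
template <typename FVisit>
void ForEachNestedSeqBlock(const SeqStmt& pipeline_body_seq, FVisit fn) {
  for (const Stmt& stmt : pipeline_body_seq->seq) {
    if (const auto* realize = stmt.as<BlockRealizeNode>()) {
      if (realize->block->body->IsInstance<SeqStmtNode>()) {
        fn(realize->block);
      }
    }
  }
}
```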

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfccd63 and b3d4c33.

📒 Files selected for processing (1)
  • src/transform/inject_pipeline.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/inject_pipeline.cc
🧬 Code graph analysis (1)
src/transform/inject_pipeline.cc (2)
src/transform/multi_version_buffer_rewriter.cc (2)
  • buffer (255-265)
  • buffer (255-255)
src/transform/inject_tma_barrier.cc (3)
  • collector (416-416)
  • collector (477-477)
  • rewriter (96-96)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (5)
src/transform/inject_pipeline.cc (5)

293-315: LGTM!

The constructor documentation clearly explains the distinction between pipeline_allocs (all buffers needing multi-versioning) and local_allocs (buffers for re-allocation in the rewritten block). This separation properly handles outer-block buffers that should not be re-allocated locally.
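Schematically, the split might look like the following hypothetical signature; the parameter names come from the review, everything else is assumed.

```cpp
// Hypothetical sketch: pipeline_allocs = every buffer that needs
// multi-versioning; local_allocs = the subset to re-allocate inside the
// rewritten block. Outer-block buffers appear only in pipeline_allocs,
// so they are remapped but never re-allocated locally.
PipelineRewriter(Array<Buffer> pipeline_allocs, Array<Buffer> local_allocs)
    : pipeline_allocs_(std::move(pipeline_allocs)),
      local_allocs_(std::move(local_allocs)) {}
```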


410-414: LGTM!

Clean accessor method that properly exposes the buffer remapping for updating outer block allocations.


1159-1170: Good addition of conflict detection for multi-versioned buffers.

This check correctly prevents undefined behavior when the same external buffer would be multi-versioned across multiple pipeline loops. The error message is clear and actionable.
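A rough sketch of what such a guard could look like; the member name follows the PR description, and the diagnostic text is illustrative only.

```cpp
// Hypothetical conflict check: an outer-block buffer may be
// multi-versioned by at most one pipeline loop.
for (const Buffer& buffer : pipeline_allocs) {
  ICHECK(!buffers_used_in_pipeline_.count(buffer))
      << "Buffer " << buffer->name
      << " would be multi-versioned by more than one pipeline loop, "
      << "which is not supported.";
  buffers_used_in_pipeline_.insert(buffer);
}
```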


1277-1296: LGTM!

The pending buffer remap logic correctly applies remappings to outer block alloc_buffers after visiting children. Erasing from pending_buffer_remap_ after applying prevents duplicate remapping. The traversal order (children first via StmtExprMutator::VisitStmt_, then remap application) ensures pipeline rewriting in ForNode visitor happens before this remap is applied.
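As a sketch, the remap application in the BlockNode visitor could look like the following; names mirror the review text and are not guaranteed to match the source.

```cpp
// Hypothetical BlockNode visitor fragment: after children have been
// rewritten, swap any alloc_buffers that have a pending remapping, then
// drop the applied entries so they are never remapped twice.
Stmt VisitStmt_(const BlockNode* op) final {
  Block block = Downcast<Block>(StmtExprMutator::VisitStmt_(op));
  Array<Buffer> new_allocs;
  bool changed = false;
  for (const Buffer& buffer : block->alloc_buffers) {
    auto it = pending_buffer_remap_.find(buffer);
    if (it != pending_buffer_remap_.end()) {
      new_allocs.push_back(it->second);  // apply the remapping
      pending_buffer_remap_.erase(it);   // prevent duplicate remapping
      changed = true;
    } else {
      new_allocs.push_back(buffer);
    }
  }
  if (changed) {
    block.CopyOnWrite()->alloc_buffers = std::move(new_allocs);
  }
  return block;
}
```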


1325-1331: LGTM!

The new member variables are appropriately typed and serve clear purposes:

  • allocated_buffers_: enables the collector to filter function I/O buffers
  • pending_buffer_remap_: defers remap application to outer block processing
  • buffers_used_in_pipeline_: tracks external buffers for conflict detection
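Putting the three roles above into a rough declaration sketch (the container types are assumptions; the actual types in inject_pipeline.cc may differ):

```cpp
// Hypothetical member declarations matching the review's description.
std::unordered_set<Buffer, ObjectPtrHash, ObjectPtrEqual> allocated_buffers_;
std::unordered_map<Buffer, Buffer, ObjectPtrHash, ObjectPtrEqual> pending_buffer_remap_;
std::unordered_set<Buffer, ObjectPtrHash, ObjectPtrEqual> buffers_used_in_pipeline_;
```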

@LeiWang1999 (Member, Author) commented

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@LeiWang1999 (Member, Author) commented

@regression-perf

@github-actions

Performance Benchmark Report

Triggered by: @LeiWang1999
Workflow run: https://github.com/tile-ai/tilelang/actions/runs/20487148153

Results

| File | Original Latency | Current Latency | Speedup |
| --- | --- | --- | --- |
| sparse_mla_fwd | 0.382253 | 0.393077 | 0.972461 |
| example_warp_specialize_gemm_softpipe_stage2 | 0.038849 | 0.039361 | 0.986992 |
| example_tilelang_gemm_fp8_intrinsic | 0.002656 | 0.0026875 | 0.988279 |
| example_warp_specialize_gemm_copy_1_gemm_0 | 0.038176 | 0.03856 | 0.990041 |
| example_convolution_autotune | 1.00044 | 0.994001 | 1.00647 |
| example_per_token_cast_to_fp8 | 0.00752166 | 0.0074651 | 1.00758 |
| example_warp_specialize_gemm_copy_0_gemm_1 | 0.03968 | 0.039361 | 1.0081 |
| example_mha_sink_bwd_bhsd | 0.0565265 | 0.0559693 | 1.00996 |
| example_dynamic | 0.028449 | 0.028001 | 1.016 |
| example_tilelang_nsa_fwd | 0.00715433 | 0.00703598 | 1.01682 |
| sparse_mla_fwd_pipelined | 0.122002 | 0.119984 | 1.01682 |
| example_tilelang_nsa_decode | 0.00684381 | 0.00672851 | 1.01714 |
| example_dequant_gemv_fp16xint4 | 0.00382928 | 0.00376196 | 1.0179 |
| example_mha_sink_fwd_bhsd_sliding_window | 0.013088 | 0.0128436 | 1.01903 |
| example_gqa_decode | 0.049634 | 0.048673 | 1.01974 |
| example_blocksparse_gemm | 0.0230624 | 0.0226143 | 1.01981 |
| example_gemm_schedule | 0.0330568 | 0.0324072 | 1.02005 |
| example_mha_sink_fwd_bhsd_wgmma_pipelined | 0.0157559 | 0.0154393 | 1.02051 |
| example_gqa_sink_fwd_bhsd_wgmma_pipelined_sliding_window | 0.014888 | 0.0145873 | 1.02062 |
| example_group_per_split_token_cast_to_fp8 | 0.0105471 | 0.0103309 | 1.02093 |
| example_mha_sink_fwd_bhsd_wgmma_pipelined_sliding_window | 0.0158281 | 0.015501 | 1.0211 |
| fp8_lighting_indexer | 0.0366457 | 0.0358875 | 1.02113 |
| example_tilelang_gemm_splitk_vectorize_atomicadd | 0.0503977 | 0.0493025 | 1.02221 |
| example_tilelang_gemm_splitk | 0.0503619 | 0.0492637 | 1.02229 |
| example_gqa_sink_fwd_bhsd_wgmma_pipelined | 0.0147931 | 0.0144454 | 1.02407 |
| example_dequant_gemm_w4a8 | 0.005375 | 0.005248 | 1.0242 |
| block_sparse_attn_tilelang | 0.0104998 | 0.0102512 | 1.02425 |
| example_linear_attn_fwd | 0.0372223 | 0.0363358 | 1.0244 |
| sparse_mla_bwd | 0.392935 | 0.383489 | 1.02463 |
| example_mha_inference | 0.0818252 | 0.079835 | 1.02493 |
| example_mha_fwd_varlen | 0.0456897 | 0.0445441 | 1.02572 |
| example_elementwise_add | 0.0228358 | 0.0222348 | 1.02703 |
| tilelang_example_sparse_tensorcore | 0.0154533 | 0.0150447 | 1.02716 |
| example_tilelang_sparse_gqa_decode_varlen_indice | 0.0175062 | 0.0170367 | 1.02756 |
| example_tilelang_block_sparse_attn | 0.00801354 | 0.00779587 | 1.02792 |
| example_gemm | 0.023296 | 0.022656 | 1.02825 |
| example_tilelang_sparse_gqa_decode_varlen_mask | 0.0239587 | 0.0232995 | 1.02829 |
| example_mha_bwd_bshd_wgmma_pipelined | 0.0266238 | 0.025887 | 1.02846 |
| example_linear_attn_bwd | 0.155903 | 0.151562 | 1.02864 |
| example_mha_sink_fwd_bhsd | 0.0128616 | 0.0125035 | 1.02865 |
| example_dequant_gemm_fp4_hopper | 0.012608 | 0.012256 | 1.02872 |
| example_gqa_bwd_wgmma_pipelined | 0.0759523 | 0.0737451 | 1.02993 |
| example_mha_sink_bwd_bhsd_sliding_window | 0.0337344 | 0.0327255 | 1.03083 |
| example_mha_bwd_bhsd | 0.041422 | 0.0401725 | 1.0311 |
| example_mha_bwd_bshd | 0.042358 | 0.0410281 | 1.03241 |
| example_fusedmoe_tilelang | 0.15021 | 0.145483 | 1.03249 |
| example_tilelang_gemm_fp8 | 0.015616 | 0.01512 | 1.0328 |
| example_dequant_groupedgemm_bf16_mxfp4_hopper | 0.0200763 | 0.0194382 | 1.03283 |
| example_gemm_autotune | 0.023136 | 0.022368 | 1.03433 |
| example_dequant_gemm_bf16_fp4_hopper | 0.015296 | 0.014784 | 1.03463 |
| topk_selector | 0.0555474 | 0.0536876 | 1.03464 |
| example_gqa_bwd | 0.0513727 | 0.0496329 | 1.03505 |
| example_gemm_intrinsics | 0.036384 | 0.035136 | 1.03552 |
| example_gqa_bwd_tma_reduce_varlen | 0.0658616 | 0.0635316 | 1.03668 |
| example_gqa_sink_bwd_bhsd_sliding_window | 0.0264742 | 0.0255244 | 1.03721 |
| example_gqa_sink_bwd_bhsd | 0.0432679 | 0.0416996 | 1.03761 |
| example_convolution | 1.3688 | 1.31753 | 1.03892 |
| example_dequant_gemm_bf16_mxfp4_hopper | 0.012736 | 0.012256 | 1.03916 |
| example_vertical_slash_sparse_attn | 0.247504 | 0.238039 | 1.03976 |
| example_mla_decode | 0.473995 | 0.45495 | 1.04186 |
| example_warp_specialize_gemm_barrierpipe_stage2 | 0.039585 | 0.037889 | 1.04476 |
| example_gemv | 0.0685612 | 0.065532 | 1.04622 |
| example_topk | 0.011488 | 0.010976 | 1.04665 |
| example_tilelang_gemm_fp8_2xAcc | 0.197026 | 0.185505 | 1.0621 |

Artifacts

  • regression_result.png (speedup plot) is attached as a workflow artifact. Download it from the workflow run page above.

@LeiWang1999 LeiWang1999 merged commit d140415 into tile-ai:main Dec 24, 2025
5 of 6 checks passed