
Conversation


@w169q169 w169q169 commented Dec 22, 2025

Add PDL support as per Issue #1463

Summary by CodeRabbit

  • New Features

    • Added pdl_trigger and pdl_sync primitives for CUDA kernel synchronization
    • Automatic detection of PDL-capable CUDA targets (compute ≥ 9.0)
    • Per-kernel dispatch path and build-time transform to annotate or remove PDL sync usage
  • Documentation

    • New "Synchronization helpers" subsection in the Instruction Reference documenting the primitives
  • Tests

    • New unit and integration tests exercising PDL trigger/sync across codegen and JIT backends


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Contributor

coderabbitai bot commented Dec 22, 2025

📝 Walkthrough

Walkthrough

Adds CUDA PDL support: two TileLang helpers (pdl_trigger, pdl_sync), a C++ IR pass to mark or eliminate corresponding extern CUDA calls based on target capability, target detection, compilation-pipeline integration, NVRTC/CUDA launch-path wiring, and unit/integration tests.
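For orientation, here is a minimal usage sketch of the two helpers. The API shape is assumed from the summary above; the kernel structure, names, shapes, and block sizes are illustrative, not code taken from this PR.

# A producer kernel signals early completion with T.pdl_trigger(); a dependent
# kernel waits on the grid dependency with T.pdl_sync() before consuming its
# output. The PDL path is only kept for targets with SM >= 9.0.
import tilelang.language as T

N, block_N = 1024, 128

@T.prim_func
def pdl_pipeline(A: T.Tensor((N,), "float32"),
                 B: T.Tensor((N,), "float32"),
                 C: T.Tensor((N,), "float32")):
    # First kernel: B = A + 1.0, then signal programmatic launch completion.
    with T.Kernel(T.ceildiv(N, block_N), threads=block_N) as bx:
        for i in T.Parallel(block_N):
            B[bx * block_N + i] = A[bx * block_N + i] + 1.0
        T.pdl_trigger()
    # Second kernel: wait for the producer grid before consuming B.
    with T.Kernel(T.ceildiv(N, block_N), threads=block_N) as bx:
        T.pdl_sync()
        for i in T.Parallel(block_N):
            C[bx * block_N + i] = B[bx * block_N + i] * 2.0

# kernel = tilelang.compile(pdl_pipeline, target="cuda")  # hypothetical driver call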

Changes

Cohort / File(s) and Summary:

  • Docs (docs/programming_guides/instructions.md): Documented new Synchronization helpers: pdl_trigger() and pdl_sync().
  • Language API (tilelang/language/pdl.py, tilelang/language/__init__.py): New helpers pdl_trigger() / pdl_sync() that emit extern CUDA calls and are re-exported from the package (see the sketch after this list).
  • C++ transform pass (src/transform/lower_pdl.cc, src/transform/common/attr.h): Added MarkCudaSyncCalls and EliminateCudaSyncCalls mutators; new pass factory MarkCudaSyncCallsPass(bool have_pdl); new attribute names has_cuda_pdl_trigger / has_cuda_pdl_sync.
  • Transform binding (tilelang/transform/__init__.py): Exposed a MarkCudaSyncCalls(have_pdl: bool = False) Python wrapper for the C++ pass.
  • Target detection (tilelang/contrib/nvcc.py): Added have_pdl(target) to detect SM >= 9.x for enabling PDL paths.
  • Engine pipeline (tilelang/engine/phase.py): Conditionally injects the MarkCudaSyncCalls pass into OptimizeForTarget when have_pdl(target) is true.
  • CUDA host wrapper (tilelang/jit/adapter/wrapper.py): Added pdl_sync_map tracking, changed host function iteration to use name hints, and routes marked functions to a programmatic stream-serialization launch path.
  • NVRTC adapter (tilelang/jit/adapter/nvrtc/wrapper.py): Added a PDL_SYNC_PY snippet and generate_pdl_sync_code(function_name) to emit per-kernel PDL launch attribute code; injects the snippet into generated launch code.
  • JIT tests (testing/python/jit/test_tilelang_jit_nvrtc.py, testing/python/jit/test_tilelang_jit_cython.py): Added integration tests exercising kernels using pdl_trigger/pdl_sync and validating runtime behavior.
  • Language tests (testing/python/language/test_tilelang_language_pdl.py): Added tests verifying codegen includes/omits CUDA PDL extern symbols across SM targets.
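As referenced in the Language API row above, a hedged sketch of what tilelang/language/pdl.py plausibly contains. Only the extern symbol names come from this PR; the return dtype, import path, and docstrings are assumptions for illustration.

# Helpers that emit extern CUDA calls, per the summary above. The "handle"
# return dtype and the import path are assumptions, not the exact file content.
from tilelang.language.tir.op import call_extern

__all__ = ["pdl_sync", "pdl_trigger"]


def pdl_trigger():
    """Signal that dependent kernels may begin (device-side
    cudaTriggerProgrammaticLaunchCompletion)."""
    return call_extern("handle", "cudaTriggerProgrammaticLaunchCompletion")


def pdl_sync():
    """Wait for the grid this kernel depends on (device-side
    cudaGridDependencySynchronize)."""
    return call_extern("handle", "cudaGridDependencySynchronize")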

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant User as User Code
    participant Lang as TileLang\n(pdl helpers)
    participant IR as TVM IR\n(call_extern / PrimFunc)
    participant Transform as MarkCudaSyncCallsPass
    participant Engine as OptimizeForTarget
    participant Wrapper as CUDA Host Wrapper
    participant NVRTC as NVRTC Adapter
    participant CUDA as CUDA Runtime

    User->>Lang: call pdl_trigger()/pdl_sync()
    Lang->>IR: emit tir.call_extern("cudaTriggerProgrammaticLaunchCompletion" / "cudaGridDependencySynchronize")
    Engine->>Transform: Apply MarkCudaSyncCallsPass(have_pdl)
    alt have_pdl = true
        Transform->>IR: mark PrimFunc attrs\nhas_cuda_pdl_trigger / has_cuda_pdl_sync
    else have_pdl = false
        Transform->>IR: replace extern calls with 0
    end
    IR->>Wrapper: parse module, read function attrs
    Wrapper->>Wrapper: populate pdl_sync_map for marked functions
    Wrapper->>NVRTC: select launch path (PDL vs standard)
    NVRTC->>NVRTC: inject PDL_SYNC_PY if needed
    NVRTC->>CUDA: launch kernels (with/without programmatic serialization)
    CUDA->>CUDA: execute kernels

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • [Feature Request] PDL Support #1463 – Matches the PDL feature request (T.pdl_trigger / T.pdl_sync and PDL initialization); this PR implements those helpers, IR handling, and runtime wiring.

Possibly related PRs

Suggested reviewers

  • LeiWang1999

"I hop and code beneath the moon,
I signal kernels, then they tune,
A trigger here, a sync there too,
Streams march orderly — that's my cue 🐇✨"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 28.95%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title '[Feat] PDL Support' clearly identifies the main feature being added (PDL support) and aligns with the comprehensive changeset covering PDL integration across language, transformation, JIT, and testing layers.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (8)
tilelang/jit/adapter/libgen.py (1)

151-153: Consider truncating lib_code in error message.

Including the full source code in the exception can produce extremely long error messages, especially for large kernels. Consider truncating or omitting self.lib_code from the error, or logging it separately at debug level.

🔎 Proposed improvement
         if ret.returncode != 0:
             command_str = " ".join(command)
-            raise RuntimeError(f"Compilation Failed! {command_str}\n {self.lib_code}")
+            raise RuntimeError(f"Compilation Failed! Command: {command_str}\nSource saved at: {src.name}")
testing/python/jit/test_tilelang_jit_nvrtc.py (1)

505-552: Consider adding a compute capability check for PDL support.

Based on tilelang/contrib/nvcc.py (lines 589-594), PDL requires compute capability >= 9.0 (Hopper or later). Other tests in this file, like test_nvrtc_im2col_tma_desc, skip on non-Hopper GPUs. Consider adding a similar check here to avoid test failures on older hardware.

🔎 Proposed fix
 def test_nvrtc_pdl():
     """Test pdl."""
+    if not check_hopper():
+        import pytest
+
+        pytest.skip("Test requires Hopper GPU (compute capability 9.0) for PDL support")
 
     N = 64
tilelang/jit/adapter/wrapper.py (1)

213-213: Consider using set instead of dict for pdl_sync_map.

The pdl_sync_map dictionary always stores 0 as the value (line 450). If the value is unused, a set would be more appropriate and clearer in intent.

🔎 Proposed change
-        self.pdl_sync_map: dict[str, int] | None = {}
+        self.pdl_sync_map: set[str] | None = set()

And update usages:

-            if "has_cuda_pdl_sync" in attrs:
-                self.pdl_sync_map[function_name] = 0
+            if "has_cuda_pdl_sync" in attrs:
+                self.pdl_sync_map.add(function_name)
tilelang/transform/__init__.py (1)

403-406: LGTM with suggestion: Consider expanding the docstring.

The function follows the established pattern for pass wrappers. Consider adding a more detailed docstring similar to other functions in this file, documenting the have_pdl parameter and return type.

🔎 Suggested docstring enhancement
 def MarkCudaSyncCalls(have_pdl: bool = False):
-    """MarkCudaSyncCalls"""
+    """Mark CUDA synchronization calls for PDL support.
+
+    Parameters
+    ----------
+    have_pdl : bool
+        Whether the target supports PDL (Programmatic Device Launch).
+        PDL is available on compute capability >= 9.0 (Hopper+).
+
+    Returns
+    -------
+    fpass : tvm.transform.Pass
+        The result pass
+    """
     return _ffi_api.MarkCudaSyncCalls(have_pdl)  # type: ignore
tilelang/contrib/nvcc.py (1)

590-596: Minor: Prefix unused variable with underscore.

The minor variable from parse_compute_version is unpacked but never used. This is a valid static analysis hint.

Proposed fix
 def have_pdl(target):
     if target.kind.name != "cuda":
         return False
     compute_version = get_target_compute_version(target)
-    major, minor = parse_compute_version(compute_version)
+    major, _ = parse_compute_version(compute_version)
     return major >= 9
testing/python/jit/test_tilelang_jit_ctypes.py (1)

1-1: Remove unused import.

The tvm import is not used in this test file.

Proposed fix
-from tilelang import tvm as tvm
 import tilelang.language as T
src/transform/lower_pdl.cc (2)

83-83: Typo in class name: ElininateCudaSyncCallsEliminateCudaSyncCalls.

The class name has a typo ("Elininate" instead of "Eliminate"). While this doesn't affect functionality, fixing it improves code readability and maintainability.

Proposed fix
-class ElininateCudaSyncCalls : public StmtExprMutator {
+class EliminateCudaSyncCalls : public StmtExprMutator {
 public:
   static PrimFunc Substitute(PrimFunc f) {
-    ElininateCudaSyncCalls mutator;
+    EliminateCudaSyncCalls mutator;
     PrimFunc new_f = f;
     new_f.CopyOnWrite()->body = mutator.VisitStmt(f->body);
 
     return new_f;
   }
   // ... rest of class ...
 private:
-  ElininateCudaSyncCalls() = default;
+  EliminateCudaSyncCalls() = default;
 };

Also update line 141:

   return have_pdl ? MarkCudaSyncCalls::Substitute(f)
-                  : ElininateCudaSyncCalls::Substitute(f);
+                  : EliminateCudaSyncCalls::Substitute(f);

57-78: Duplicate private: access specifier.

There are two private: access specifiers (lines 57 and 76). Consider consolidating them for cleaner code organization.

Proposed fix
-private:
   void CheckCall(const tir::CallNode *call) {
     if (!call)
       return;
     // ... implementation ...
   }

-private:
   bool has_trigger_launch_ = false;
   bool has_grid_sync_ = false;

   MarkCudaSyncCalls() = default;
+
+private:
+  void CheckCall(const tir::CallNode *call) {
+    if (!call)
+      return;
+    // ... implementation ...
+  }
+
+  bool has_trigger_launch_ = false;
+  bool has_grid_sync_ = false;
+
+  MarkCudaSyncCalls() = default;
 };

Note: The Cppcheck "syntax error" on line 147 is a false positive — TVM_FFI_STATIC_INIT_BLOCK is a valid TVM macro for FFI registration.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a431797 and 2a67387.

📒 Files selected for processing (13)
  • docs/programming_guides/instructions.md
  • src/transform/lower_pdl.cc
  • testing/python/jit/test_tilelang_jit_ctypes.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/language/test_tilelang_language_pdl.py
  • tilelang/contrib/nvcc.py
  • tilelang/engine/phase.py
  • tilelang/jit/adapter/libgen.py
  • tilelang/jit/adapter/nvrtc/wrapper.py
  • tilelang/jit/adapter/wrapper.py
  • tilelang/language/__init__.py
  • tilelang/language/pdl.py
  • tilelang/transform/__init__.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/jit/test_tilelang_jit_ctypes.py
  • tilelang/jit/adapter/nvrtc/wrapper.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_pdl.py
🧬 Code graph analysis (10)
tilelang/transform/__init__.py (2)
src/transform/lower_pdl.cc (1)
  • MarkCudaSyncCalls (80-80)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
tilelang/language/__init__.py (1)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
tilelang/language/pdl.py (1)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/engine/phase.py (3)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
tilelang/transform/__init__.py (1)
  • MarkCudaSyncCalls (403-405)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/language/test_tilelang_language_pdl.py (1)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
testing/python/jit/test_tilelang_jit_ctypes.py (3)
testing/python/jit/test_tilelang_jit_nvrtc.py (7)
  • test_nvrtc_pdl (505-552)
  • multi_kernels_with_pdl (511-532)
  • main (30-50)
  • main (138-158)
  • main (384-413)
  • main (513-530)
  • kernel (469-484)
testing/python/language/test_tilelang_language_pdl.py (2)
  • main (7-16)
  • main (23-32)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
tilelang/contrib/nvcc.py (1)
tilelang/contrib/rocm.py (1)
  • parse_compute_version (179-201)
tilelang/jit/adapter/wrapper.py (1)
tilelang/jit/adapter/utils.py (1)
  • parse_function_call_args (315-357)
tilelang/jit/adapter/libgen.py (2)
tilelang/jit/adapter/ctypes/adapter.py (1)
  • lib_code (287-289)
tilelang/jit/adapter/cython/adapter.py (1)
  • lib_code (371-373)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc

[error] 147-147: syntax error

(syntaxError)

🪛 Ruff (0.14.8)
tilelang/language/__init__.py

120-120: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


121-121: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

tilelang/language/pdl.py

4-7: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

tilelang/contrib/nvcc.py

594-594: Unpacked variable minor is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tilelang/jit/adapter/libgen.py

153-153: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (8)
tilelang/jit/adapter/wrapper.py (3)

141-155: LGTM: PDL kernel launch code structure looks correct.

The KERNEL_LAUNCH_FUNC_CODE correctly configures cudaLaunchConfig_t with cudaLaunchAttributeProgrammaticStreamSerialization and uses cudaLaunchKernelEx for the kernel launch. The attribute setup follows CUDA's programmatic stream serialization pattern.
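For reference, the general CUDA pattern such a launch path follows is sketched below. This is an illustration of the documented CUDA API usage, not the generated wrapper code; the kernel, launch geometry, and stream are placeholders.

// Opt a launch into programmatic stream serialization and launch with
// cudaLaunchKernelEx (requires CUDA 11.8+; takes effect on SM >= 9.0).
#include <cuda_runtime.h>

__global__ void consumer_kernel(const float *in, float *out, int n);

void launch_with_pdl(const float *in, float *out, int n, cudaStream_t stream) {
  cudaLaunchConfig_t config = {};
  config.gridDim = dim3(64);
  config.blockDim = dim3(128);
  config.dynamicSmemBytes = 0;
  config.stream = stream;

  cudaLaunchAttribute attr = {};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;
  config.attrs = &attr;
  config.numAttrs = 1;

  // The consumer kernel can then call cudaGridDependencySynchronize() on the
  // device before reading data produced by the preceding kernel on the stream.
  cudaLaunchKernelEx(&config, consumer_kernel, in, out, n);
}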


313-330: LGTM: PDL branch in dispatch function is correctly structured.

The new PDL path properly:

  1. Parses function call arguments
  2. Validates argument count matches parameters
  3. Uses KERNEL_LAUNCH_FUNC_CODE for kernel launch with PDL attributes
  4. Includes error checking after launch

448-450: LGTM: PDL attribute detection correctly populates the sync map.

The has_cuda_pdl_sync attribute is checked and the function is registered in pdl_sync_map for later use during code generation.

tilelang/language/__init__.py (1)

119-123: LGTM!

The PDL function re-exports follow the established pattern in this file. The # noqa: F401 comments are appropriate for suppressing "imported but unused" warnings in __init__.py files intended to expose public API.

docs/programming_guides/instructions.md (1)

141-144: LGTM!

Documentation for the new PDL synchronization helpers is clear and follows the established format of the instruction reference.

tilelang/engine/phase.py (1)

255-257: LGTM!

The MarkCudaSyncCalls pass is correctly integrated after SplitHostDevice. The unconditional application is appropriate since the pass internally handles both cases: marking sync calls when PDL is available (sm_90+) or eliminating them for older architectures.
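A hedged sketch of what that integration amounts to; the helper name and surrounding pipeline code are illustrative, not the actual phase.py.

# Inject the pass parameterized by the target's PDL capability: on SM >= 9.0 it
# marks the extern calls, otherwise it removes them so older targets still compile.
from tilelang.contrib.nvcc import have_pdl
import tilelang.transform as tl_transform


def pdl_passes(target):
    return [tl_transform.MarkCudaSyncCalls(have_pdl=have_pdl(target))]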

testing/python/language/test_tilelang_language_pdl.py (1)

37-61: Test implementation looks good.

The tests correctly verify that PDL synchronization calls are present in generated code for sm_90 (Hopper+) and absent for sm_75 (pre-Hopper). This validates the MarkCudaSyncCalls pass behavior.

src/transform/lower_pdl.cc (1)

29-43: LGTM on the MarkCudaSyncCalls implementation.

The mutator correctly scans for CUDA sync extern calls and annotates the PrimFunc with has_cuda_pdl_trigger and has_cuda_pdl_sync attributes. The pattern of visiting both statements and expressions ensures comprehensive detection.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
src/transform/lower_pdl.cc (3)

9-9: Remove duplicate include.

The header tvm/tir/builtin.h is included twice (line 9 and line 14).

🔎 Proposed fix
 #include "../op/builtin.h"
 #include "../target/utils.h"
 #include "tvm/ir/type.h"
 #include "tvm/tir/builtin.h"
 #include "tvm/tir/expr.h"
 #include "tvm/tir/stmt.h"
 #include <tvm/ffi/reflection/registry.h>
 #include <tvm/tir/analysis.h>
-#include <tvm/tir/builtin.h>
 #include <tvm/tir/stmt_functor.h>
 #include <tvm/tir/transform.h>

Also applies to: 14-14


58-74: Extract duplicated logic into a helper function.

The CheckCall methods in both MarkCudaSyncCalls (lines 58-74) and EliminateCudaSyncCalls (lines 111-130) contain nearly identical logic for detecting PDL sync function names. Consider extracting this into a common helper function to improve maintainability.

🔎 Proposed refactor

Add a helper function in an anonymous namespace:

namespace {
// Returns true if the call is a CUDA PDL sync call
bool IsPDLSyncCall(const tir::CallNode *call) {
  if (!call)
    return false;
  
  if (call->op.same_as(builtin::call_extern())) {
    if (!call->args.empty()) {
      if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
        std::string func_name = str_node->value;
        return func_name == "cudaTriggerProgrammaticLaunchCompletion" ||
               func_name == "cudaGridDependencySynchronize";
      }
    }
  }
  return false;
}

// Returns the specific PDL sync type (0=none, 1=trigger, 2=sync)
int GetPDLSyncType(const tir::CallNode *call) {
  if (!call)
    return 0;
  
  if (call->op.same_as(builtin::call_extern())) {
    if (!call->args.empty()) {
      if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
        std::string func_name = str_node->value;
        if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
          return 1;
        } else if (func_name == "cudaGridDependencySynchronize") {
          return 2;
        }
      }
    }
  }
  return 0;
}
} // anonymous namespace

Then update both classes to use these helpers.

Also applies to: 111-130


138-145: Consider more descriptive pass name.

The pass name "tl.MarkCudaSyncCalls" only mentions "Mark" behavior, but the pass can also eliminate calls when have_pdl is false. Consider a name like "tl.ProcessCudaSyncCalls" or document the dual behavior clearly.

tilelang/jit/adapter/nvrtc/wrapper.py (1)

479-486: Docstring enhancement suggestion (optional):

The docstring is functional but could be improved to match the detail level of generate_l2_persistent_map. Consider documenting what PDL synchronization is, the expected structure of pdl_sync_map, and when/why this returns non-empty code:

    def generate_pdl_sync_code(self, function_name: str) -> str:
-        """
-        Generate Python code to insert PDL synchronization for a given kernel.
-        """
+        """Generate Python code to configure PDL synchronization for a kernel.
+
+        PDL (Programmatic Device Launch) enables programmatic stream serialization
+        for fine-grained kernel synchronization on compatible CUDA architectures.
+
+        Args:
+            function_name: Kernel name to check for PDL sync config
+
+        Returns:
+            Python code that sets programmatic stream serialization attributes,
+            or empty string if PDL sync is not configured for this kernel.
+        """
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f50aa99 and cb052f4.

📒 Files selected for processing (2)
  • src/transform/lower_pdl.cc
  • tilelang/jit/adapter/nvrtc/wrapper.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc

[error] 147-147: syntax error

(syntaxError)

🔇 Additional comments (5)
src/transform/lower_pdl.cc (2)

29-81: LGTM!

The MarkCudaSyncCalls class correctly implements the TVM StmtExprMutator pattern to detect PDL synchronization calls and mark the function with appropriate attributes. The logic is defensive and properly chains to the base visitor methods.


147-151: LGTM! Static analysis false positive can be ignored.

The FFI registration correctly exposes the pass through TVM's reflection mechanism. The cppcheck syntax error is a false positive—cppcheck doesn't recognize the TVM_FFI_STATIC_INIT_BLOCK macro, which is a valid TVM construct.
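For readers unfamiliar with the macro, a hedged sketch of what such a registration block typically looks like in current TVM; the registered global name and pass symbol here are assumptions, not the exact lines from lower_pdl.cc.

// Register the pass factory with TVM's FFI reflection registry at static
// initialization time; Cppcheck does not expand this macro, hence the
// false-positive syntax error.
TVM_FFI_STATIC_INIT_BLOCK({
  namespace refl = tvm::ffi::reflection;
  refl::GlobalDef().def("tl.transform.MarkCudaSyncCalls", MarkCudaSyncCallsPass);
});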

tilelang/jit/adapter/nvrtc/wrapper.py (3)

172-186: LGTM! Past config overwriting issue has been resolved.

The PDL synchronization code correctly assumes the config object already exists and modifies it in place, rather than creating a new instance. The placement at line 207 (via placeholder {11}) ensures this code executes after config = CUlaunchConfig() is created at line 198.
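For context, a hedged Python sketch of the kind of mutation the injected snippet performs on that existing config object; the cuda-python enum and field spellings are assumptions.

# Request programmatic stream serialization on an already-created CUlaunchConfig
# instead of replacing it, matching the note above.
from cuda import cuda


def apply_pdl_attrs(config):
    attr = cuda.CUlaunchAttribute()
    attr.id = cuda.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION
    attr.value.programmaticStreamSerializationAllowed = 1
    config.attrs = [attr]
    config.numAttrs = 1
    return config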


207-207: LGTM! PDL sync placeholder correctly positioned.

The placeholder {11} is correctly placed after config creation and before kernel launch, ensuring PDL attributes can be applied to the existing config object.


423-424: LGTM! PDL sync code correctly integrated.

The per-kernel PDL synchronization code is generated and passed as the 11th format argument, correctly matching the {11} placeholder in KERNEL_LAUNCH_FUNC_PY. The pattern follows the established approach used for L2 persistent map generation.

Also applies to: 438-438

@silentCoder-dev
Collaborator

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)

58-74: Consider extracting duplicated CheckCall logic.

The logic for detecting CUDA PDL function names is duplicated between MarkCudaSyncCalls::CheckCall and EliminateCudaSyncCalls::CheckCall. Consider extracting this into a shared helper function to improve maintainability.

🔎 Proposed refactor to extract common helper

Add a helper function in the anonymous namespace:

+namespace {
+
+enum class CudaPdlCallType { kNone, kTriggerLaunch, kGridSync };
+
+CudaPdlCallType GetCudaPdlCallType(const tir::CallNode *call) {
+  if (!call || !call->op.same_as(builtin::call_extern()) || call->args.empty()) {
+    return CudaPdlCallType::kNone;
+  }
+  
+  if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
+    std::string func_name = str_node->value;
+    if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
+      return CudaPdlCallType::kTriggerLaunch;
+    } else if (func_name == "cudaGridDependencySynchronize") {
+      return CudaPdlCallType::kGridSync;
+    }
+  }
+  return CudaPdlCallType::kNone;
+}
+
+} // namespace
+
 class MarkCudaSyncCalls : public StmtExprMutator {

Then simplify both CheckCall methods:

 private:
   void CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return;
-    if (call->op.same_as(builtin::call_extern())) {
-      if (!call->args.empty()) {
-        if (const auto *str_node =
-                call->args[0].as<tvm::tir::StringImmNode>()) {
-          std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
-            has_trigger_launch_ = true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
-            has_grid_sync_ = true;
-          }
-        }
-      }
+    auto call_type = GetCudaPdlCallType(call);
+    if (call_type == CudaPdlCallType::kTriggerLaunch) {
+      has_trigger_launch_ = true;
+    } else if (call_type == CudaPdlCallType::kGridSync) {
+      has_grid_sync_ = true;
     }
   }

And in EliminateCudaSyncCalls:

 private:
   bool CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return false;
-
-    if (call->op.same_as(builtin::call_extern())) {
-      if (!call->args.empty()) {
-        if (const auto *str_node =
-                call->args[0].as<tvm::tir::StringImmNode>()) {
-          std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
-            return true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
-            return true;
-          }
-        }
-      }
-    }
-
-    return false;
+    return GetCudaPdlCallType(call) != CudaPdlCallType::kNone;
   }

Also applies to: 111-130


138-145: Pass name doesn't reflect dual behavior.

The pass is named "tl.MarkCudaSyncCalls" but it performs two different operations based on the have_pdl parameter: marking calls when true, or eliminating calls when false. Consider renaming to something more generic like "tl.HandleCudaSyncCalls" or "tl.ProcessCudaSyncCalls" to better reflect this dual behavior.

🔎 Proposed rename
-  return CreatePrimFuncPass(pass_func, 0, "tl.MarkCudaSyncCalls", {});
+  return CreatePrimFuncPass(pass_func, 0, "tl.HandleCudaSyncCalls", {});
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb052f4 and 466c94a.

📒 Files selected for processing (1)
  • src/transform/lower_pdl.cc
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc

[error] 147-147: syntax error

(syntaxError)

🔇 Additional comments (1)
src/transform/lower_pdl.cc (1)

147-151: Static analysis syntax error is a false positive.

The Cppcheck syntax error reported at line 147 is a false positive. The TVM_FFI_STATIC_INIT_BLOCK macro is a valid TVM framework construct for registering FFI functions at static initialization time. The implementation is correct.

@w169q169
Author

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine.
The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.

In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there.
If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.

@silentCoder-dev
Collaborator

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine. The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.

In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there. If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.

You can manipulate this in src/runtime/runtime.cc. For example, see the test test_tvm_ffi_l2_persistent_map in testing/python/jit/test_tilelang_jit_tvm_ffi.py, which invokes tvm_cuda_stream_set_access_policy_window.

@w169q169
Author

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine. The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.
In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there. If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.

You can manipulate this in src/runtime/runtime.cc. For example, see the test test_tvm_ffi_l2_persistent_map in testing/python/jit/test_tilelang_jit_tvm_ffi.py, which invokes tvm_cuda_stream_set_access_policy_window.

Thanks for the suggestion. Unfortunately, this approach cannot achieve what we need.

tvm_cuda_stream_set_access_policy_window only modifies the behavior of the CUDA stream, which is fine for L2 persistence, but PDL specifically requires modifying the launchConfig of the kernel. In the tvm_ffi path, cudaLaunchKernel is implemented inside TVM itself, at:

https://github.com/apache/tvm/blob/main/src/runtime/cuda/cuda_module.cc#L184-L207

There is an existing draft PR here: apache/tvm#18604
Once this is merged in TVM, we can continue implementing PDL support in the tvm_ffi backend.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
testing/python/jit/test_tilelang_jit_cython.py (1)

41-43: Consider optimizing output tensor initialization.

Tensors b and c are initialized with torch.randn but are immediately overwritten by the kernel. Consider using torch.empty instead for better performance.

🔎 Proposed optimization
 a = torch.randn(N, dtype=torch.float32).cuda()
-b = torch.randn(N, dtype=torch.float32).cuda()
-c = torch.randn(N, dtype=torch.float32).cuda()
+b = torch.empty(N, dtype=torch.float32).cuda()
+c = torch.empty(N, dtype=torch.float32).cuda()
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 466c94a and 05b8dc0.

📒 Files selected for processing (1)
  • testing/python/jit/test_tilelang_jit_cython.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_cython.py
🧬 Code graph analysis (1)
testing/python/jit/test_tilelang_jit_cython.py (4)
tilelang/language/kernel.py (1)
  • threads (214-218)
tilelang/language/loop.py (1)
  • Parallel (13-33)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
tilelang/utils/tensor.py (1)
  • torch_assert_close (231-319)
🔇 Additional comments (5)
testing/python/jit/test_tilelang_jit_cython.py (5)

1-6: LGTM - Imports are correct.

All necessary imports are present for the PDL test functionality.


8-11: LGTM - Test setup is appropriate.

The test function is properly defined with a reasonable test size for validating basic PDL functionality.


14-35: LGTM - PDL usage and kernel logic are correct.

The PDL primitives are used correctly:

  • T.pdl_trigger() is called after the first kernel completes its computation, signaling completion
  • T.pdl_sync() is called before the second kernel begins processing, ensuring proper synchronization

The kernel logic correctly implements the expected computation pipeline (B = A + 1.0, then C = B * 2.0).


48-55: LGTM - Test execution and verification are correct.

The test properly executes the kernel, computes reference results, and verifies correctness using appropriate tolerances.


58-59: LGTM - Main guard is correctly implemented.

The main guard follows the standard pattern for tilelang test files.

@silentCoder-dev
Collaborator

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine. The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.
In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there. If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.

You can manipulate this in src/runtime/runtime.cc. For example, see the test test_tvm_ffi_l2_persistent_map in testing/python/jit/test_tilelang_jit_tvm_ffi.py, which invokes tvm_cuda_stream_set_access_policy_window.

Thanks for the suggestion. Unfortunately, this approach cannot achieve what we need.

tvm_cuda_stream_set_access_policy_window only modifies the behavior of the CUDA stream, which is fine for L2 persistence, but PDL specifically requires modifying the launchConfig of the kernel. In the tvm_ffi path, cudaLaunchKernel is implemented inside TVM itself, at:

https://github.com/apache/tvm/blob/main/src/runtime/cuda/cuda_module.cc#L184-L207

There is an existing draft PR here: apache/tvm#18604 Once this is merged in TVM, we can continue implementing PDL support in the tvm_ffi backend.

I locally implemented similar functionality with that draft PR :) I believe this implementation is sufficient, and we can proceed by adding has_programmatic_dependent_launch soon after that PR is merged.

@LeiWang1999
Member

Thanks. Since the ctypes execution backend was removed in #1510, we can also remove the related caching mechanisms in this PR :)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/transform/lower_pdl.cc (1)

71-113: Consider refactoring duplicate PDL function name checks.

The CheckCall logic (lines 90-109) duplicates the string matching from MarkCudaSyncCalls::CheckCall (lines 46-62). Both methods check for the same two CUDA PDL function names.

🔎 Suggested refactor to reduce duplication

Extract the function name detection into a shared helper:

namespace {
bool IsCudaPdlFunction(const std::string& func_name) {
  return func_name == "cudaTriggerProgrammaticLaunchCompletion" ||
         func_name == "cudaGridDependencySynchronize";
}
} // namespace

Then simplify both CheckCall methods to use the helper. This centralizes the PDL function names to a single location.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 62ed58f and 46fe83f.

📒 Files selected for processing (2)
  • src/transform/common/attr.h
  • src/transform/lower_pdl.cc
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-15T08:56:23.827Z
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.

Applied to files:

  • src/transform/common/attr.h
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/lower_pdl.cc
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc

[error] 126-126: syntax error

(syntaxError)

🔇 Additional comments (5)
src/transform/common/attr.h (1)

14-18: LGTM! Clean attribute namespace organization.

The new attr namespace properly scopes the PDL-related attribute constants with clear naming and appropriate visibility for use across the codebase.

src/transform/lower_pdl.cc (4)

1-23: LGTM! Proper file structure and includes.

The file header, includes, and namespace setup follow TVM conventions correctly.


24-69: LGTM! Correct PDL call detection and annotation.

The MarkCudaSyncCalls class properly scans for CUDA PDL extern calls and annotates the PrimFunc with the appropriate attributes. The logic correctly identifies cudaTriggerProgrammaticLaunchCompletion and cudaGridDependencySynchronize calls.


117-124: LGTM! Conditional pass logic is correct.

The pass factory correctly dispatches to either mark or eliminate PDL calls based on the have_pdl flag, which depends on CUDA compute capability.


126-130: LGTM! FFI registration follows TVM conventions.

The pass registration is correct. The static analysis syntax error is a false positive—Cppcheck doesn't recognize the TVM_FFI_STATIC_INIT_BLOCK macro.

@w169q169
Author

[screenshot: CI ruff-check failure log]

CI is failing with ruff-check (F821 Undefined name 'autotune') in an example file that is untouched by this PR.
The same code passes locally.
Not sure if this is caused by CI ruff config or environment differences.
Any help would be appreciated.

@silentCoder-dev
Collaborator

CI is failing with ruff-check (F821 Undefined name 'autotune') in an example file that is untouched by this PR. The same code passes locally. Not sure if this is caused by CI ruff config or environment differences. Any help would be appreciated.

You can update your local branch by pulling from tilelang/main.

Collaborator

@silentCoder-dev silentCoder-dev left a comment


When have_pdl is not supported, we should skip the lower_pdl pass. Running it to try and eliminate PDL calls would be incorrect or unnecessary.

@silentCoder-dev silentCoder-dev self-requested a review December 24, 2025 13:10