
Conversation


@w169q169 w169q169 commented Dec 22, 2025

Add PDL support as per Issue #1463

Summary by CodeRabbit

  • New Features

    • Added pdl_trigger and pdl_sync primitives for CUDA kernel synchronization
    • Automatic detection of PDL-capable CUDA targets (compute ≥ 9.0)
    • Per-kernel dispatch path and build-time transform to annotate or remove PDL sync usage
  • Documentation

    • New "Synchronization helpers" subsection in the Instruction Reference documenting the primitives
  • Tests

    • New unit and integration tests exercising PDL trigger/sync across codegen and JIT backends


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Contributor

coderabbitai bot commented Dec 22, 2025

📝 Walkthrough

Walkthrough

Adds CUDA PDL support: two TileLang helpers (pdl_trigger, pdl_sync), a C++ IR pass to mark or eliminate corresponding extern CUDA calls based on target capability, target detection, compilation-pipeline integration, NVRTC/CUDA launch-path wiring, and unit/integration tests.
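For orientation, here is a minimal usage sketch of the two helpers. The API shape is assumed from the summary above; the kernel structure, names, shapes, and block sizes are illustrative, not code taken from this PR.

# A producer kernel signals early completion with T.pdl_trigger(); a dependent
# kernel waits on the grid dependency with T.pdl_sync() before consuming its
# output. The PDL path is only kept for targets with SM >= 9.0.
import tilelang.language as T

N, block_N = 1024, 128

@T.prim_func
def pdl_pipeline(A: T.Tensor((N,), "float32"),
                 B: T.Tensor((N,), "float32"),
                 C: T.Tensor((N,), "float32")):
    # First kernel: B = A + 1.0, then signal programmatic launch completion.
    with T.Kernel(T.ceildiv(N, block_N), threads=block_N) as bx:
        for i in T.Parallel(block_N):
            B[bx * block_N + i] = A[bx * block_N + i] + 1.0
        T.pdl_trigger()
    # Second kernel: wait for the producer grid before consuming B.
    with T.Kernel(T.ceildiv(N, block_N), threads=block_N) as bx:
        T.pdl_sync()
        for i in T.Parallel(block_N):
            C[bx * block_N + i] = B[bx * block_N + i] * 2.0

# kernel = tilelang.compile(pdl_pipeline, target="cuda")  # hypothetical driver call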

Changes

Cohort / File(s) and Summary:

  • Docs (docs/programming_guides/instructions.md): Documented new Synchronization helpers: pdl_trigger() and pdl_sync().
  • Language API (tilelang/language/pdl.py, tilelang/language/__init__.py): New helpers pdl_trigger() / pdl_sync() that emit extern CUDA calls and are re-exported from the package (see the sketch after this list).
  • C++ transform pass (src/transform/lower_pdl.cc, src/transform/common/attr.h): Added MarkCudaSyncCalls and EliminateCudaSyncCalls mutators; new pass factory MarkCudaSyncCallsPass(bool have_pdl); new attribute names has_cuda_pdl_trigger / has_cuda_pdl_sync.
  • Transform binding (tilelang/transform/__init__.py): Exposed a MarkCudaSyncCalls(have_pdl: bool = False) Python wrapper for the C++ pass.
  • Target detection (tilelang/contrib/nvcc.py): Added have_pdl(target) to detect SM >= 9.x for enabling PDL paths.
  • Engine pipeline (tilelang/engine/phase.py): Conditionally injects the MarkCudaSyncCalls pass into OptimizeForTarget when have_pdl(target) is true.
  • CUDA host wrapper (tilelang/jit/adapter/wrapper.py): Added pdl_sync_map tracking, changed host function iteration to use name hints, and routes marked functions to a programmatic stream-serialization launch path.
  • NVRTC adapter (tilelang/jit/adapter/nvrtc/wrapper.py): Added a PDL_SYNC_PY snippet and generate_pdl_sync_code(function_name) to emit per-kernel PDL launch attribute code; injects the snippet into generated launch code.
  • JIT tests (testing/python/jit/test_tilelang_jit_nvrtc.py, testing/python/jit/test_tilelang_jit_cython.py): Added integration tests exercising kernels using pdl_trigger/pdl_sync and validating runtime behavior.
  • Language tests (testing/python/language/test_tilelang_language_pdl.py): Added tests verifying codegen includes/omits CUDA PDL extern symbols across SM targets.
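As referenced in the Language API row above, a hedged sketch of what tilelang/language/pdl.py plausibly contains. Only the extern symbol names come from this PR; the return dtype, import path, and docstrings are assumptions for illustration.

# Helpers that emit extern CUDA calls, per the summary above. The "handle"
# return dtype and the import path are assumptions, not the exact file content.
from tilelang.language.tir.op import call_extern

__all__ = ["pdl_sync", "pdl_trigger"]


def pdl_trigger():
    """Signal that dependent kernels may begin (device-side
    cudaTriggerProgrammaticLaunchCompletion)."""
    return call_extern("handle", "cudaTriggerProgrammaticLaunchCompletion")


def pdl_sync():
    """Wait for the grid this kernel depends on (device-side
    cudaGridDependencySynchronize)."""
    return call_extern("handle", "cudaGridDependencySynchronize")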

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant User as User Code
    participant Lang as TileLang\n(pdl helpers)
    participant IR as TVM IR\n(call_extern / PrimFunc)
    participant Transform as MarkCudaSyncCallsPass
    participant Engine as OptimizeForTarget
    participant Wrapper as CUDA Host Wrapper
    participant NVRTC as NVRTC Adapter
    participant CUDA as CUDA Runtime

    User->>Lang: call pdl_trigger()/pdl_sync()
    Lang->>IR: emit tir.call_extern("cudaTriggerProgrammaticLaunchCompletion" / "cudaGridDependencySynchronize")
    Engine->>Transform: Apply MarkCudaSyncCallsPass(have_pdl)
    alt have_pdl = true
        Transform->>IR: mark PrimFunc attrs\nhas_cuda_pdl_trigger / has_cuda_pdl_sync
    else have_pdl = false
        Transform->>IR: replace extern calls with 0
    end
    IR->>Wrapper: parse module, read function attrs
    Wrapper->>Wrapper: populate pdl_sync_map for marked functions
    Wrapper->>NVRTC: select launch path (PDL vs standard)
    NVRTC->>NVRTC: inject PDL_SYNC_PY if needed
    NVRTC->>CUDA: launch kernels (with/without programmatic serialization)
    CUDA->>CUDA: execute kernels

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • [Feature Request] PDL Support #1463 – Matches the PDL feature request (T.pdl_trigger / T.pdl_sync and PDL initialization); this PR implements those helpers, IR handling, and runtime wiring.

Possibly related PRs

Suggested reviewers

  • LeiWang1999

"I hop and code beneath the moon,
I signal kernels, then they tune,
A trigger here, a sync there too,
Streams march orderly — that's my cue 🐇✨"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 28.95%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title '[Feat] PDL Support' clearly identifies the main feature being added (PDL support) and aligns with the comprehensive changeset covering PDL integration across language, transformation, JIT, and testing layers.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (8)
tilelang/jit/adapter/libgen.py (1)

151-153: Consider truncating lib_code in error message.

Including the full source code in the exception can produce extremely long error messages, especially for large kernels. Consider truncating or omitting self.lib_code from the error, or logging it separately at debug level.

🔎 Proposed improvement
         if ret.returncode != 0:
             command_str = " ".join(command)
-            raise RuntimeError(f"Compilation Failed! {command_str}\n {self.lib_code}")
+            raise RuntimeError(f"Compilation Failed! Command: {command_str}\nSource saved at: {src.name}")
testing/python/jit/test_tilelang_jit_nvrtc.py (1)

505-552: Consider adding a compute capability check for PDL support.

Based on tilelang/contrib/nvcc.py (lines 589-594), PDL requires compute capability >= 9.0 (Hopper or later). Other tests in this file, like test_nvrtc_im2col_tma_desc, skip on non-Hopper GPUs. Consider adding a similar check here to avoid test failures on older hardware.

🔎 Proposed fix
 def test_nvrtc_pdl():
     """Test pdl."""
+    if not check_hopper():
+        import pytest
+
+        pytest.skip("Test requires Hopper GPU (compute capability 9.0) for PDL support")
 
     N = 64
tilelang/jit/adapter/wrapper.py (1)

213-213: Consider using set instead of dict for pdl_sync_map.

The pdl_sync_map dictionary always stores 0 as the value (line 450). If the value is unused, a set would be more appropriate and clearer in intent.

🔎 Proposed change
-        self.pdl_sync_map: dict[str, int] | None = {}
+        self.pdl_sync_map: set[str] | None = set()

And update usages:

-            if "has_cuda_pdl_sync" in attrs:
-                self.pdl_sync_map[function_name] = 0
+            if "has_cuda_pdl_sync" in attrs:
+                self.pdl_sync_map.add(function_name)
tilelang/transform/__init__.py (1)

403-406: LGTM with suggestion: Consider expanding the docstring.

The function follows the established pattern for pass wrappers. Consider adding a more detailed docstring similar to other functions in this file, documenting the have_pdl parameter and return type.

🔎 Suggested docstring enhancement
 def MarkCudaSyncCalls(have_pdl: bool = False):
-    """MarkCudaSyncCalls"""
+    """Mark CUDA synchronization calls for PDL support.
+
+    Parameters
+    ----------
+    have_pdl : bool
+        Whether the target supports PDL (Programmatic Device Launch).
+        PDL is available on compute capability >= 9.0 (Hopper+).
+
+    Returns
+    -------
+    fpass : tvm.transform.Pass
+        The result pass
+    """
     return _ffi_api.MarkCudaSyncCalls(have_pdl)  # type: ignore
tilelang/contrib/nvcc.py (1)

590-596: Minor: Prefix unused variable with underscore.

The minor variable from parse_compute_version is unpacked but never used. This is a valid static analysis hint.

Proposed fix
 def have_pdl(target):
     if target.kind.name != "cuda":
         return False
     compute_version = get_target_compute_version(target)
-    major, minor = parse_compute_version(compute_version)
+    major, _ = parse_compute_version(compute_version)
     return major >= 9
testing/python/jit/test_tilelang_jit_ctypes.py (1)

1-1: Remove unused import.

The tvm import is not used in this test file.

Proposed fix
-from tilelang import tvm as tvm
 import tilelang.language as T
src/transform/lower_pdl.cc (2)

83-83: Typo in class name: ElininateCudaSyncCallsEliminateCudaSyncCalls.

The class name has a typo ("Elininate" instead of "Eliminate"). While this doesn't affect functionality, fixing it improves code readability and maintainability.

Proposed fix
-class ElininateCudaSyncCalls : public StmtExprMutator {
+class EliminateCudaSyncCalls : public StmtExprMutator {
 public:
   static PrimFunc Substitute(PrimFunc f) {
-    ElininateCudaSyncCalls mutator;
+    EliminateCudaSyncCalls mutator;
     PrimFunc new_f = f;
     new_f.CopyOnWrite()->body = mutator.VisitStmt(f->body);
 
     return new_f;
   }
   // ... rest of class ...
 private:
-  ElininateCudaSyncCalls() = default;
+  EliminateCudaSyncCalls() = default;
 };

Also update line 141:

   return have_pdl ? MarkCudaSyncCalls::Substitute(f)
-                  : ElininateCudaSyncCalls::Substitute(f);
+                  : EliminateCudaSyncCalls::Substitute(f);

57-78: Duplicate private: access specifier.

There are two private: access specifiers (lines 57 and 76). Consider consolidating them for cleaner code organization.

Proposed fix
-private:
   void CheckCall(const tir::CallNode *call) {
     if (!call)
       return;
     // ... implementation ...
   }

-private:
   bool has_trigger_launch_ = false;
   bool has_grid_sync_ = false;

   MarkCudaSyncCalls() = default;
+
+private:
+  void CheckCall(const tir::CallNode *call) {
+    if (!call)
+      return;
+    // ... implementation ...
+  }
+
+  bool has_trigger_launch_ = false;
+  bool has_grid_sync_ = false;
+
+  MarkCudaSyncCalls() = default;
 };

Note: The Cppcheck "syntax error" on line 147 is a false positive — TVM_FFI_STATIC_INIT_BLOCK is a valid TVM macro for FFI registration.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a431797 and 2a67387.

📒 Files selected for processing (13)
  • docs/programming_guides/instructions.md
  • src/transform/lower_pdl.cc
  • testing/python/jit/test_tilelang_jit_ctypes.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/language/test_tilelang_language_pdl.py
  • tilelang/contrib/nvcc.py
  • tilelang/engine/phase.py
  • tilelang/jit/adapter/libgen.py
  • tilelang/jit/adapter/nvrtc/wrapper.py
  • tilelang/jit/adapter/wrapper.py
  • tilelang/language/__init__.py
  • tilelang/language/pdl.py
  • tilelang/transform/__init__.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/jit/test_tilelang_jit_ctypes.py
  • tilelang/jit/adapter/nvrtc/wrapper.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_pdl.py
🧬 Code graph analysis (10)
tilelang/transform/__init__.py (2)
src/transform/lower_pdl.cc (1)
  • MarkCudaSyncCalls (80-80)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
tilelang/language/__init__.py (1)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
tilelang/language/pdl.py (1)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/engine/phase.py (3)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
tilelang/transform/__init__.py (1)
  • MarkCudaSyncCalls (403-405)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/language/test_tilelang_language_pdl.py (1)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
testing/python/jit/test_tilelang_jit_ctypes.py (3)
testing/python/jit/test_tilelang_jit_nvrtc.py (7)
  • test_nvrtc_pdl (505-552)
  • multi_kernels_with_pdl (511-532)
  • main (30-50)
  • main (138-158)
  • main (384-413)
  • main (513-530)
  • kernel (469-484)
testing/python/language/test_tilelang_language_pdl.py (2)
  • main (7-16)
  • main (23-32)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
tilelang/contrib/nvcc.py (1)
tilelang/contrib/rocm.py (1)
  • parse_compute_version (179-201)
tilelang/jit/adapter/wrapper.py (1)
tilelang/jit/adapter/utils.py (1)
  • parse_function_call_args (315-357)
tilelang/jit/adapter/libgen.py (2)
tilelang/jit/adapter/ctypes/adapter.py (1)
  • lib_code (287-289)
tilelang/jit/adapter/cython/adapter.py (1)
  • lib_code (371-373)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc

[error] 147-147: syntax error

(syntaxError)

🪛 Ruff (0.14.8)
tilelang/language/__init__.py

120-120: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


121-121: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

tilelang/language/pdl.py

4-7: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

tilelang/contrib/nvcc.py

594-594: Unpacked variable minor is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tilelang/jit/adapter/libgen.py

153-153: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (8)
tilelang/jit/adapter/wrapper.py (3)

141-155: LGTM: PDL kernel launch code structure looks correct.

The KERNEL_LAUNCH_FUNC_CODE correctly configures cudaLaunchConfig_t with cudaLaunchAttributeProgrammaticStreamSerialization and uses cudaLaunchKernelEx for the kernel launch. The attribute setup follows CUDA's programmatic stream serialization pattern.
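For reference, the general CUDA pattern such a launch path follows is sketched below. This is an illustration of the documented CUDA API usage, not the generated wrapper code; the kernel, launch geometry, and stream are placeholders.

// Opt a launch into programmatic stream serialization and launch with
// cudaLaunchKernelEx (requires CUDA 11.8+; takes effect on SM >= 9.0).
#include <cuda_runtime.h>

__global__ void consumer_kernel(const float *in, float *out, int n);

void launch_with_pdl(const float *in, float *out, int n, cudaStream_t stream) {
  cudaLaunchConfig_t config = {};
  config.gridDim = dim3(64);
  config.blockDim = dim3(128);
  config.dynamicSmemBytes = 0;
  config.stream = stream;

  cudaLaunchAttribute attr = {};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;
  config.attrs = &attr;
  config.numAttrs = 1;

  // The consumer kernel can then call cudaGridDependencySynchronize() on the
  // device before reading data produced by the preceding kernel on the stream.
  cudaLaunchKernelEx(&config, consumer_kernel, in, out, n);
}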


313-330: LGTM: PDL branch in dispatch function is correctly structured.

The new PDL path properly:

  1. Parses function call arguments
  2. Validates argument count matches parameters
  3. Uses KERNEL_LAUNCH_FUNC_CODE for kernel launch with PDL attributes
  4. Includes error checking after launch

448-450: LGTM: PDL attribute detection correctly populates the sync map.

The has_cuda_pdl_sync attribute is checked and the function is registered in pdl_sync_map for later use during code generation.

tilelang/language/__init__.py (1)

119-123: LGTM!

The PDL function re-exports follow the established pattern in this file. The # noqa: F401 comments are appropriate for suppressing "imported but unused" warnings in __init__.py files intended to expose public API.

docs/programming_guides/instructions.md (1)

141-144: LGTM!

Documentation for the new PDL synchronization helpers is clear and follows the established format of the instruction reference.

tilelang/engine/phase.py (1)

255-257: LGTM!

The MarkCudaSyncCalls pass is correctly integrated after SplitHostDevice. The unconditional application is appropriate since the pass internally handles both cases: marking sync calls when PDL is available (sm_90+) or eliminating them for older architectures.
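A hedged sketch of what that integration amounts to; the helper name and surrounding pipeline code are illustrative, not the actual phase.py.

# Inject the pass parameterized by the target's PDL capability: on SM >= 9.0 it
# marks the extern calls, otherwise it removes them so older targets still compile.
from tilelang.contrib.nvcc import have_pdl
import tilelang.transform as tl_transform


def pdl_passes(target):
    return [tl_transform.MarkCudaSyncCalls(have_pdl=have_pdl(target))]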

testing/python/language/test_tilelang_language_pdl.py (1)

37-61: Test implementation looks good.

The tests correctly verify that PDL synchronization calls are present in generated code for sm_90 (Hopper+) and absent for sm_75 (pre-Hopper). This validates the MarkCudaSyncCalls pass behavior.

src/transform/lower_pdl.cc (1)

29-43: LGTM on the MarkCudaSyncCalls implementation.

The mutator correctly scans for CUDA sync extern calls and annotates the PrimFunc with has_cuda_pdl_trigger and has_cuda_pdl_sync attributes. The pattern of visiting both statements and expressions ensures comprehensive detection.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
src/transform/lower_pdl.cc (3)

9-9: Remove duplicate include.

The header tvm/tir/builtin.h is included twice (line 9 and line 14).

🔎 Proposed fix
 #include "../op/builtin.h"
 #include "../target/utils.h"
 #include "tvm/ir/type.h"
 #include "tvm/tir/builtin.h"
 #include "tvm/tir/expr.h"
 #include "tvm/tir/stmt.h"
 #include <tvm/ffi/reflection/registry.h>
 #include <tvm/tir/analysis.h>
-#include <tvm/tir/builtin.h>
 #include <tvm/tir/stmt_functor.h>
 #include <tvm/tir/transform.h>

Also applies to: 14-14


58-74: Extract duplicated logic into a helper function.

The CheckCall methods in both MarkCudaSyncCalls (lines 58-74) and EliminateCudaSyncCalls (lines 111-130) contain nearly identical logic for detecting PDL sync function names. Consider extracting this into a common helper function to improve maintainability.

🔎 Proposed refactor

Add a helper function in an anonymous namespace:

namespace {
// Returns true if the call is a CUDA PDL sync call
bool IsPDLSyncCall(const tir::CallNode *call) {
  if (!call)
    return false;
  
  if (call->op.same_as(builtin::call_extern())) {
    if (!call->args.empty()) {
      if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
        std::string func_name = str_node->value;
        return func_name == "cudaTriggerProgrammaticLaunchCompletion" ||
               func_name == "cudaGridDependencySynchronize";
      }
    }
  }
  return false;
}

// Returns the specific PDL sync type (0=none, 1=trigger, 2=sync)
int GetPDLSyncType(const tir::CallNode *call) {
  if (!call)
    return 0;
  
  if (call->op.same_as(builtin::call_extern())) {
    if (!call->args.empty()) {
      if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
        std::string func_name = str_node->value;
        if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
          return 1;
        } else if (func_name == "cudaGridDependencySynchronize") {
          return 2;
        }
      }
    }
  }
  return 0;
}
} // anonymous namespace

Then update both classes to use these helpers.

Also applies to: 111-130


138-145: Consider more descriptive pass name.

The pass name "tl.MarkCudaSyncCalls" only mentions "Mark" behavior, but the pass can also eliminate calls when have_pdl is false. Consider a name like "tl.ProcessCudaSyncCalls" or document the dual behavior clearly.

tilelang/jit/adapter/nvrtc/wrapper.py (1)

479-486: Docstring enhancement suggestion (optional):

The docstring is functional but could be improved to match the detail level of generate_l2_persistent_map. Consider documenting what PDL synchronization is, the expected structure of pdl_sync_map, and when/why this returns non-empty code:

    def generate_pdl_sync_code(self, function_name: str) -> str:
-        """
-        Generate Python code to insert PDL synchronization for a given kernel.
-        """
+        """Generate Python code to configure PDL synchronization for a kernel.
+
+        PDL (Programmatic Device Launch) enables programmatic stream serialization
+        for fine-grained kernel synchronization on compatible CUDA architectures.
+
+        Args:
+            function_name: Kernel name to check for PDL sync config
+
+        Returns:
+            Python code that sets programmatic stream serialization attributes,
+            or empty string if PDL sync is not configured for this kernel.
+        """
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f50aa99 and cb052f4.

📒 Files selected for processing (2)
  • src/transform/lower_pdl.cc
  • tilelang/jit/adapter/nvrtc/wrapper.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc

[error] 147-147: syntax error

(syntaxError)

🔇 Additional comments (5)
src/transform/lower_pdl.cc (2)

29-81: LGTM!

The MarkCudaSyncCalls class correctly implements the TVM StmtExprMutator pattern to detect PDL synchronization calls and mark the function with appropriate attributes. The logic is defensive and properly chains to the base visitor methods.


147-151: LGTM! Static analysis false positive can be ignored.

The FFI registration correctly exposes the pass through TVM's reflection mechanism. The cppcheck syntax error is a false positive—cppcheck doesn't recognize the TVM_FFI_STATIC_INIT_BLOCK macro, which is a valid TVM construct.
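For readers unfamiliar with the macro, a hedged sketch of what such a registration block typically looks like in current TVM; the registered global name and pass symbol here are assumptions, not the exact lines from lower_pdl.cc.

// Register the pass factory with TVM's FFI reflection registry at static
// initialization time; Cppcheck does not expand this macro, hence the
// false-positive syntax error.
TVM_FFI_STATIC_INIT_BLOCK({
  namespace refl = tvm::ffi::reflection;
  refl::GlobalDef().def("tl.transform.MarkCudaSyncCalls", MarkCudaSyncCallsPass);
});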

tilelang/jit/adapter/nvrtc/wrapper.py (3)

172-186: LGTM! Past config overwriting issue has been resolved.

The PDL synchronization code correctly assumes the config object already exists and modifies it in place, rather than creating a new instance. The placement at line 207 (via placeholder {11}) ensures this code executes after config = CUlaunchConfig() is created at line 198.
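For context, a hedged Python sketch of the kind of mutation the injected snippet performs on that existing config object; the cuda-python enum and field spellings are assumptions.

# Request programmatic stream serialization on an already-created CUlaunchConfig
# instead of replacing it, matching the note above.
from cuda import cuda


def apply_pdl_attrs(config):
    attr = cuda.CUlaunchAttribute()
    attr.id = cuda.CUlaunchAttributeID.CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION
    attr.value.programmaticStreamSerializationAllowed = 1
    config.attrs = [attr]
    config.numAttrs = 1
    return config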


207-207: LGTM! PDL sync placeholder correctly positioned.

The placeholder {11} is correctly placed after config creation and before kernel launch, ensuring PDL attributes can be applied to the existing config object.


423-424: LGTM! PDL sync code correctly integrated.

The per-kernel PDL synchronization code is generated and passed as the 11th format argument, correctly matching the {11} placeholder in KERNEL_LAUNCH_FUNC_PY. The pattern follows the established approach used for L2 persistent map generation.

Also applies to: 438-438

@silentCoder-dev
Collaborator

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)

58-74: Consider extracting duplicated CheckCall logic.

The logic for detecting CUDA PDL function names is duplicated between MarkCudaSyncCalls::CheckCall and EliminateCudaSyncCalls::CheckCall. Consider extracting this into a shared helper function to improve maintainability.

🔎 Proposed refactor to extract common helper

Add a helper function in the anonymous namespace:

+namespace {
+
+enum class CudaPdlCallType { kNone, kTriggerLaunch, kGridSync };
+
+CudaPdlCallType GetCudaPdlCallType(const tir::CallNode *call) {
+  if (!call || !call->op.same_as(builtin::call_extern()) || call->args.empty()) {
+    return CudaPdlCallType::kNone;
+  }
+  
+  if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
+    std::string func_name = str_node->value;
+    if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
+      return CudaPdlCallType::kTriggerLaunch;
+    } else if (func_name == "cudaGridDependencySynchronize") {
+      return CudaPdlCallType::kGridSync;
+    }
+  }
+  return CudaPdlCallType::kNone;
+}
+
+} // namespace
+
 class MarkCudaSyncCalls : public StmtExprMutator {

Then simplify both CheckCall methods:

 private:
   void CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return;
-    if (call->op.same_as(builtin::call_extern())) {
-      if (!call->args.empty()) {
-        if (const auto *str_node =
-                call->args[0].as<tvm::tir::StringImmNode>()) {
-          std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
-            has_trigger_launch_ = true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
-            has_grid_sync_ = true;
-          }
-        }
-      }
+    auto call_type = GetCudaPdlCallType(call);
+    if (call_type == CudaPdlCallType::kTriggerLaunch) {
+      has_trigger_launch_ = true;
+    } else if (call_type == CudaPdlCallType::kGridSync) {
+      has_grid_sync_ = true;
     }
   }

And in EliminateCudaSyncCalls:

 private:
   bool CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return false;
-
-    if (call->op.same_as(builtin::call_extern())) {
-      if (!call->args.empty()) {
-        if (const auto *str_node =
-                call->args[0].as<tvm::tir::StringImmNode>()) {
-          std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
-            return true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
-            return true;
-          }
-        }
-      }
-    }
-
-    return false;
+    return GetCudaPdlCallType(call) != CudaPdlCallType::kNone;
   }

Also applies to: 111-130


138-145: Pass name doesn't reflect dual behavior.

The pass is named "tl.MarkCudaSyncCalls" but it performs two different operations based on the have_pdl parameter: marking calls when true, or eliminating calls when false. Consider renaming to something more generic like "tl.HandleCudaSyncCalls" or "tl.ProcessCudaSyncCalls" to better reflect this dual behavior.

🔎 Proposed rename
-  return CreatePrimFuncPass(pass_func, 0, "tl.MarkCudaSyncCalls", {});
+  return CreatePrimFuncPass(pass_func, 0, "tl.HandleCudaSyncCalls", {});
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb052f4 and 466c94a.

📒 Files selected for processing (1)
  • src/transform/lower_pdl.cc
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc

[error] 147-147: syntax error

(syntaxError)

🔇 Additional comments (1)
src/transform/lower_pdl.cc (1)

147-151: Static analysis syntax error is a false positive.

The Cppcheck syntax error reported at line 147 is a false positive. The TVM_FFI_STATIC_INIT_BLOCK macro is a valid TVM framework construct for registering FFI functions at static initialization time. The implementation is correct.

@w169q169
Author

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine.
The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.

In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there.
If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.

@silentCoder-dev
Collaborator

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine. The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.

In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there. If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.

You can manipulate this in src/runtime/runtime.cc. For example, see the test test_tvm_ffi_l2_persistent_map in testing/python/jit/test_tilelang_jit_tvm_ffi.py, which invokes tvm_cuda_stream_set_access_policy_window.

@w169q169
Author

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine. The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.
In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there. If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.

You can manipulate this in src/runtime/runtime.cc. For example, see the test test_tvm_ffi_l2_persistent_map in testing/python/jit/test_tilelang_jit_tvm_ffi.py, which invokes tvm_cuda_stream_set_access_policy_window.

Thanks for the suggestion. Unfortunately, this approach cannot achieve what we need.

tvm_cuda_stream_set_access_policy_window only modifies the behavior of the CUDA stream, which is fine for L2 persistence, but PDL specifically requires modifying the launchConfig of the kernel. In the tvm_ffi path, cudaLaunchKernel is implemented inside TVM itself, at:

https://github.com/apache/tvm/blob/main/src/runtime/cuda/cuda_module.cc#L184-L207

There is an existing draft PR here: apache/tvm#18604
Once this is merged in TVM, we can continue implementing PDL support in the tvm_ffi backend.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
testing/python/jit/test_tilelang_jit_cython.py (1)

41-43: Consider optimizing output tensor initialization.

Tensors b and c are initialized with torch.randn but are immediately overwritten by the kernel. Consider using torch.empty instead for better performance.

🔎 Proposed optimization
 a = torch.randn(N, dtype=torch.float32).cuda()
-b = torch.randn(N, dtype=torch.float32).cuda()
-c = torch.randn(N, dtype=torch.float32).cuda()
+b = torch.empty(N, dtype=torch.float32).cuda()
+c = torch.empty(N, dtype=torch.float32).cuda()
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 466c94a and 05b8dc0.

📒 Files selected for processing (1)
  • testing/python/jit/test_tilelang_jit_cython.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_cython.py
🧬 Code graph analysis (1)
testing/python/jit/test_tilelang_jit_cython.py (4)
tilelang/language/kernel.py (1)
  • threads (214-218)
tilelang/language/loop.py (1)
  • Parallel (13-33)
tilelang/language/pdl.py (2)
  • pdl_trigger (10-14)
  • pdl_sync (17-21)
tilelang/utils/tensor.py (1)
  • torch_assert_close (231-319)
🔇 Additional comments (5)
testing/python/jit/test_tilelang_jit_cython.py (5)

1-6: LGTM - Imports are correct.

All necessary imports are present for the PDL test functionality.


8-11: LGTM - Test setup is appropriate.

The test function is properly defined with a reasonable test size for validating basic PDL functionality.


14-35: LGTM - PDL usage and kernel logic are correct.

The PDL primitives are used correctly:

  • T.pdl_trigger() is called after the first kernel completes its computation, signaling completion
  • T.pdl_sync() is called before the second kernel begins processing, ensuring proper synchronization

The kernel logic correctly implements the expected computation pipeline (B = A + 1.0, then C = B * 2.0).


48-55: LGTM - Test execution and verification are correct.

The test properly executes the kernel, computes reference results, and verifies correctness using appropriate tolerances.


58-59: LGTM - Main guard is correctly implemented.

The main guard follows the standard pattern for tilelang test files.

@silentCoder-dev
Collaborator

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.

Thanks for the clarification. From my understanding, the Cython backend itself is fine. The difficulty we’re running into is that PDL requires modifying the launchConfig for the secondary kernel, and I’m not sure how this can be done cleanly when going through tvm_ffi.
In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there. If there is a supported or recommended way to modify the cudaLaunchKernelEx config in the tvm_ffi flow, that would be ideal. Otherwise, this seems to be the main blocker for enabling PDL in this backend.

You can manipulate this in src/runtime/runtime.cc. For example, see the test test_tvm_ffi_l2_persistent_map in testing/python/jit/test_tilelang_jit_tvm_ffi.py, which invokes tvm_cuda_stream_set_access_policy_window.

Thanks for the suggestion. Unfortunately, this approach cannot achieve what we need.

tvm_cuda_stream_set_access_policy_window only modifies the behavior of the CUDA stream, which is fine for L2 persistence, but PDL specifically requires modifying the launchConfig of the kernel. In the tvm_ffi path, cudaLaunchKernel is implemented inside TVM itself, at:

https://github.com/apache/tvm/blob/main/src/runtime/cuda/cuda_module.cc#L184-L207

There is an existing draft PR here: apache/tvm#18604 Once this is merged in TVM, we can continue implementing PDL support in the tvm_ffi backend.

I locally implemented similar functionality with that draft PR :) I believe this implementation is sufficient, and we can proceed by adding has_programmatic_dependent_launch soon after that PR is merged.

@LeiWang1999
Member

Thanks. Since the ctypes execution backend was removed in #1510, we can also remove the related caching mechanisms in this PR :)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/transform/lower_pdl.cc (1)

71-113: Consider refactoring duplicate PDL function name checks.

The CheckCall logic (lines 90-109) duplicates the string matching from MarkCudaSyncCalls::CheckCall (lines 46-62). Both methods check for the same two CUDA PDL function names.

🔎 Suggested refactor to reduce duplication

Extract the function name detection into a shared helper:

namespace {
bool IsCudaPdlFunction(const std::string& func_name) {
  return func_name == "cudaTriggerProgrammaticLaunchCompletion" ||
         func_name == "cudaGridDependencySynchronize";
}
} // namespace

Then simplify both CheckCall methods to use the helper. This centralizes the PDL function names to a single location.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 62ed58f and 46fe83f.

📒 Files selected for processing (2)
  • src/transform/common/attr.h
  • src/transform/lower_pdl.cc
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-15T08:56:23.827Z
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.

Applied to files:

  • src/transform/common/attr.h
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.

Applied to files:

  • src/transform/lower_pdl.cc
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
  • call_extern (173-195)
tilelang/contrib/nvcc.py (1)
  • have_pdl (590-595)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc

[error] 126-126: syntax error

(syntaxError)

🔇 Additional comments (5)
src/transform/common/attr.h (1)

14-18: LGTM! Clean attribute namespace organization.

The new attr namespace properly scopes the PDL-related attribute constants with clear naming and appropriate visibility for use across the codebase.

src/transform/lower_pdl.cc (4)

1-23: LGTM! Proper file structure and includes.

The file header, includes, and namespace setup follow TVM conventions correctly.


24-69: LGTM! Correct PDL call detection and annotation.

The MarkCudaSyncCalls class properly scans for CUDA PDL extern calls and annotates the PrimFunc with the appropriate attributes. The logic correctly identifies cudaTriggerProgrammaticLaunchCompletion and cudaGridDependencySynchronize calls.


117-124: LGTM! Conditional pass logic is correct.

The pass factory correctly dispatches to either mark or eliminate PDL calls based on the have_pdl flag, which depends on CUDA compute capability.


126-130: LGTM! FFI registration follows TVM conventions.

The pass registration is correct. The static analysis syntax error is a false positive—Cppcheck doesn't recognize the TVM_FFI_STATIC_INIT_BLOCK macro.

@w169q169
Author

[screenshot: CI ruff-check failure log]

CI is failing with ruff-check (F821 Undefined name 'autotune') in an example file that is untouched by this PR.
The same code passes locally.
Not sure if this is caused by CI ruff config or environment differences.
Any help would be appreciated.

@silentCoder-dev
Collaborator

CI is failing with ruff-check (F821 Undefined name 'autotune') in an example file that is untouched by this PR. The same code passes locally. Not sure if this is caused by CI ruff config or environment differences. Any help would be appreciated.

You can update your local branch by pulling from tilelang/main.

Collaborator

@silentCoder-dev silentCoder-dev left a comment


When have_pdl is not supported, we should skip the lower_pdl pass. Running it to try and eliminate PDL calls would be incorrect or unnecessary.

@silentCoder-dev silentCoder-dev self-requested a review December 24, 2025 13:10