[Feat] PDL Support #1494
Conversation
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the project's format checks before requesting review. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
📝 Walkthrough
Adds CUDA PDL (Programmatic Dependent Launch) support: two TileLang helpers (pdl_trigger, pdl_sync), a C++ IR pass that marks or eliminates the corresponding extern CUDA calls based on target capability, target detection, compilation-pipeline integration, NVRTC/CUDA launch-path wiring, and unit/integration tests.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant User as User Code
participant Lang as TileLang\n(pdl helpers)
participant IR as TVM IR\n(call_extern / PrimFunc)
participant Transform as MarkCudaSyncCallsPass
participant Engine as OptimizeForTarget
participant Wrapper as CUDA Host Wrapper
participant NVRTC as NVRTC Adapter
participant CUDA as CUDA Runtime
User->>Lang: call pdl_trigger()/pdl_sync()
Lang->>IR: emit tir.call_extern("cudaTriggerProgrammaticLaunchCompletion" / "cudaGridDependencySynchronize")
Engine->>Transform: Apply MarkCudaSyncCallsPass(have_pdl)
alt have_pdl = true
Transform->>IR: mark PrimFunc attrs\nhas_cuda_pdl_trigger / has_cuda_pdl_sync
else have_pdl = false
Transform->>IR: replace extern calls with 0
end
IR->>Wrapper: parse module, read function attrs
Wrapper->>Wrapper: populate pdl_sync_map for marked functions
Wrapper->>NVRTC: select launch path (PDL vs standard)
NVRTC->>NVRTC: inject PDL_SYNC_PY if needed
NVRTC->>CUDA: launch kernels (with/without programmatic serialization)
CUDA->>CUDA: execute kernels
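To make the walkthrough and diagram concrete, here is a minimal device-side sketch of what the two helpers amount to after lowering on a PDL-capable (sm_90+) target. This is illustrative only; the producer/consumer kernel names and the element-wise math are placeholders, not code from this PR.

__global__ void producer(const float *A, float *B, int N) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N) B[i] = A[i] + 1.0f;
  // T.pdl_trigger() emits this extern call: signal that dependent grids
  // launched with programmatic dependence may begin.
  cudaTriggerProgrammaticLaunchCompletion();
}

__global__ void consumer(const float *B, float *C, int N) {
  // T.pdl_sync() emits this extern call: wait until the producer grid has
  // signalled (or finished) before consuming its results.
  cudaGridDependencySynchronize();
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N) C[i] = B[i] * 2.0f;
}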
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
🧹 Nitpick comments (8)
tilelang/jit/adapter/libgen.py (1)
151-153: Consider truncating lib_code in the error message.
Including the full source code in the exception can produce extremely long error messages, especially for large kernels. Consider truncating or omitting self.lib_code from the error, or logging it separately at debug level.
🔎 Proposed improvement
 if ret.returncode != 0:
     command_str = " ".join(command)
-    raise RuntimeError(f"Compilation Failed! {command_str}\n {self.lib_code}")
+    raise RuntimeError(f"Compilation Failed! Command: {command_str}\nSource saved at: {src.name}")
testing/python/jit/test_tilelang_jit_nvrtc.py (1)
505-552: Consider adding a compute capability check for PDL support.
Based on tilelang/contrib/nvcc.py (lines 589-594), PDL requires compute capability >= 9.0 (Hopper or later). Other tests in this file, like test_nvrtc_im2col_tma_desc, skip on non-Hopper GPUs. Consider adding a similar check here to avoid test failures on older hardware.
🔎 Proposed fix
 def test_nvrtc_pdl():
     """Test pdl."""
+    if not check_hopper():
+        import pytest
+
+        pytest.skip("Test requires Hopper GPU (compute capability 9.0) for PDL support")
     N = 64
tilelang/jit/adapter/wrapper.py (1)
213-213: Consider using set instead of dict for pdl_sync_map.
The pdl_sync_map dictionary always stores 0 as the value (line 450). If the value is unused, a set would be more appropriate and clearer in intent.
🔎 Proposed change
- self.pdl_sync_map: dict[str, int] | None = {}
+ self.pdl_sync_map: set[str] | None = set()
And update usages:
- if "has_cuda_pdl_sync" in attrs:
-     self.pdl_sync_map[function_name] = 0
+ if "has_cuda_pdl_sync" in attrs:
+     self.pdl_sync_map.add(function_name)
tilelang/transform/__init__.py (1)
403-406: LGTM with suggestion: Consider expanding the docstring.
The function follows the established pattern for pass wrappers. Consider adding a more detailed docstring similar to other functions in this file, documenting the have_pdl parameter and return type.
🔎 Suggested docstring enhancement
 def MarkCudaSyncCalls(have_pdl: bool = False):
-    """MarkCudaSyncCalls"""
+    """Mark CUDA synchronization calls for PDL support.
+
+    Parameters
+    ----------
+    have_pdl : bool
+        Whether the target supports PDL (Programmatic Dependent Launch).
+        PDL is available on compute capability >= 9.0 (Hopper+).
+
+    Returns
+    -------
+    fpass : tvm.transform.Pass
+        The result pass
+    """
     return _ffi_api.MarkCudaSyncCalls(have_pdl)  # type: ignore
tilelang/contrib/nvcc.py (1)
590-596: Minor: Prefix unused variable with underscore.
The minor variable from parse_compute_version is unpacked but never used. This is a valid static analysis hint.
Proposed fix
 def have_pdl(target):
     if target.kind.name != "cuda":
         return False
     compute_version = get_target_compute_version(target)
-    major, minor = parse_compute_version(compute_version)
+    major, _ = parse_compute_version(compute_version)
     return major >= 9
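For comparison only, the same >= 9.0 gate expressed as a runtime device query on the CUDA side; this sketch is hypothetical and independent of the PR's target-based have_pdl check.

#include <cuda_runtime.h>

// Hypothetical runtime analogue of have_pdl(): PDL needs compute capability 9.0+.
bool device_supports_pdl(int device) {
  int major = 0;
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
  return major >= 9;
}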
testing/python/jit/test_tilelang_jit_ctypes.py (1)
1-1: Remove unused import.
The tvm import is not used in this test file.
Proposed fix
-from tilelang import tvm as tvm
 import tilelang.language as T
src/transform/lower_pdl.cc (2)
83-83: Typo in class name: ElininateCudaSyncCalls → EliminateCudaSyncCalls.
The class name has a typo ("Elininate" instead of "Eliminate"). While this doesn't affect functionality, fixing it improves code readability and maintainability.
Proposed fix
-class ElininateCudaSyncCalls : public StmtExprMutator {
+class EliminateCudaSyncCalls : public StmtExprMutator {
 public:
   static PrimFunc Substitute(PrimFunc f) {
-    ElininateCudaSyncCalls mutator;
+    EliminateCudaSyncCalls mutator;
     PrimFunc new_f = f;
     new_f.CopyOnWrite()->body = mutator.VisitStmt(f->body);
     return new_f;
   }
   // ... rest of class ...
 private:
-  ElininateCudaSyncCalls() = default;
+  EliminateCudaSyncCalls() = default;
 };
Also update line 141:
   return have_pdl ? MarkCudaSyncCalls::Substitute(f)
-                  : ElininateCudaSyncCalls::Substitute(f);
+                  : EliminateCudaSyncCalls::Substitute(f);
57-78: Duplicate private: access specifier.
There are two private: access specifiers (lines 57 and 76). Consider consolidating them for cleaner code organization.
Proposed fix
-private:
   void CheckCall(const tir::CallNode *call) {
     if (!call)
       return;
     // ... implementation ...
   }
-private:
   bool has_trigger_launch_ = false;
   bool has_grid_sync_ = false;
   MarkCudaSyncCalls() = default;
+
+private:
+  void CheckCall(const tir::CallNode *call) {
+    if (!call)
+      return;
+    // ... implementation ...
+  }
+
+  bool has_trigger_launch_ = false;
+  bool has_grid_sync_ = false;
+
+  MarkCudaSyncCalls() = default;
 };
Note: The Cppcheck "syntax error" on line 147 is a false positive — TVM_FFI_STATIC_INIT_BLOCK is a valid TVM macro for FFI registration.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
docs/programming_guides/instructions.md
src/transform/lower_pdl.cc
testing/python/jit/test_tilelang_jit_ctypes.py
testing/python/jit/test_tilelang_jit_nvrtc.py
testing/python/language/test_tilelang_language_pdl.py
tilelang/contrib/nvcc.py
tilelang/engine/phase.py
tilelang/jit/adapter/libgen.py
tilelang/jit/adapter/nvrtc/wrapper.py
tilelang/jit/adapter/wrapper.py
tilelang/language/__init__.py
tilelang/language/pdl.py
tilelang/transform/__init__.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/jit/test_tilelang_jit_nvrtc.py
testing/python/jit/test_tilelang_jit_ctypes.py
tilelang/jit/adapter/nvrtc/wrapper.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/language/test_tilelang_language_pdl.py
🧬 Code graph analysis (10)
tilelang/transform/__init__.py (2)
  src/transform/lower_pdl.cc (1)
    MarkCudaSyncCalls (80-80)
  tilelang/contrib/nvcc.py (1)
    have_pdl (590-595)
tilelang/language/__init__.py (1)
  tilelang/language/pdl.py (2)
    pdl_trigger (10-14)
    pdl_sync (17-21)
tilelang/language/pdl.py (1)
  tilelang/language/tir/op.py (1)
    call_extern (173-195)
tilelang/engine/phase.py (3)
  tilelang/contrib/nvcc.py (1)
    have_pdl (590-595)
  tilelang/transform/__init__.py (1)
    MarkCudaSyncCalls (403-405)
  tilelang/language/ast/ir.py (1)
    target (1677-1707)
testing/python/language/test_tilelang_language_pdl.py (1)
  tilelang/language/pdl.py (2)
    pdl_trigger (10-14)
    pdl_sync (17-21)
testing/python/jit/test_tilelang_jit_ctypes.py (3)
  testing/python/jit/test_tilelang_jit_nvrtc.py (7)
    test_nvrtc_pdl (505-552)
    multi_kernels_with_pdl (511-532)
    main (30-50)
    main (138-158)
    main (384-413)
    main (513-530)
    kernel (469-484)
  testing/python/language/test_tilelang_language_pdl.py (2)
    main (7-16)
    main (23-32)
  tilelang/language/pdl.py (2)
    pdl_trigger (10-14)
    pdl_sync (17-21)
src/transform/lower_pdl.cc (2)
  tilelang/language/tir/op.py (1)
    call_extern (173-195)
  tilelang/contrib/nvcc.py (1)
    have_pdl (590-595)
tilelang/contrib/nvcc.py (1)
  tilelang/contrib/rocm.py (1)
    parse_compute_version (179-201)
tilelang/jit/adapter/wrapper.py (1)
  tilelang/jit/adapter/utils.py (1)
    parse_function_call_args (315-357)
tilelang/jit/adapter/libgen.py (2)
  tilelang/jit/adapter/ctypes/adapter.py (1)
    lib_code (287-289)
  tilelang/jit/adapter/cython/adapter.py (1)
    lib_code (371-373)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc
[error] 147-147: syntax error
(syntaxError)
🪛 Ruff (0.14.8)
tilelang/language/__init__.py
120-120: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
121-121: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
tilelang/language/pdl.py
4-7: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tilelang/contrib/nvcc.py
594-594: Unpacked variable minor is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
tilelang/jit/adapter/libgen.py
153-153: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (8)
tilelang/jit/adapter/wrapper.py (3)
141-155: LGTM: PDL kernel launch code structure looks correct.
The KERNEL_LAUNCH_FUNC_CODE correctly configures cudaLaunchConfig_t with cudaLaunchAttributeProgrammaticStreamSerialization and uses cudaLaunchKernelEx for the kernel launch. The attribute setup follows CUDA's programmatic stream serialization pattern.
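For readers unfamiliar with that pattern, a condensed sketch of the launch sequence such generated code follows. All names here (launch_with_pdl, the kernel, its arguments) are placeholders; the wrapper's actual template in this PR is the source of truth.

#include <cuda_runtime.h>

// Sketch: launch a kernel with the programmatic-stream-serialization attribute
// so it can overlap with (and synchronize against) its producer via PDL.
template <typename... Args>
cudaError_t launch_with_pdl(void (*kernel)(Args...), dim3 grid, dim3 block,
                            size_t smem_bytes, cudaStream_t stream, Args... args) {
  cudaLaunchConfig_t config = {};
  config.gridDim = grid;
  config.blockDim = block;
  config.dynamicSmemBytes = smem_bytes;
  config.stream = stream;

  cudaLaunchAttribute attrs[1];
  attrs[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attrs[0].val.programmaticStreamSerializationAllowed = 1;  // opt this launch into PDL
  config.attrs = attrs;
  config.numAttrs = 1;

  return cudaLaunchKernelEx(&config, kernel, args...);
}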
313-330: LGTM: PDL branch in dispatch function is correctly structured.
The new PDL path properly:
- Parses function call arguments
- Validates argument count matches parameters
- Uses KERNEL_LAUNCH_FUNC_CODE for kernel launch with PDL attributes
- Includes error checking after launch
448-450: LGTM: PDL attribute detection correctly populates the sync map.
The has_cuda_pdl_sync attribute is checked and the function is registered in pdl_sync_map for later use during code generation.
tilelang/language/__init__.py (1)
119-123: LGTM!
The PDL function re-exports follow the established pattern in this file. The # noqa: F401 comments are appropriate for suppressing "imported but unused" warnings in __init__.py files intended to expose public API.
docs/programming_guides/instructions.md (1)
141-144: LGTM!
Documentation for the new PDL synchronization helpers is clear and follows the established format of the instruction reference.
tilelang/engine/phase.py (1)
255-257: LGTM!
The MarkCudaSyncCalls pass is correctly integrated after SplitHostDevice. The unconditional application is appropriate since the pass internally handles both cases: marking sync calls when PDL is available (sm_90+) or eliminating them for older architectures.
testing/python/language/test_tilelang_language_pdl.py (1)
37-61: Test implementation looks good.
The tests correctly verify that PDL synchronization calls are present in generated code for sm_90 (Hopper+) and absent for sm_75 (pre-Hopper). This validates the MarkCudaSyncCalls pass behavior.
src/transform/lower_pdl.cc (1)
29-43: LGTM on the MarkCudaSyncCalls implementation.
The mutator correctly scans for CUDA sync extern calls and annotates the PrimFunc with has_cuda_pdl_trigger and has_cuda_pdl_sync attributes. The pattern of visiting both statements and expressions ensures comprehensive detection.
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/transform/lower_pdl.cc (3)
9-9: Remove duplicate include.
The header tvm/tir/builtin.h is included twice (line 9 and line 14).
🔎 Proposed fix
 #include "../op/builtin.h"
 #include "../target/utils.h"
 #include "tvm/ir/type.h"
 #include "tvm/tir/builtin.h"
 #include "tvm/tir/expr.h"
 #include "tvm/tir/stmt.h"
 #include <tvm/ffi/reflection/registry.h>
 #include <tvm/tir/analysis.h>
-#include <tvm/tir/builtin.h>
 #include <tvm/tir/stmt_functor.h>
 #include <tvm/tir/transform.h>
Also applies to: 14-14
58-74: Extract duplicated logic into a helper function.
The CheckCall methods in both MarkCudaSyncCalls (lines 58-74) and EliminateCudaSyncCalls (lines 111-130) contain nearly identical logic for detecting PDL sync function names. Consider extracting this into a common helper function to improve maintainability.
🔎 Proposed refactor
Add a helper function in an anonymous namespace:
namespace {
// Returns true if the call is a CUDA PDL sync call
bool IsPDLSyncCall(const tir::CallNode *call) {
  if (!call)
    return false;
  if (call->op.same_as(builtin::call_extern())) {
    if (!call->args.empty()) {
      if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
        std::string func_name = str_node->value;
        return func_name == "cudaTriggerProgrammaticLaunchCompletion" ||
               func_name == "cudaGridDependencySynchronize";
      }
    }
  }
  return false;
}

// Returns the specific PDL sync type (0=none, 1=trigger, 2=sync)
int GetPDLSyncType(const tir::CallNode *call) {
  if (!call)
    return 0;
  if (call->op.same_as(builtin::call_extern())) {
    if (!call->args.empty()) {
      if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
        std::string func_name = str_node->value;
        if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
          return 1;
        } else if (func_name == "cudaGridDependencySynchronize") {
          return 2;
        }
      }
    }
  }
  return 0;
}
} // anonymous namespace
Then update both classes to use these helpers.
Also applies to: 111-130
138-145: Consider a more descriptive pass name.
The pass name "tl.MarkCudaSyncCalls" only mentions "Mark" behavior, but the pass can also eliminate calls when have_pdl is false. Consider a name like "tl.ProcessCudaSyncCalls" or document the dual behavior clearly.
tilelang/jit/adapter/nvrtc/wrapper.py (1)
479-486: Docstring enhancement suggestion (optional):
The docstring is functional but could be improved to match the detail level of generate_l2_persistent_map. Consider documenting what PDL synchronization is, the expected structure of pdl_sync_map, and when/why this returns non-empty code:
 def generate_pdl_sync_code(self, function_name: str) -> str:
-    """
-    Generate Python code to insert PDL synchronization for a given kernel.
-    """
+    """Generate Python code to configure PDL synchronization for a kernel.
+
+    PDL (Programmatic Dependent Launch) enables programmatic stream serialization
+    for fine-grained kernel synchronization on compatible CUDA architectures.
+
+    Args:
+        function_name: Kernel name to check for PDL sync config
+
+    Returns:
+        Python code that sets programmatic stream serialization attributes,
+        or empty string if PDL sync is not configured for this kernel.
+    """
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/transform/lower_pdl.cc
tilelang/jit/adapter/nvrtc/wrapper.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
call_extern (173-195)
tilelang/contrib/nvcc.py (1)
have_pdl(590-595)
🪛 Cppcheck (2.18.0)
src/transform/lower_pdl.cc
[error] 147-147: syntax error
(syntaxError)
🔇 Additional comments (5)
src/transform/lower_pdl.cc (2)
29-81: LGTM!
The MarkCudaSyncCalls class correctly implements the TVM StmtExprMutator pattern to detect PDL synchronization calls and mark the function with appropriate attributes. The logic is defensive and properly chains to the base visitor methods.
147-151: LGTM! Static analysis false positive can be ignored.
The FFI registration correctly exposes the pass through TVM's reflection mechanism. The cppcheck syntax error is a false positive — cppcheck doesn't recognize the TVM_FFI_STATIC_INIT_BLOCK macro, which is a valid TVM construct.
tilelang/jit/adapter/nvrtc/wrapper.py (3)
172-186: LGTM! Past config overwriting issue has been resolved.
The PDL synchronization code correctly assumes the config object already exists and modifies it in place, rather than creating a new instance. The placement at line 207 (via placeholder {11}) ensures this code executes after config = CUlaunchConfig() is created at line 198.
207-207: LGTM! PDL sync placeholder correctly positioned.
The placeholder {11} is correctly placed after config creation and before kernel launch, ensuring PDL attributes can be applied to the existing config object.
423-424: LGTM! PDL sync code correctly integrated.
The per-kernel PDL synchronization code is generated and passed as the 11th format argument, correctly matching the {11} placeholder in KERNEL_LAUNCH_FUNC_PY. The pattern follows the established approach used for L2 persistent map generation.
Also applies to: 438-438
The Cython backend implementation is correct. The issue lies in the missing support for tvm_ffi. I'm currently working on adding it.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/transform/lower_pdl.cc (2)
58-74: Consider extracting duplicated CheckCall logic.
The logic for detecting CUDA PDL function names is duplicated between MarkCudaSyncCalls::CheckCall and EliminateCudaSyncCalls::CheckCall. Consider extracting this into a shared helper function to improve maintainability.
🔎 Proposed refactor to extract common helper
Add a helper function in the anonymous namespace:
+namespace {
+
+enum class CudaPdlCallType { kNone, kTriggerLaunch, kGridSync };
+
+CudaPdlCallType GetCudaPdlCallType(const tir::CallNode *call) {
+  if (!call || !call->op.same_as(builtin::call_extern()) || call->args.empty()) {
+    return CudaPdlCallType::kNone;
+  }
+
+  if (const auto *str_node = call->args[0].as<tvm::tir::StringImmNode>()) {
+    std::string func_name = str_node->value;
+    if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
+      return CudaPdlCallType::kTriggerLaunch;
+    } else if (func_name == "cudaGridDependencySynchronize") {
+      return CudaPdlCallType::kGridSync;
+    }
+  }
+  return CudaPdlCallType::kNone;
+}
+
+} // namespace
+
 class MarkCudaSyncCalls : public StmtExprMutator {
Then simplify both CheckCall methods:
 private:
   void CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return;
-    if (call->op.same_as(builtin::call_extern())) {
-      if (!call->args.empty()) {
-        if (const auto *str_node =
-            call->args[0].as<tvm::tir::StringImmNode>()) {
-          std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
-            has_trigger_launch_ = true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
-            has_grid_sync_ = true;
-          }
-        }
-      }
+    auto call_type = GetCudaPdlCallType(call);
+    if (call_type == CudaPdlCallType::kTriggerLaunch) {
+      has_trigger_launch_ = true;
+    } else if (call_type == CudaPdlCallType::kGridSync) {
+      has_grid_sync_ = true;
     }
   }
And in EliminateCudaSyncCalls:
 private:
   bool CheckCall(const tir::CallNode *call) {
-    if (!call)
-      return false;
-
-    if (call->op.same_as(builtin::call_extern())) {
-      if (!call->args.empty()) {
-        if (const auto *str_node =
-            call->args[0].as<tvm::tir::StringImmNode>()) {
-          std::string func_name = str_node->value;
-          if (func_name == "cudaTriggerProgrammaticLaunchCompletion") {
-            return true;
-          } else if (func_name == "cudaGridDependencySynchronize") {
-            return true;
-          }
-        }
-      }
-    }
-
-    return false;
+    return GetCudaPdlCallType(call) != CudaPdlCallType::kNone;
   }
Also applies to: 111-130
138-145: Pass name doesn't reflect dual behavior.
The pass is named "tl.MarkCudaSyncCalls" but it performs two different operations based on the have_pdl parameter: marking calls when true, or eliminating calls when false. Consider renaming to something more generic like "tl.HandleCudaSyncCalls" or "tl.ProcessCudaSyncCalls" to better reflect this dual behavior.
🔎 Proposed rename
- return CreatePrimFuncPass(pass_func, 0, "tl.MarkCudaSyncCalls", {});
+ return CreatePrimFuncPass(pass_func, 0, "tl.HandleCudaSyncCalls", {});
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/transform/lower_pdl.cc
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
call_extern (173-195)
tilelang/contrib/nvcc.py (1)
have_pdl(590-595)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc
[error] 147-147: syntax error
(syntaxError)
🔇 Additional comments (1)
src/transform/lower_pdl.cc (1)
147-151: Static analysis syntax error is a false positive.
The Cppcheck syntax error reported at line 147 is a false positive. The TVM_FFI_STATIC_INIT_BLOCK macro is a valid TVM framework construct for registering FFI functions at static initialization time. The implementation is correct.
Thanks for the clarification. From my understanding, the Cython backend itself is fine. In particular, with the tvm_ffi path we eventually call cudaLaunchKernelEx, but it’s not obvious to me where (or whether) we can intercept and adjust the cudaLaunchConfig_t there.
You can manipulate this in
Thanks for the suggestion. Unfortunately, this approach cannot achieve what we need.
https://github.com/apache/tvm/blob/main/src/runtime/cuda/cuda_module.cc#L184-L207
There is an existing draft PR here: apache/tvm#18604
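For context on what such an interception would look like at the driver-API level used by cuda_module.cc, here is a rough sketch of the general idea; it is not the actual TVM code nor the draft PR, and the function and parameter names are invented for illustration.

#include <cuda.h>

// Sketch: replace a plain cuLaunchKernel call with cuLaunchKernelEx plus the
// programmatic-stream-serialization attribute for PDL-enabled kernels.
CUresult LaunchWithPdl(CUfunction func, unsigned gx, unsigned gy, unsigned gz,
                       unsigned bx, unsigned by, unsigned bz,
                       unsigned smem_bytes, CUstream stream, void **kernel_args) {
  CUlaunchAttribute attr;
  attr.id = CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION;
  attr.value.programmaticStreamSerializationAllowed = 1;

  CUlaunchConfig config = {};
  config.gridDimX = gx;
  config.gridDimY = gy;
  config.gridDimZ = gz;
  config.blockDimX = bx;
  config.blockDimY = by;
  config.blockDimZ = bz;
  config.sharedMemBytes = smem_bytes;
  config.hStream = stream;
  config.attrs = &attr;
  config.numAttrs = 1;

  return cuLaunchKernelEx(&config, func, kernel_args, /*extra=*/nullptr);
}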
Actionable comments posted: 1
🧹 Nitpick comments (1)
testing/python/jit/test_tilelang_jit_cython.py (1)
41-43: Consider optimizing output tensor initialization.
Tensors b and c are initialized with torch.randn but are immediately overwritten by the kernel. Consider using torch.empty instead for better performance.
🔎 Proposed optimization
 a = torch.randn(N, dtype=torch.float32).cuda()
-b = torch.randn(N, dtype=torch.float32).cuda()
-c = torch.randn(N, dtype=torch.float32).cuda()
+b = torch.empty(N, dtype=torch.float32).cuda()
+c = torch.empty(N, dtype=torch.float32).cuda()
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
testing/python/jit/test_tilelang_jit_cython.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/jit/test_tilelang_jit_cython.py
🧬 Code graph analysis (1)
testing/python/jit/test_tilelang_jit_cython.py (4)
tilelang/language/kernel.py (1)
  threads (214-218)
tilelang/language/loop.py (1)
  Parallel (13-33)
tilelang/language/pdl.py (2)
  pdl_trigger (10-14)
  pdl_sync (17-21)
tilelang/utils/tensor.py (1)
  torch_assert_close (231-319)
🔇 Additional comments (5)
testing/python/jit/test_tilelang_jit_cython.py (5)
1-6: LGTM - Imports are correct.
All necessary imports are present for the PDL test functionality.
8-11: LGTM - Test setup is appropriate.
The test function is properly defined with a reasonable test size for validating basic PDL functionality.
14-35: LGTM - PDL usage and kernel logic are correct.
The PDL primitives are used correctly:
- T.pdl_trigger() is called after the first kernel completes its computation, signaling completion
- T.pdl_sync() is called before the second kernel begins processing, ensuring proper synchronization
The kernel logic correctly implements the expected computation pipeline (B = A + 1.0, then C = B * 2.0).
48-55: LGTM - Test execution and verification are correct.
The test properly executes the kernel, computes reference results, and verifies correctness using appropriate tolerances.
58-59: LGTM - Main guard is correctly implemented.
The main guard follows the standard pattern for tilelang test files.
I locally implemented similar functionality with that draft PR (: I believe this implementation is sufficient, and we can proceed by adding
Thanks. Since the ctypes execution backend was removed in #1510, we can also remove the related caching mechanisms in this PR :)
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/transform/lower_pdl.cc (1)
71-113: Consider refactoring duplicate PDL function name checks.
The CheckCall logic (lines 90-109) duplicates the string matching from MarkCudaSyncCalls::CheckCall (lines 46-62). Both methods check for the same two CUDA PDL function names.
🔎 Suggested refactor to reduce duplication
Extract the function name detection into a shared helper:
namespace {
bool IsCudaPdlFunction(const std::string& func_name) {
  return func_name == "cudaTriggerProgrammaticLaunchCompletion" ||
         func_name == "cudaGridDependencySynchronize";
}
} // namespace
Then simplify both CheckCall methods to use the helper. This centralizes the PDL function names to a single location.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/transform/common/attr.h
src/transform/lower_pdl.cc
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-15T08:56:23.827Z
Learnt from: cherichy
Repo: tile-ai/tilelang PR: 1421
File: tilelang/contrib/cutedsl/reduce.py:161-184
Timestamp: 2025-12-15T08:56:23.827Z
Learning: In Tilelang's CUDA backend and CuTeDSL backend, barrier IDs 1 and 2 are reserved for internal use (such as in AllReduce operations). User-defined barriers should use IDs starting from 3 to avoid synchronization conflicts.
Applied to files:
src/transform/common/attr.h
📚 Learning: 2025-09-12T09:47:46.474Z
Learnt from: kurisu6912
Repo: tile-ai/tilelang PR: 794
File: tilelang/transform/add_bufstore_wrapper.py:30-33
Timestamp: 2025-09-12T09:47:46.474Z
Learning: In TVM's PyStmtExprMutator, visit_block_ methods typically call super().visit_block_(op) to process child nodes and update internal state, but return the original op when the block itself doesn't need transformation. The pattern `return op` is correct for blocks that serve as containers where mutations happen at deeper levels.
Applied to files:
src/transform/lower_pdl.cc
🧬 Code graph analysis (1)
src/transform/lower_pdl.cc (2)
tilelang/language/tir/op.py (1)
call_extern (173-195)
tilelang/contrib/nvcc.py (1)
have_pdl(590-595)
🪛 Cppcheck (2.19.0)
src/transform/lower_pdl.cc
[error] 126-126: syntax error
(syntaxError)
🔇 Additional comments (5)
src/transform/common/attr.h (1)
14-18: LGTM! Clean attribute namespace organization.
The new attr namespace properly scopes the PDL-related attribute constants with clear naming and appropriate visibility for use across the codebase.
src/transform/lower_pdl.cc (4)
1-23: LGTM! Proper file structure and includes.
The file header, includes, and namespace setup follow TVM conventions correctly.
24-69: LGTM! Correct PDL call detection and annotation.
The MarkCudaSyncCalls class properly scans for CUDA PDL extern calls and annotates the PrimFunc with the appropriate attributes. The logic correctly identifies cudaTriggerProgrammaticLaunchCompletion and cudaGridDependencySynchronize calls.
117-124: LGTM! Conditional pass logic is correct.
The pass factory correctly dispatches to either mark or eliminate PDL calls based on the have_pdl flag, which depends on CUDA compute capability.
126-130: LGTM! FFI registration follows TVM conventions.
The pass registration is correct. The static analysis syntax error is a false positive — Cppcheck doesn't recognize the TVM_FFI_STATIC_INIT_BLOCK macro.
silentCoder-dev
left a comment
When have_pdl is not supported, we should skip the lower_pdl pass. Running it to try and eliminate PDL calls would be incorrect or unnecessary.


Add PDL support as per Issue #1463
Summary by CodeRabbit
New Features
Documentation
Tests