Conversation

@LeiWang1999 (Member)

This pull request updates README.md, docs/Installation.md, and several other files to improve documentation, update repository URLs, and introduce new functionality. The most important changes are a new "Latest News" section in the README, updated repository URLs, and new JIT compilation functionality.

Documentation updates:

  • README.md: Added a "Latest News" section announcing the open-source release of tile-lang.
  • README.md: Added a new benchmark image for Dequantize Matmul Performance on A100.

Repository URL updates:

  • README.md, docker/Dockerfile.cu120, docker/README.md, docs/Installation.md, setup.py: Updated repository URLs from https://github.com/microsoft/TileLang to https://github.com/TileLang/tile-lang. [1] [2] [3] [4] [5]

Code enhancements:

@LeiWang1999 (Member, Author)

1. Overall Flow

def matmul(...):
    # 1) Derive A, B, C shapes from user parameters
    # 2) Define the kernel function "main" with tilelang.language (imported as T)
    
    @tilelang.jit(
        out_idx=-1,  # create the output tensor at runtime
    )
    @T.prim_func
    def main(
        A: T.Buffer(A_shape, in_dtype),
        B: T.Buffer(B_shape, in_dtype),
        C: T.Buffer((M, N), out_dtype),
    ):
        # TileLang DSL: T.Kernel, T.alloc_shared, T.gemm, T.copy, etc.
        ...
    
    return main  # returns the compiled kernel

  • @T.prim_func
    Converts the Python function body written in the TileLang DSL (T.Kernel, T.copy, T.gemm, etc.) into a low-level TVM TIR PrimFunc.

  • @tilelang.jit
    After the function is parsed into TIR, it handles the subsequent lowering and build steps, and also controls how output tensors are created at runtime based on the out_idx setting.

    • In this example, out_idx=-1 means the output buffer is automatically allocated at runtime and returned to the user.

When matmul(...) is called, it produces a compiled kernel (the function main), which can then be called to run on the GPU/CPU.
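
For concreteness, here is a minimal sketch of a complete kernel following this skeleton, modeled on the tile-lang matmul example; the tile sizes (block_M, block_N, block_K), the pipeline depth (num_stages), and the row-major, non-transposed layouts are illustrative assumptions, not part of this PR:

import tilelang
import tilelang.language as T

def matmul(M, N, K, block_M=128, block_N=128, block_K=32,
           in_dtype="float16", out_dtype="float16", accum_dtype="float"):

    @tilelang.jit(
        out_idx=-1,  # the last argument (C) is allocated at runtime
    )
    @T.prim_func
    def main(
        A: T.Buffer((M, K), in_dtype),
        B: T.Buffer((K, N), in_dtype),
        C: T.Buffer((M, N), out_dtype),
    ):
        # One thread block per (block_M x block_N) tile of C
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), in_dtype)
            B_shared = T.alloc_shared((block_K, block_N), in_dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            # Software-pipelined loop over the K dimension
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])

    return main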


2. Running the GEMM: run_gemm(...)

def run_gemm(...):
    # 1) Obtain the compiled kernel
    matmul_kernel = matmul(...)

    # 2) Prepare PyTorch tensors (A, B) with optional transpose
    A = torch.randn(M, K, dtype=...).cuda()
    B = torch.randn(K, N, dtype=...).cuda()
    if trans_A: A = A.T
    if trans_B: B = B.T

    # 3) Execute the compiled kernel and get the output C
    C = matmul_kernel(A, B)

    # 4) Compare with reference result (torch.matmul)
    ref_C = torch.matmul(A.float(), B.float()).to(out_dtype)
    tilelang.testing.torch_assert_close(C, ref_C, atol=1e-2, rtol=1e-2)

  • matmul_kernel(A, B) actually invokes the compiled function main(A, B), which runs on the GPU/CPU.
  • Because out_idx=-1, we only supply the input tensors (A, B), and the output is created internally by the TileLang runtime and returned as C.
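
Putting the two pieces together, a hypothetical invocation might look like the following (the sizes and dtypes are illustrative, and matmul refers to the sketch in section 1):

import torch

matmul_kernel = matmul(1024, 1024, 1024)

A = torch.randn(1024, 1024, dtype=torch.float16).cuda()
B = torch.randn(1024, 1024, dtype=torch.float16).cuda()
C = matmul_kernel(A, B)  # out_idx=-1: C is allocated by the runtime and returned

ref_C = (A.float() @ B.float()).half()
torch.testing.assert_close(C, ref_C, atol=1e-2, rtol=1e-2)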

3. The Role of @tilelang.jit Parameters

Within the @tilelang.jit decorator, the key parameter shown is out_idx=-1. Some notes:

  • out_idx
    • Identifies which function argument is the output tensor; -1 refers to the last argument (here, C).
    • When out_idx is set, you do not need to pass that buffer yourself; the TileLang runtime allocates it on your behalf and returns it.
    • If out_idx were omitted, you would instead be required to pass an already-allocated output buffer (here, C) as an explicit argument to the kernel.

You can also configure other arguments in @tilelang.jit(...) (e.g. target='cuda') to control device selection and other compilation or runtime details.
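
To make the contrast concrete, a short sketch (kernel, A, B, M, and N are placeholders; the second variant assumes the kernel was compiled without out_idx):

# Compiled with out_idx=-1: the runtime allocates C and returns it.
C = kernel(A, B)

# Compiled without out_idx: the caller allocates C and passes it explicitly.
C = torch.empty(M, N, dtype=torch.float16, device="cuda")
kernel(A, B, C)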

@LeiWang1999 merged commit 444892f into main on Jan 20, 2025
3 of 4 checks passed
@LeiWang1999 deleted the jit branch on January 20, 2025 at 16:17
vincentccc pushed a commit to vincentccc/tilelang that referenced this pull request Jul 21, 2025
* README.md fixed

* update test ci