[Dev] Add vertical slash sparse attention and benchmark results #25
base: main
Conversation
- …ipts, and basic kernel implementations
- …rmatting in test and kernel files
- … include runtime requirements
- …nel initialization
- [Draft] Add grouped query attention fwd and bwd kernels; remove performance profiling from the check function
- [Draft] Add mamba chunk scan attention fwd kernel; add `args.tune` in the mamba_chunk_scan test
- [Dev] Add mamba chunk state attention fwd kernel; add `type: ignore`
- Add `type: ignore` to improve hint style; add MHA fwd/bwd kernels and tests for the BSHD layout; remove redundant code; improve naming; run yapf and ruff
- Add blocksparse_flash_attention fwd and bwd kernels; fix style and typos
- Change all kernel class names from lowercase to camelCase (e.g. MLA_kernel to MLAKernel) and optimize code format for better readability; update relevant documents to reflect these changes
- [Draft] Add grouped query attention fwd and bwd kernels; remove performance profiling from the check function; [Dev] add bitnet prefill and decode kernels
- Add MHA decode kernel and update naming; correct spelling
- Add linear attention recurrent kernel; run yapf and ruff; remove unnecessary pkg
- [Dev] Add GQA decode kernel and update naming; [Style] format code; [Fix] correct loop range of gqa_decode_split_ref using ceiling division
- Add utils.py; unify testing method; add padding to support seq_len indivisible by chunk_size; fix typo in MHAKernel; fix lint; [fix] generate `do` without requiring gradient
- …ls, [Fix] Resolve bugs in autotune of GQA kernels (tile-ai#22); [Dev] add autotune support for GQA forward and backward passes; [Fix] resolve bugs in autotune of GQA kernels; [Dev] add autotune support for MHA decode, forward and backward kernels
Summary of Changes
Hello @xwhzz, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a vertical slash sparse attention mechanism built on custom CUDA kernels and TileLang. It includes the full kernel implementation, a new kernel abstraction, and benchmarking infrastructure to validate its performance and correctness.
Highlights
- New Sparse Attention Mechanism: Introduced a novel 'Vertical Slash Sparse Attention' mechanism, designed for efficient computation in deep learning models.
- CUDA Kernel for Index Conversion: Added a custom CUDA kernel (`vertical_slash_index.cu`) and its C++ bindings (`kernels.cpp`) to dynamically convert and manage sparse attention indexes, optimizing data access patterns.
- TileLang and Triton Integration: The new attention kernel is implemented using TileLang for performance optimization, with a Triton-based reference program for parity checking and benchmarking.
- Performance Benchmarking: Comprehensive benchmarks demonstrate significant speedups (up to 1.8x) over baseline Triton implementations on NVIDIA H100 GPUs, showcasing improved TFlops and IO bandwidth.
- Kernel Abstraction and Utilities: A new `KernelBase` abstract class provides a standardized interface for kernel implementations, complemented by enhanced utility functions for robust parity checks and detailed performance analysis (a minimal sketch of such an interface follows below).
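The abstraction itself ships in this PR; the sketch below only illustrates the kind of interface such a base class typically standardizes. Apart from `ref_program` (which the test snippets further down call), the method names and tolerance defaults here are assumptions, not the PR's actual API.

```python
# Minimal sketch of a kernel-abstraction base class in the spirit of the
# KernelBase described above. Except for `ref_program`, names and defaults
# are assumptions for illustration, not this PR's actual interface.
from abc import ABC, abstractmethod

import torch


class SketchKernelBase(ABC):
    """Build an optimized kernel and expose a reference for parity checks and benchmarks."""

    @abstractmethod
    def kernel(self, *args, **kwargs) -> torch.Tensor:
        """Run the optimized (e.g. TileLang) implementation."""

    @abstractmethod
    def ref_program(self, *args, **kwargs) -> torch.Tensor:
        """Run a straightforward reference implementation (e.g. Triton or PyTorch)."""

    def check(self, *args, atol: float = 1e-2, rtol: float = 1e-2, **kwargs) -> None:
        """Parity check: compare the optimized kernel against the reference."""
        out = self.kernel(*args, **kwargs)
        ref = self.ref_program(*args, **kwargs)
        torch.testing.assert_close(out, ref, atol=atol, rtol=rtol)
```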
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. The supported commands are summarized below.

| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request introduces a vertical slash sparse attention mechanism, including CUDA kernels, a TileLang implementation, and benchmark results. The changes involve adding new files for the kernel implementation, modifying the __init__.py files to include the new kernel, and adding a test file to evaluate the performance of the new attention mechanism.
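As a rough illustration of the `__init__.py` change described above, exposing a new kernel class usually amounts to a re-export along these lines (the module path and `__all__` contents are assumptions based on the file names visible in this PR, not the exact diff):

```python
# Hypothetical excerpt of top/kernel/__init__.py after this PR; the actual
# export list and module layout may differ.
from .vs_sparse_attention import VerticalSlashSparseAttentionKernel  # noqa: F401

__all__ = ["VerticalSlashSparseAttentionKernel"]
```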
```python
o = mod(q, k, v, block_count, block_offset, column_count, column_index)
return o


@staticmethod
```
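For context on the call above: in a vertical slash sparsity pattern, each query block attends to a small set of diagonal ("slash") key blocks plus individually selected ("vertical") key columns, and the four index tensors encode that layout per (batch, head, query block). The sketch below shows one plausible set of shapes; the exact shapes, dtypes, and how the CUDA index-conversion kernel fills them are defined by the PR, not here.

```python
# Plausible layout of the sparse-index tensors passed to the kernel above.
# Shapes, dtypes, and capacity constants are illustrative assumptions; the
# authoritative definitions live in vertical_slash_index.cu and the test file.
import torch

BATCH, HEADS, SEQ_LEN, BLOCK_M = 1, 32, 8192, 64
NUM_ROWS = (SEQ_LEN + BLOCK_M - 1) // BLOCK_M  # number of query blocks
NNZ_S, NNZ_V = 32, 1024  # capacity for kept slash blocks / vertical columns per query block

# How many key blocks each query block attends to, and their start offsets.
block_count = torch.zeros(BATCH, HEADS, NUM_ROWS, dtype=torch.int32)
block_offset = torch.zeros(BATCH, HEADS, NUM_ROWS, NNZ_S, dtype=torch.int32)

# How many individual key columns each query block keeps, and their indices.
column_count = torch.zeros(BATCH, HEADS, NUM_ROWS, dtype=torch.int32)
column_index = torch.zeros(BATCH, HEADS, NUM_ROWS, NNZ_V, dtype=torch.int32)
```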
```python
return kernel_func(block_M, block_N, num_stages)


@torch.compile
```
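The `kernel_func(block_M, block_N, num_stages)` call suggests a factory that specializes the kernel for one tile configuration, which is the usual hook for autotuning; the `@torch.compile` decorator in the same hunk compiles a PyTorch-level function (most likely the reference program) for a fairer baseline. A structural sketch of the factory pattern, with hypothetical names and no real kernel body:

```python
# Structural sketch only: how a tile-configuration factory is typically swept
# by an autotuner. Names and bodies are hypothetical, not this PR's TileLang code.
import itertools


def make_kernel(block_M: int, block_N: int, num_stages: int):
    """Return a callable specialized for one (block_M, block_N, num_stages) config."""

    def run(*tensors):
        # A real implementation would JIT-compile and launch a TileLang kernel
        # built with block_M x block_N tiles and `num_stages` pipeline stages.
        raise NotImplementedError("sketch only")

    return run


# Candidate configurations an autotuner might benchmark and pick from.
candidates = [
    make_kernel(bm, bn, stages)
    for bm, bn, stages in itertools.product((64, 128), (32, 64, 128), (1, 2, 3))
]
```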
```python
class VerticalSlashSparseAttentionKernel(KernelBase):
    map_dtype = {torch.float16: "float16", torch.bfloat16: "bfloat16"}
```
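The `map_dtype` attribute maps torch dtypes to the string names that a TileLang-style kernel builder expects. A small, hedged illustration of the usual lookup (only the dictionary itself comes from the PR; the class and method around it are hypothetical):

```python
# Only the map_dtype dictionary is from the PR; the class and method around it
# are a hypothetical illustration of how such a map is typically consumed.
import torch


class DtypeExample:
    map_dtype = {torch.float16: "float16", torch.bfloat16: "bfloat16"}

    def dtype_name(self, t: torch.Tensor) -> str:
        try:
            return self.map_dtype[t.dtype]
        except KeyError:
            raise TypeError(f"unsupported dtype {t.dtype}; expected float16 or bfloat16") from None


assert DtypeExample().dtype_name(torch.empty(1, dtype=torch.bfloat16)) == "bfloat16"
```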
top/kernel/vs_sparse_attention.py (outdated diff)
```python
def pytorch_ref_program(self, *args, **kwargs):

    def attention_func(queries, keys, values, attention_mask):
        attention_weights = torch.matmul(queries, keys.transpose(2, 3)) / math.sqrt(queries.size(-1))
        attention_weights += attention_mask.to(queries.dtype) * torch.finfo(queries.dtype).min
        attention_weights = nn.functional.softmax(attention_weights, dim=-1, dtype=torch.float32).to(queries.dtype)
        attention_output = torch.matmul(attention_weights, values)
        return attention_output
```
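The reference above is plain scaled dot-product attention with an additive mask: positions where `attention_mask` is 1 are pushed to the dtype minimum before the softmax, so a strict-upper-triangular mask of ones yields causal attention. A self-contained usage sketch (shapes and the default float32 dtype are illustrative; the PR's kernels target float16/bfloat16):

```python
# Self-contained restatement of the dense reference above plus a causal-mask
# usage example. Shapes and dtype here are illustrative assumptions.
import math

import torch
from torch import nn


def attention_func(queries, keys, values, attention_mask):
    # Scaled QK^T, additive mask, float32 softmax, then weighted sum of values,
    # mirroring the reviewed reference implementation.
    attention_weights = torch.matmul(queries, keys.transpose(2, 3)) / math.sqrt(queries.size(-1))
    attention_weights += attention_mask.to(queries.dtype) * torch.finfo(queries.dtype).min
    attention_weights = nn.functional.softmax(attention_weights, dim=-1, dtype=torch.float32).to(queries.dtype)
    return torch.matmul(attention_weights, values)


B, H, S, D = 1, 4, 128, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

# 1 = masked out; masking the strict upper triangle yields causal attention.
causal_mask = torch.triu(torch.ones(S, S), diagonal=1)[None, None]  # broadcasts to (B, H, S, S)

out = attention_func(q, k, v, causal_mask)
print(out.shape)  # torch.Size([1, 4, 128, 64])
```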
```python
    seq_len=SEQ_LEN,
    head_dim=D_HEAD,
)
partity(kernel, q, k, v, block_count=block_count, block_offset=block_offset, column_count=column_count, column_index=column_index)

perf = performance(kernel, [kernel.ref_program], q, k, v, block_count=block_count, block_offset=block_offset, column_count=column_count, column_index=column_index)
```