
Conversation

@shuyingl commented Dec 6, 2025

Summary

This PR adds support for DeepSeek-V3.2-Exp, which extends DeepSeek V3 with DeepSeek Sparse Attention (DSA) powered by a Lightning Indexer.

Key Features

  • Lightning Indexer: Enables O(L·k) sparse attention complexity (vs. O(L²) dense), where k=2048 tokens are selected per query (a PyTorch sketch follows this list)
  • Hadamard Transform: Applied to Q/K in the indexer for activation rotation, with a pure PyTorch fallback when fast-hadamard-transform is not installed
  • Non-interleaved RoPE in Indexer: A critical difference from MLA, which uses interleaved RoPE (as noted in the official repo's bug fix)
  • Training Support: Full support for training, with a configurable sparse attention toggle and a detachable indexer input for the two-stage training approach described in the technical report
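
The indexer-specific pieces above can be illustrated with a short, self-contained PyTorch sketch. This is not the PR's code: the scoring rule (dot products summed over indexer heads), the tensor shapes, and the orthonormal scaling of the Hadamard fallback are assumptions made for illustration.

```python
import torch


def hadamard_transform_fallback(x: torch.Tensor) -> torch.Tensor:
    """Pure-PyTorch fast Walsh-Hadamard transform over the last dimension,
    usable when the optional fast-hadamard-transform package is absent.
    The last dimension must be a power of two; orthonormal scaling is assumed."""
    d = x.shape[-1]
    assert d > 0 and (d & (d - 1)) == 0, "last dim must be a power of two"
    out, h = x, 1
    while h < d:
        blocks = out.reshape(*x.shape[:-1], d // (2 * h), 2, h)
        a, b = blocks[..., 0, :], blocks[..., 1, :]
        out = torch.cat((a + b, a - b), dim=-1).reshape(x.shape)
        h *= 2
    return out / d ** 0.5


def indexer_topk(q_idx: torch.Tensor, k_idx: torch.Tensor, topk: int) -> torch.Tensor:
    """Score every (query, key) pair with the indexer heads and keep the top-k
    key positions per query -- the O(L*k) selection step described above.

    q_idx, k_idx: (batch, n_index_heads, seq_len, index_head_dim)
    returns:      (batch, seq_len, topk) indices of the selected key tokens.
    """
    # Dot-product scores summed over indexer heads (assumed aggregation rule).
    scores = torch.einsum("bhqd,bhkd->bqk", q_idx, k_idx)
    seq_len = scores.shape[-1]
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    # Early queries have fewer than `topk` valid keys; their extra slots come
    # back with -inf scores and would be masked out downstream.
    return scores.topk(min(topk, seq_len), dim=-1).indices


if __name__ == "__main__":
    batch, heads, seq, dim = 1, 4, 16, 8
    q = hadamard_transform_fallback(torch.randn(batch, heads, seq, dim))
    k = hadamard_transform_fallback(torch.randn(batch, heads, seq, dim))
    print(indexer_topk(q, k, topk=4).shape)  # torch.Size([1, 16, 4])
```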

New Configuration Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `index_n_heads` | 64 | Number of indexer heads |
| `index_head_dim` | 128 | Indexer head dimension |
| `index_topk` | 2048 | Tokens selected per query for sparse attention |
| `use_sparse_attention` | `True` | Toggle sparse vs. dense attention |
| `detach_indexer_input` | `False` | Detach the indexer input for Stage 2 training optimization |
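
For reference, a hypothetical snippet showing the new parameters together. The class name `DeepseekV32Config` is only assumed from the new model directory (src/transformers/models/deepseek_v32/); the actual exported name is not confirmed by this PR text.

```python
from transformers import DeepseekV32Config  # assumed class name

config = DeepseekV32Config(
    index_n_heads=64,            # number of indexer heads
    index_head_dim=128,          # indexer head dimension
    index_topk=2048,             # tokens kept per query by sparse attention
    use_sparse_attention=True,   # set False for the Stage 1 dense warm-up
    detach_indexer_input=False,  # set True for the Stage 2 optimization
)
```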

Training Strategy (from Technical Report)

The implementation supports the two-stage training approach (a configuration sketch follows the list):

  1. Stage 1 (Dense Warm-up): Train only the indexer with dense attention to align with main attention distribution
  2. Stage 2 (Sparse Training): Train full model with sparse attention, indexer input detached for separate optimization
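
A rough sketch of how the two stages could map onto the new flags. Freezing parameters by matching the substring "indexer" is an assumption about module naming made for illustration, not something stated in this PR.

```python
def configure_stage(model, config, stage: int):
    """Toggle the DSA-related flags for the two-stage recipe (illustrative only)."""
    if stage == 1:
        # Dense warm-up: keep main attention dense and train only the indexer.
        config.use_sparse_attention = False
        config.detach_indexer_input = False
        for name, param in model.named_parameters():
            param.requires_grad = "indexer" in name  # assumed submodule name
    else:
        # Sparse training: train the full model with sparse attention; the
        # indexer input is detached so the indexer is optimized separately.
        config.use_sparse_attention = True
        config.detach_indexer_input = True
        for param in model.parameters():
            param.requires_grad = True
    return model, config
```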

Files Changed

  • New model directory: src/transformers/models/deepseek_v32/
  • Updated auto mappings for config, model, and task-specific classes

References

  • Reference implementation and technical report: https://github.com/deepseek-ai/DeepSeek-V3.2-Exp

Test plan

  • Config import test
  • Model import test
  • Forward pass test with small config (see the sketch below)
  • Numerical comparison with reference implementation
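
For the forward-pass item, a minimal smoke test could look like the following. The class names and the non-indexer small-config fields are assumptions for illustration, not confirmed by this PR text.

```python
import torch
from transformers import DeepseekV32Config, DeepseekV32ForCausalLM  # assumed names


def test_forward_pass_small_config():
    config = DeepseekV32Config(
        vocab_size=128,
        hidden_size=64,
        num_hidden_layers=2,
        num_attention_heads=4,
        index_n_heads=2,
        index_head_dim=16,
        index_topk=8,
        use_sparse_attention=True,
    )
    model = DeepseekV32ForCausalLM(config).eval()
    input_ids = torch.randint(0, config.vocab_size, (1, 16))
    with torch.no_grad():
        output = model(input_ids)
    assert output.logits.shape == (1, 16, config.vocab_size)
```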

🤖 Generated with Claude Code

github-actions bot commented Dec 6, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto
