
Conversation

@shuyingl commented Dec 6, 2025

Summary

This PR adds support for DeepSeek-V3.2-Exp, which extends DeepSeek V3 with DeepSeek Sparse Attention (DSA) powered by a Lightning Indexer.

Key Features

  • Lightning Indexer: Enables O(L·k) sparse attention complexity (vs. O(L²) dense), where k=2048 tokens are selected per query (a PyTorch sketch follows this list)
  • Hadamard Transform: Applied to Q/K in the indexer for activation rotation, with a pure PyTorch fallback when fast-hadamard-transform is not installed
  • Non-interleaved RoPE in Indexer: A critical difference from MLA, which uses interleaved RoPE (as noted in the official repo's bug fix)
  • Training Support: Full support for training, with a configurable sparse attention toggle and a detachable indexer input for the two-stage training approach described in the technical report
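
The indexer-specific pieces above can be illustrated with a short, self-contained PyTorch sketch. This is not the PR's code: the scoring rule (dot products summed over indexer heads), the tensor shapes, and the orthonormal scaling of the Hadamard fallback are assumptions made for illustration.

```python
import torch


def hadamard_transform_fallback(x: torch.Tensor) -> torch.Tensor:
    """Pure-PyTorch fast Walsh-Hadamard transform over the last dimension,
    usable when the optional fast-hadamard-transform package is absent.
    The last dimension must be a power of two; orthonormal scaling is assumed."""
    d = x.shape[-1]
    assert d > 0 and (d & (d - 1)) == 0, "last dim must be a power of two"
    out, h = x, 1
    while h < d:
        blocks = out.reshape(*x.shape[:-1], d // (2 * h), 2, h)
        a, b = blocks[..., 0, :], blocks[..., 1, :]
        out = torch.cat((a + b, a - b), dim=-1).reshape(x.shape)
        h *= 2
    return out / d ** 0.5


def indexer_topk(q_idx: torch.Tensor, k_idx: torch.Tensor, topk: int) -> torch.Tensor:
    """Score every (query, key) pair with the indexer heads and keep the top-k
    key positions per query -- the O(L*k) selection step described above.

    q_idx, k_idx: (batch, n_index_heads, seq_len, index_head_dim)
    returns:      (batch, seq_len, topk) indices of the selected key tokens.
    """
    # Dot-product scores summed over indexer heads (assumed aggregation rule).
    scores = torch.einsum("bhqd,bhkd->bqk", q_idx, k_idx)
    seq_len = scores.shape[-1]
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    # Early queries have fewer than `topk` valid keys; their extra slots come
    # back with -inf scores and would be masked out downstream.
    return scores.topk(min(topk, seq_len), dim=-1).indices


if __name__ == "__main__":
    batch, heads, seq, dim = 1, 4, 16, 8
    q = hadamard_transform_fallback(torch.randn(batch, heads, seq, dim))
    k = hadamard_transform_fallback(torch.randn(batch, heads, seq, dim))
    print(indexer_topk(q, k, topk=4).shape)  # torch.Size([1, 16, 4])
```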

New Configuration Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `index_n_heads` | 64 | Number of indexer heads |
| `index_head_dim` | 128 | Indexer head dimension |
| `index_topk` | 2048 | Tokens selected per query for sparse attention |
| `use_sparse_attention` | `True` | Toggle sparse vs. dense attention |
| `detach_indexer_input` | `False` | Detach the indexer input for Stage 2 training optimization |
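
For reference, a hypothetical snippet showing the new parameters together. The class name `DeepseekV32Config` is only assumed from the new model directory (src/transformers/models/deepseek_v32/); the actual exported name is not confirmed by this PR text.

```python
from transformers import DeepseekV32Config  # assumed class name

config = DeepseekV32Config(
    index_n_heads=64,            # number of indexer heads
    index_head_dim=128,          # indexer head dimension
    index_topk=2048,             # tokens kept per query by sparse attention
    use_sparse_attention=True,   # set False for the Stage 1 dense warm-up
    detach_indexer_input=False,  # set True for the Stage 2 optimization
)
```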

Training Strategy (from Technical Report)

The implementation supports the two-stage training approach (a configuration sketch follows the list):

  1. Stage 1 (Dense Warm-up): Train only the indexer with dense attention to align with main attention distribution
  2. Stage 2 (Sparse Training): Train full model with sparse attention, indexer input detached for separate optimization
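
A rough sketch of how the two stages could map onto the new flags. Freezing parameters by matching the substring "indexer" is an assumption about module naming made for illustration, not something stated in this PR.

```python
def configure_stage(model, config, stage: int):
    """Toggle the DSA-related flags for the two-stage recipe (illustrative only)."""
    if stage == 1:
        # Dense warm-up: keep main attention dense and train only the indexer.
        config.use_sparse_attention = False
        config.detach_indexer_input = False
        for name, param in model.named_parameters():
            param.requires_grad = "indexer" in name  # assumed submodule name
    else:
        # Sparse training: train the full model with sparse attention; the
        # indexer input is detached so the indexer is optimized separately.
        config.use_sparse_attention = True
        config.detach_indexer_input = True
        for param in model.parameters():
            param.requires_grad = True
    return model, config
```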

Files Changed

  • New model directory: src/transformers/models/deepseek_v32/
  • Updated auto mappings for config, model, and task-specific classes

References

  • Reference implementation and technical report: https://github.com/deepseek-ai/DeepSeek-V3.2-Exp

Test plan

  • Config import test
  • Model import test
  • Forward pass test with small config (see the sketch below)
  • Numerical comparison with reference implementation
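
For the forward-pass item, a minimal smoke test could look like the following. The class names and the non-indexer small-config fields are assumptions for illustration, not confirmed by this PR text.

```python
import torch
from transformers import DeepseekV32Config, DeepseekV32ForCausalLM  # assumed names


def test_forward_pass_small_config():
    config = DeepseekV32Config(
        vocab_size=128,
        hidden_size=64,
        num_hidden_layers=2,
        num_attention_heads=4,
        index_n_heads=2,
        index_head_dim=16,
        index_topk=8,
        use_sparse_attention=True,
    )
    model = DeepseekV32ForCausalLM(config).eval()
    input_ids = torch.randint(0, config.vocab_size, (1, 16))
    with torch.no_grad():
        output = model(input_ids)
    assert output.logits.shape == (1, 16, config.vocab_size)
```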

🤖 Generated with Claude Code

github-actions bot commented Dec 6, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto
