# Attention backends
All attention implementations perform the same computation. Every token is compared to every other token. The difference is *how* the computation is performed. Basic attention scales poorly because it materializes the full attention matrix in memory, creating bottlenecks that slow down inference. Optimized implementations rearrange the math to reduce memory traffic for faster, more affordable inference.
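
To see why the basic implementation is memory-hungry, here is a minimal sketch of eager attention in plain PyTorch. It is illustrative only: the shapes are arbitrary, and masking and multiple heads are ignored.

```py
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 1, 1024, 64
query = torch.randn(batch, seq_len, head_dim)
key = torch.randn(batch, seq_len, head_dim)
value = torch.randn(batch, seq_len, head_dim)

# the full (seq_len, seq_len) score matrix is materialized in memory - this is the bottleneck
scores = query @ key.transpose(-2, -1) / head_dim**0.5
attn_output = F.softmax(scores, dim=-1) @ value
```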
The [`AttentionInterface`] provides optimized attention implementations. It decouples the attention implementation from the model implementation to simplify experimenting with different functions, and its consistent interface makes it easy to add new backends.

| attention backend | description |
|---|---|
|`"flash_attention_3"`| improves FlashAttention-2 by overlapping operations and fusing the forward and backward passes more tightly |
|`"flash_attention_2"`| tiles the computation into smaller blocks and uses fast on-chip memory |
|`"flex_attention"`| a framework for specifying custom attention patterns (sparse, block-local, sliding window) without writing low-level kernels by hand |
|`"sdpa"`| built-in PyTorch implementation of [scaled dot product attention](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) |
| <code>"paged&#124;flash_attention_2"</code> | paged version of FlashAttention-2 |
| <code>"paged&#124;sdpa"</code> | paged version of SDPA |
| <code>"paged&#124;eager"</code> | paged version of eager attention |

## Set an attention backend
Use the `attn_implementation` argument in [`~PreTrainedModel.from_pretrained`] to instantiate a model with a specific attention function.
```py
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", attn_implementation="flash_attention_2")
```
Switch between attention backends at runtime without reloading the model using [`~PreTrainedModel.set_attn_implementation`].
```py
model.set_attn_implementation("sdpa")
```
### Kernels
Download and load compiled compute kernels directly from the [Hub](https://huggingface.co/models?other=kernels) at runtime with the [Kernels](https://huggingface.co/docs/kernels/index) library. This avoids packaging issues from mismatched PyTorch or CUDA versions.
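
For example, a kernel repository from the Hub can be passed directly to `attn_implementation`. The snippet below is a sketch that assumes the [`kernels-community/flash-attn`](https://huggingface.co/kernels-community/flash-attn) repository and a recent Transformers release with the `kernels` package installed.

```py
from transformers import AutoModelForCausalLM

# assumes `pip install kernels`; the compiled kernel is downloaded from the Hub at load time
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    attn_implementation="kernels-community/flash-attn",
)
```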
### SDPA context manager
PyTorch's scaled dot product attention (SDPA) automatically selects the fastest attention function for CUDA backends. It defaults to the PyTorch C++ implementation for other backends.

Force SDPA to use a specific implementation with the [torch.nn.attention.sdpa_kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html) context manager.
```py
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

# `model` is the model loaded above with the "sdpa" attention implementation
# force SDPA to dispatch to its FlashAttention kernel inside this context
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    outputs = model(torch.ones(1, 5, dtype=int))
```
## Backbone-specific attention
Multimodal models use different backbones for each modality, and a different attention function may work better for each backbone. For example, some vision backbones perform better in fp32, which FlashAttention does not support.

Map each backbone to an attention function with a dict whose keys match the model's sub-config names. In the example below, the vision backbone uses SDPA while the text backbone continues to use FlashAttention.
```py
from transformers import AutoModelForImageTextToText

# the checkpoint is only an example; the dict keys must match the model's sub-config names
model = AutoModelForImageTextToText.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    attn_implementation={"vision_config": "sdpa", "text_config": "flash_attention_2"},
)
```
## Create a new attention function
Customize or create new attention functions by adding them to the attention registry with [`AttentionInterface.register`]. Models use these functions through the `attn_implementation` argument.

This example customizes the attention function to print a statement for each layer.
```python
import torch
from transformers import AutoModelForCausalLM, AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward

def my_new_sdpa(*args, **kwargs):
    print("I just entered the attention computation")
    return sdpa_attention_forward(*args, **kwargs)

AttentionInterface.register("my_new_sdpa", my_new_sdpa)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", attn_implementation="my_new_sdpa")
model(torch.ones(1, 5, dtype=int))
```
The statement prints once per attention layer, 16 times for this model.

You can also add new arguments to the attention function. Models supporting [`AttentionInterface`] propagate kwargs to the attention layers and to the attention function, so you can pass a new argument as a kwarg in the model's forward. Custom attention functions must follow this signature and return format. The snippet below sketches a function that accepts an illustrative `a_new_kwarg` argument and otherwise delegates to SDPA.
```python
from typing import Optional

import torch
from transformers import AutoModelForCausalLM, AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward

def custom_attention(
    module: torch.nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    a_new_kwarg=None,  # illustrative new argument; add as many kwargs as you need
    **kwargs,  # accept **kwargs because models pass other arguments through
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    if a_new_kwarg is not None:
        print(f"received a_new_kwarg={a_new_kwarg}")
    return sdpa_attention_forward(module, query, key, value, attention_mask, **kwargs)

AttentionInterface.register("custom", custom_attention)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", attn_implementation="custom")
model(torch.ones(1, 5, dtype=int), a_new_kwarg="hello")  # qualify the new argument by name
```
Check a model's [modeling code](https://github.com/huggingface/transformers/tree/main/src/transformers/models) to confirm what arguments and kwargs it sends to the attention function.
### AttentionMaskInterface
Configure which key and value tokens queries attend to with [`AttentionMaskInterface`]. Some attention functions require this configuration. Customize the attention mask function and add it to the registry with [`AttentionMaskInterface.register`].
```python
import torch
from transformers import AttentionMaskInterface
from transformers.masking_utils import sdpa_mask

def my_new_sdpa_mask(*args, **kwargs):
    print("I just entered the attention mask computation")
    return sdpa_mask(*args, **kwargs)

AttentionMaskInterface.register("my_new_sdpa_mask", my_new_sdpa_mask)
```
Registered attention masks automatically correct the mask format for the attention implementation. For example, FlexAttention uses a [BlockMask](https://docs.pytorch.org/docs/stable/nn.attention.flex_attention.html#torch.nn.attention.flex_attention.BlockMask) format, while SDPA uses a 4D tensor. Without a registered attention mask function, mask creation is skipped and `attention_mask=None` is passed to the model's attention layers.

This is the default signature for an attention mask function.
```python
def custom_attention_mask(
    ...
) -> Optional[torch.Tensor]:
```
The `mask_function` argument is a `Callable` that mimics PyTorch's [mask_mod](https://pytorch.org/blog/flexattention/) functions. It takes 4 indices as input and returns a boolean that indicates whether the position should take part in the attention computation.
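
For example, a causal rule written in this style only compares the query and key/value positions. The function below is a hypothetical illustration of the 4-index convention, not an API from the library.

```py
def causal_mask_mod(batch_idx, head_idx, q_idx, kv_idx):
    # a position may only attend to itself and earlier positions
    return kv_idx <= q_idx
```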
If you can't use `mask_function` to create your mask, try adapting the torch export [workaround](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/executorch.py) instead.