10 changes: 8 additions & 2 deletions _posts/2025-08-18-diff-distill.md
@@ -38,7 +38,7 @@ Diffusion and flow-based models<d-cite key="ho2020denoising, lipman_flow_2023, a

At their core, diffusion models (equivalently, flow matching models) operate by iteratively refining noisy data into high-quality outputs through a series of denoising steps. Similar to divide-and-conquer algorithms<d-footnote>Common examples include Mergesort, median finding, and the Fast Fourier Transform.</d-footnote>, diffusion models first *divide* the difficult denoising task into subtasks and *conquer* one of them at a time during training. To obtain a sample, however, we make a sequence of recursive predictions, which means we need to *conquer* the entire task end-to-end.
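To make the "sequence of recursive predictions" concrete, below is a minimal sketch of a generic Euler-style sampler for the probability flow ODE. The velocity network `v_theta`, the uniform time grid, and the step count are illustrative assumptions rather than the setup of any particular model; each loop iteration costs one network evaluation.

```python
import torch

@torch.no_grad()
def euler_sample(v_theta, x_T, num_steps=50):
    """Integrate the probability flow ODE from noise (t=1) to data (t=0).

    v_theta:   a learned velocity field v_theta(x, t) -- an illustrative placeholder.
    x_T:       a batch of Gaussian noise samples.
    num_steps: each loop iteration is one denoising sub-task, so sampling
               costs `num_steps` function evaluations (NFEs).
    """
    x = x_T
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        t_batch = torch.full((x.shape[0],), float(t))   # broadcast t over the batch
        v = v_theta(x, t_batch)                         # one function evaluation
        x = x + (t_next - t) * v                        # explicit Euler step
    return x
```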

This challenge has spurred research into acceleration strategies across multiple granular levels, including hardware optimization, mixed precision training<d-cite key="micikevicius2017mixed"></d-cite>, [quantization](https://github.com/bitsandbytes-foundation/bitsandbytes), and parameter-efficient fine-tuning<d-cite key="hu2021lora"></d-cite>. In this blog, we focus on an orthogonal approach named **Ordinary Differential Equation (ODE) distillation**. This method introduces an auxiliary structure that bypasses explicit ODE solving, thereby reducing the Number of Function Evaluations (NFEs). As a result, we can generate high-quality samples with fewer denoising steps.
This challenge has spurred research into acceleration strategies at multiple levels of granularity, including hardware optimization, mixed precision training<d-cite key="micikevicius2017mixed"></d-cite>, [quantization](https://github.com/bitsandbytes-foundation/bitsandbytes), parameter-efficient fine-tuning<d-cite key="hu2021lora"></d-cite>, and advanced solvers<d-cite key="lu2025dpm"></d-cite>. In this blog, we focus on an orthogonal approach named **Ordinary Differential Equation (ODE) distillation**. This method introduces an auxiliary structure that bypasses explicit ODE solving, thereby reducing the Number of Function Evaluations (NFEs). As a result, we can generate high-quality samples with fewer denoising steps.

Distillation, in general, is a technique that transfers knowledge from a complex, high-performance model (the *teacher*) to a more efficient, customized model (the *student*). Recent distillation methods have achieved remarkable reductions in sampling steps, from hundreds to a few and even **one** step, while preserving the sample quality. This advancement paves the way for real-time applications and deployment in resource-constrained environments.
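As a caricature of the teacher/student setup, the sketch below regresses a one-step student onto samples produced by a frozen multi-step teacher. This is only the simplest way to instantiate distillation, and the methods surveyed later avoid simulating the teacher at training time; the names `student` and `teacher_sampler` are placeholders, not any paper's API.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher_sampler, noise, optimizer):
    """One regression-style distillation update (illustrative only).

    teacher_sampler: a frozen, expensive multi-step sampler (e.g. the Euler
                     sketch above) that maps noise to a clean sample.
    student:         a network meant to reproduce that sample in one forward pass.
    """
    with torch.no_grad():
        target = teacher_sampler(noise)    # many NFEs, gradients not needed
    pred = student(noise)                  # a single NFE
    loss = F.mse_loss(pred, target)        # match the teacher's output
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```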

@@ -252,6 +252,8 @@ $$
\dv{t}f^\theta_{t \to 0}(\mathbf{x}, t, 0) = 0.
$$

This is intuitive since every point on the same probability flow ODE (\ref{eq:1}) trajectory should be mapped to the same clean data point $$\mathbf{x}_0$$.
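To spell this out, write the velocity field of the probability flow ODE (\ref{eq:1}) as $$\mathbf{v}(\mathbf{x}_t, t)$$ (a symbol introduced here only for this derivation). If the flow map is exact, then $$f^\theta_{t \to 0}(\mathbf{x}_t, t, 0) = \mathbf{x}_0$$ for every $$t$$ along a trajectory, so the chain rule gives

$$\require{physics}
\dv{t}f^\theta_{t \to 0}(\mathbf{x}_t, t, 0)
= \pdv{f^\theta_{t \to 0}}{t} + \nabla_{\mathbf{x}} f^\theta_{t \to 0} \cdot \dv{\mathbf{x}_t}{t}
= \pdv{f^\theta_{t \to 0}}{t} + \nabla_{\mathbf{x}} f^\theta_{t \to 0} \cdot \mathbf{v}(\mathbf{x}_t, t)
= 0,
$$

i.e., the total derivative of a perfect flow map along its own trajectory vanishes, which is exactly the condition above.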

By substituting the parameterization of FACM, we have

$$\require{physics}
@@ -262,9 +264,13 @@ Notice this is equivalent to [MeanFlow](#meanflow) where $$s=0$$. This indicates


<span style="color: blue; font-weight: bold;">Training</span>: FACM training algorithm equipped with our flow map notation. Notice that $$d_1, d_2$$ are $\ell_2$ with cosine loss<d-footnote>$L_{\cos}(\mathbf{x}, \mathbf{y}) = 1 - \dfrac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_{2} \, \|\mathbf{y}\|_{2}}$</d-footnote> and norm $\ell_2$ loss<d-footnote>$L_{\text{norm}}(\mathbf{x}, \mathbf{y}) =\dfrac{\|\mathbf{x}-\mathbf{y}\|^2}{\sqrt{\|\mathbf{x}-\mathbf{y}\|^2+c}}$ where $c$ is a small constant. This is a special case of adaptive L2 loss proposed in MeanFlow<d-cite key="geng2025mean"></d-cite>.</d-footnote> respectively, plus reweighting. Interestingly, they separate the training of FM and CM on disentangled time intervals. When training with CM target, we let $$s=0, t\in[0,1]$$. On the other hand, we set $$t'=2-t, t'\in[1,2]$$ when training with FM anchors.

<div class="row mt-3">
<div class="col-sm mt-3 mt-md-0">
{% include figure.liquid loading="eager" path="/blog/2025/diff-distill/facm_training.png" class="img-fluid rounded z-depth-1" %}
{% include figure.liquid loading="eager" path="/blog/2025/diff-distill/FACM_training.png" class="img-fluid rounded z-depth-1" %}
<div class="caption">
The modified training algorithm of FACM<d-cite key="peng2025flow"></d-cite>. All the notations are adapted to our flow map.
</div>
</div>
</div>

9 changes: 9 additions & 0 deletions assets/bibliography/2025-08-18-diff-distill.bib
@@ -180,4 +180,13 @@ @article{xu2025one
author={Xu, Yilun and Nie, Weili and Vahdat, Arash},
journal={arXiv preprint arXiv:2502.15681},
year={2025}
}

@article{lu2025dpm,
title={{DPM-Solver++}: Fast solver for guided sampling of diffusion probabilistic models},
author={Lu, Cheng and Zhou, Yuhao and Bao, Fan and Chen, Jianfei and Li, Chongxuan and Zhu, Jun},
journal={Machine Intelligence Research},
pages={1--22},
year={2025},
publisher={Springer}
}