Performance optimization of SDXL inference pipeline #1104

bssrdf · 2025-12-16T17:47:47Z

bssrdf
Dec 16, 2025

Over the past a couple of months, I have been trying to speed up SDXL inference, motivated by #772. Here is a summary of what I have achieved so far.

First, this is what a recent master (commit 8f6c5c2) can do

My optimized pipeline

A new implicit conv2d for cuda backend is added. PR is cuda : Add conv2d Implicit GEMM ggml-org/llama.cpp#15805. Overall, it is much faster than IM2COL and on par with pytroch/cudnn aggregated for all input/filter shapes.
A prototype of full FP16 UNET denoising pipeline is built. Currently the outputs of intermediate layers (FA, MLP, CONV2D etc) are all FP32 tensors. Even though internally these ops are done in FP16 but at the end the results needed to be converted to feed next layer. The new pipeline totally eliminated this conversion.
Enabled CUDA Graph
Operator fusion: mul_mat+add, norm+add+mul
Added NHWC layout support

Combining roughly these 5 optimizations above, on 4090, I can do

	sd.cpp (--diffusion-fa)	sd.cpp (--diffusion-fa, implicit conv2d, fp16 pipeline, cuda graph, fused operators)	diffusion-fast
512x512	8.94it/s	~~12.5it/s~~ 19it/s	~16it/s
768x768	6.29it/s	~~8.5it/s~~ 11.59it/s	~13.5it/s
1024x1024	4.75it/s	~~6.35it/s 6.5it/s 7.17it/s 7.6it/s~~ 7.99it/s	~8it/s

As a side project, I also tried to improve FLUX inference pipeline.

A full BF16 flux denoising pipeline is built. The result is a 30% performance bump. Here shows pytorch/comfyui can already do 2.1it/s more than 1 year ago. Still have a gap but I think the copy op cost can be further reduced.

FLUX Dev Q8_0	sd.cpp (--diffusion-fa)	sd.cpp (--diffusion-fa, bf16 pipeline)	sd.cpp (--diffusion-fa, bf16 pipeline, transposed copy, cuda graph)
1024x1024	1.24it/s	1.62it/s	1.87it/s

@JustMaier, not sure if you and your team are still interested in pursuing using sd.cpp. If so, please let me know and you can give it a try in your production environment. Thanks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Performance optimization of SDXL inference pipeline #1104

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Performance optimization of SDXL inference pipeline #1104

Uh oh!

Uh oh!

bssrdf Dec 16, 2025

Replies: 0 comments

bssrdf
Dec 16, 2025