
Conversation

@deepshnv commented on Dec 5, 2025

Efficient VLM inference with llama-mtmd-cli for high-resolution images while keeping GPU VRAM requirements low. Three optimizations enable this:

i) offload the vision model weights (only) to the CPU and stream them to the device at runtime;
ii) reorder LLM model init so that the CLIP model has finished encoding the image and freed its VRAM before the LLM is loaded;
iii) tiled flash attention, avoiding the 2 GB / INT_MAX limit of ggml_cuda_cpy for larger images (a rough sketch of the tiling idea follows below).
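
To make the INT_MAX constraint in iii) concrete: the limit mentioned above means a single copy over a tensor from a very large image can exceed what a 32-bit-indexed CUDA path can address. The sketch below is not the PR's tiled flash attention or the actual ggml_cuda_cpy code; it is a minimal, hypothetical illustration of the general workaround, i.e. splitting one oversized copy into tiles that each stay below INT_MAX elements.

```cpp
// Hypothetical illustration only -- not the PR's actual tiled flash attention
// or ggml_cuda_cpy code. A copy kernel that uses 32-bit indexing (standing in
// for the INT_MAX-limited path) is launched tile by tile, so no single launch
// has to address more than INT_MAX elements.
#include <cuda_runtime.h>
#include <algorithm>
#include <climits>
#include <cstdint>

__global__ void copy_f32_32bit(const float * src, float * dst, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x; // 32-bit index
    if (i < n) {
        dst[i] = src[i];
    }
}

// Copy n_elems floats with one kernel launch per tile; each tile is small
// enough to be indexed safely with a 32-bit int.
static void copy_f32_tiled(const float * src, float * dst, int64_t n_elems, cudaStream_t stream) {
    const int64_t max_tile = INT_MAX / 2; // stay well under the limit
    for (int64_t off = 0; off < n_elems; off += max_tile) {
        const int n     = (int) std::min(max_tile, n_elems - off);
        const int block = 256;
        const int grid  = (n + block - 1) / block;
        copy_f32_32bit<<<grid, block, 0, stream>>>(src + off, dst + off, n);
    }
}
```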

@ngxson changed the title from "Efficient inference using llama-mtmd-cli for high resolution images with reduced GPU VRAM usage (#17801)" to "(CUDA-only) Efficient inference using llama-mtmd-cli for high resolution images with reduced GPU VRAM usage (#17801)" on Dec 6, 2025
…t_llm_init_callback and mtmd_set_llm_context remain exported as MTMD_API since they are needed in mtmd-cli.cpp; the remaining JIT functions are now internal
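
For context on how those two remaining MTMD_API hooks fit the reordered init in ii): the sketch below is purely illustrative and does not use the PR's real declarations (the placeholder types and callback shape are assumptions made up for the example). It only shows the deferred-init ordering the description outlines, where the LLM context is created after the vision encoder has finished and released its VRAM.

```cpp
// Hypothetical sketch -- the placeholder types and callback shape below are
// assumptions for illustration, NOT the actual mtmd_set_llm_init_callback /
// mtmd_set_llm_context declarations from this PR. The point is only the
// ordering: register the LLM init first, run the vision encoder, then build
// the LLM context once the encoder's VRAM has been freed.
#include <functional>
#include <memory>
#include <cstdio>

struct llm_context { /* stand-in for the real LLM context */ };

struct cli_state {
    std::function<std::unique_ptr<llm_context>()> init_llm; // registered up front
    std::unique_ptr<llm_context> llm;                       // created later (deferred)
};

int main() {
    cli_state state;

    // 1) register how to build the LLM, without building it yet
    state.init_llm = [] {
        std::puts("init LLM (vision encode already done, VRAM freed)");
        return std::make_unique<llm_context>();
    };

    // 2) ... encode the image with the CLIP model here; its VRAM is
    //        released once encoding finishes ...

    // 3) now run the deferred init and hand the context to the rest of the CLI
    state.llm = state.init_llm();
    return 0;
}
```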