
Conversation

@deepshnv commented on Dec 5, 2025

Efficient VLM inference with llama-mtmd-cli for high-resolution images while keeping GPU VRAM requirements low. Three optimizations enable this:

i) offload the vision model weights (only) to the CPU and stream them to the device at runtime;
ii) reorder LLM model init so that the CLIP model has finished encoding the image and freed its VRAM before the LLM is loaded;
iii) tiled flash attention, avoiding the 2 GB / INT_MAX limit of ggml_cuda_cpy for larger images (a rough sketch of the tiling idea follows below).
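
To make the INT_MAX constraint in iii) concrete: the limit mentioned above means a single copy over a tensor from a very large image can exceed what a 32-bit-indexed CUDA path can address. The sketch below is not the PR's tiled flash attention or the actual ggml_cuda_cpy code; it is a minimal, hypothetical illustration of the general workaround, i.e. splitting one oversized copy into tiles that each stay below INT_MAX elements.

```cpp
// Hypothetical illustration only -- not the PR's actual tiled flash attention
// or ggml_cuda_cpy code. A copy kernel that uses 32-bit indexing (standing in
// for the INT_MAX-limited path) is launched tile by tile, so no single launch
// has to address more than INT_MAX elements.
#include <cuda_runtime.h>
#include <algorithm>
#include <climits>
#include <cstdint>

__global__ void copy_f32_32bit(const float * src, float * dst, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x; // 32-bit index
    if (i < n) {
        dst[i] = src[i];
    }
}

// Copy n_elems floats with one kernel launch per tile; each tile is small
// enough to be indexed safely with a 32-bit int.
static void copy_f32_tiled(const float * src, float * dst, int64_t n_elems, cudaStream_t stream) {
    const int64_t max_tile = INT_MAX / 2; // stay well under the limit
    for (int64_t off = 0; off < n_elems; off += max_tile) {
        const int n     = (int) std::min(max_tile, n_elems - off);
        const int block = 256;
        const int grid  = (n + block - 1) / block;
        copy_f32_32bit<<<grid, block, 0, stream>>>(src + off, dst + off, n);
    }
}
```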

@ngxson changed the title from "Efficient inference using llama-mtmd-cli for high resolution images with reduced GPU VRAM usage (#17801)" to "(CUDA-only) Efficient inference using llama-mtmd-cli for high resolution images with reduced GPU VRAM usage (#17801)" on Dec 6, 2025
…t_llm_init_callback and mtmd_set_llm_context remain exported as MTMD_API since they are needed in mtmd-cli.cpp; the remaining JIT functions are now internal
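
For context on how those two remaining MTMD_API hooks fit the reordered init in ii): the sketch below is purely illustrative and does not use the PR's real declarations (the placeholder types and callback shape are assumptions made up for the example). It only shows the deferred-init ordering the description outlines, where the LLM context is created after the vision encoder has finished and released its VRAM.

```cpp
// Hypothetical sketch -- the placeholder types and callback shape below are
// assumptions for illustration, NOT the actual mtmd_set_llm_init_callback /
// mtmd_set_llm_context declarations from this PR. The point is only the
// ordering: register the LLM init first, run the vision encoder, then build
// the LLM context once the encoder's VRAM has been freed.
#include <functional>
#include <memory>
#include <cstdio>

struct llm_context { /* stand-in for the real LLM context */ };

struct cli_state {
    std::function<std::unique_ptr<llm_context>()> init_llm; // registered up front
    std::unique_ptr<llm_context> llm;                       // created later (deferred)
};

int main() {
    cli_state state;

    // 1) register how to build the LLM, without building it yet
    state.init_llm = [] {
        std::puts("init LLM (vision encode already done, VRAM freed)");
        return std::make_unique<llm_context>();
    };

    // 2) ... encode the image with the CLIP model here; its VRAM is
    //        released once encoding finishes ...

    // 3) now run the deferred init and hand the context to the rest of the CLI
    state.llm = state.init_llm();
    return 0;
}
```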