
TEI stops in "Warming up model" phase #758

@gururaser

Description


System Info

Hello,

Hugging Face Text Embeddings Inference version: 1.8.3
Image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.3
Model: janni-t/qwen3-embedding-0.6b-tei-onnx

The container stops during the "Warming up model" phase.

My docker-compose configuration:

services:
  embedding:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.3
    ports:
      - "8080:80"
    volumes:
      - embedding_cache:/data
    command: ["--model-id", "janni-t/qwen3-embedding-0.6b-tei-onnx", "--pooling", "mean", "--auto-truncate"]

Logs:

2025-11-16 14:48:19.055 | 2025-11-16T11:48:19.053494Z  INFO text_embeddings_router: router/src/main.rs:203: Args { model_id: "jan**-*/*****-*********-*.**-***-*nnx", revision: None, tokenization_workers: None, dtype: None, pooling: Some(Mean), max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "af8152803796", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-11-16 14:48:19.135 | 2025-11-16T11:48:19.135066Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
2025-11-16 14:48:19.135 | 2025-11-16T11:48:19.135097Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
2025-11-16 14:48:19.588 | 2025-11-16T11:48:19.588526Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_roberta_config.json`
2025-11-16 14:48:19.747 | 2025-11-16T11:48:19.747345Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_distilbert_config.json`
2025-11-16 14:48:19.904 | 2025-11-16T11:48:19.903960Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_camembert_config.json`
2025-11-16 14:48:20.057 | 2025-11-16T11:48:20.056866Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_albert_config.json`
2025-11-16 14:48:20.218 | 2025-11-16T11:48:20.218312Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlm-roberta_config.json`
2025-11-16 14:48:20.372 | 2025-11-16T11:48:20.372777Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlnet_config.json`
2025-11-16 14:48:20.525 | 2025-11-16T11:48:20.525037Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
2025-11-16 14:48:20.525 | 2025-11-16T11:48:20.525299Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
2025-11-16 14:48:20.525 | 2025-11-16T11:48:20.525332Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
2025-11-16 14:48:20.525 | 2025-11-16T11:48:20.525505Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 1.390442692s
2025-11-16 14:48:20.802 | 2025-11-16T11:48:20.801853Z  WARN text_embeddings_router: router/src/lib.rs:191: Could not find a Sentence Transformers config
2025-11-16 14:48:20.802 | 2025-11-16T11:48:20.802231Z  WARN text_embeddings_router: router/src/lib.rs:205: The input sequences will be truncated to 16384 tokens even if the model `max_input_length` is greater than the provided `--max-batch-tokens` (32768 > 16384), as `--auto-truncate` is enabled.
2025-11-16 14:48:20.802 | 2025-11-16T11:48:20.802420Z  INFO text_embeddings_router: router/src/lib.rs:216: Maximum number of tokens per request: 16384
2025-11-16 14:48:20.802 | 2025-11-16T11:48:20.802747Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 20 tokenization workers
2025-11-16 14:48:21.436 | 2025-11-16T11:48:21.436206Z  INFO text_embeddings_router: router/src/lib.rs:264: Starting model backend
2025-11-16 14:48:21.438 | 2025-11-16T11:48:21.437916Z  INFO text_embeddings_backend: backends/src/lib.rs:627: Downloading `model.onnx`
2025-11-16 14:48:21.438 | 2025-11-16T11:48:21.438898Z  INFO text_embeddings_backend: backends/src/lib.rs:641: Downloading `model.onnx_data`
2025-11-16 14:48:21.439 | 2025-11-16T11:48:21.439075Z  INFO text_embeddings_backend: backends/src/lib.rs:381: Model ONNX weights downloaded in 1.188241ms
2025-11-16 14:48:21.439 | 2025-11-16T11:48:21.439090Z  INFO text_embeddings_backend: backends/src/lib.rs:389: Downloading `tokenizer_config.json`
2025-11-16 14:48:27.205 | 2025-11-16T11:48:27.205815Z  INFO text_embeddings_router: router/src/lib.rs:282: Warming up model

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Create a docker-compose.yml like this:

services:
  embedding:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.3
    ports:
      - "8080:80"
    volumes:
      - embedding_cache:/data
    command: ["--model-id", "janni-t/qwen3-embedding-0.6b-tei-onnx", "--pooling", "mean", "--auto-truncate"]

volumes:
  embedding_cache:
    driver: local

Expected behavior

The model should finish warming up and the server should become ready. That is what happens when I use the standard Qwen/Qwen3-Embedding-0.6B model with the CUDA image:

services:
  embedding:
    image: ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.3
    ports:
      - "8080:80"
    volumes:
      - embedding_cache:/data
    command: ["--model-id", "Qwen/Qwen3-Embedding-0.6B", "--dtype", "float16", "--max-client-batch-size", "512", "--max-batch-tokens", "32768"]
    environment:
      - USE_FLASH_ATTENTION=True
    gpus: all
2025-11-13T12:06:25.650647Z  INFO text_embeddings_router: router/src/main.rs:203: Args { model_id: "Qwe*/*****-*********-0.6B", revision: None, tokenization_workers: None, dtype: Some(Float16), pooling: None, max_concurrent_requests: 512, max_batch_tokens: 32768, max_batch_requests: None, max_client_batch_size: 512, auto_truncate: false, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "76f54a599a7b", port: 80, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-11-13T12:06:25.730681Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
2025-11-13T12:06:25.730710Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
2025-11-13T12:06:25.731065Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
2025-11-13T12:06:33.737414Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_roberta_config.json`
2025-11-13T12:06:39.970555Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_distilbert_config.json`
2025-11-13T12:06:47.979089Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_camembert_config.json`
2025-11-13T12:06:55.981490Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_albert_config.json`
2025-11-13T12:07:03.991526Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlm-roberta_config.json`
2025-11-13T12:07:10.376962Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_xlnet_config.json`
2025-11-13T12:07:18.379599Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
2025-11-13T12:07:18.379646Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
2025-11-13T12:07:18.379654Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
2025-11-13T12:07:18.380190Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 56.042275076s
2025-11-13T12:07:18.661817Z  WARN text_embeddings_router: router/src/lib.rs:191: Could not find a Sentence Transformers config
2025-11-13T12:07:18.661844Z  INFO text_embeddings_router: router/src/lib.rs:216: Maximum number of tokens per request: 32768
2025-11-13T12:07:18.662172Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 20 tokenization workers
2025-11-13T12:07:19.258831Z  INFO text_embeddings_router: router/src/lib.rs:264: Starting model backend
2025-11-13T12:07:19.260767Z  INFO text_embeddings_backend: backends/src/lib.rs:586: Downloading `model.safetensors`
2025-11-13T12:07:19.261868Z  INFO text_embeddings_backend: backends/src/lib.rs:421: Model weights downloaded in 1.104616ms
2025-11-13T12:07:19.261889Z  INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:685: Downloading `modules.json`
2025-11-13T12:07:19.262114Z  INFO text_embeddings_backend: backends/src/lib.rs:433: Dense modules downloaded in 229.358µs
2025-11-13T12:07:19.904902Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:506: Starting FlashQwen3 model on Cuda(CudaDevice(DeviceId(1)))
2025-11-13T12:10:10.676754Z  INFO text_embeddings_router: router/src/lib.rs:282: Warming up model
2025-11-13T12:10:13.483120Z  WARN text_embeddings_router: router/src/lib.rs:341: Invalid hostname, defaulting to 0.0.0.0
2025-11-13T12:10:13.485628Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1852: Starting HTTP server: 0.0.0.0:80
2025-11-13T12:10:13.485647Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1853: Ready
