HTTP API Specification
ThemisDB provides a comprehensive RESTful HTTP API for LLM operations, enabling inference, model management, LoRA operations, and statistics retrieval.
Base URL: /api/v1/llm
Authentication: Bearer token or API key (configured in llm_config.yaml)
Content-Type: application/json
POST /api/v1/llm/inference
Execute LLM inference with a prompt.
Request Body:
{
"prompt": "What is ThemisDB?",
"model": "mistral-7b",
"lora_adapter": "general-qa",
"max_tokens": 512,
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"stop_sequences": ["\n\n", "END"],
"stream": false
}
Response:
{
"text": "ThemisDB is a distributed graph database...",
"tokens_generated": 45,
"inference_time_ms": 150,
"model_used": "mistral-7b",
"lora_used": "general-qa",
"cache_hit": false,
"finish_reason": "stop"
}
Status Codes:
- 200 OK: Successful inference
- 400 Bad Request: Invalid parameters
- 404 Not Found: Model or LoRA not found
- 429 Too Many Requests: Queue full (backpressure)
- 500 Internal Server Error: Inference failure
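For reference, a minimal Python sketch of a client call to this endpoint; the base URL, token value, and use of the `requests` library are assumptions for illustration, not part of the specification:

```python
import requests

# Hypothetical base URL and token; adjust for your deployment.
BASE_URL = "http://localhost:8080/api/v1/llm"
TOKEN = "<token>"

payload = {
    "prompt": "What is ThemisDB?",
    "model": "mistral-7b",
    "max_tokens": 512,
    "temperature": 0.7,
}

resp = requests.post(
    f"{BASE_URL}/inference",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,  # recommended inference timeout (see best practices below)
)
resp.raise_for_status()
result = resp.json()
print(result["text"], result["tokens_generated"])
```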
POST /api/v1/llm/rag
Execute RAG (Retrieval-Augmented Generation) inference with vector search.
Request Body:
{
"query": "What are the main provisions in contract clause 3.4?",
"collection": "legal_documents",
"top_k": 5,
"similarity_threshold": 0.8,
"model": "mistral-7b",
"lora_adapter": "legal-qa",
"max_tokens": 512,
"temperature": 0.7,
"context_assembly": "concat"
}
Response:
{
"text": "Contract clause 3.4 contains the following provisions...",
"tokens_generated": 87,
"inference_time_ms": 210,
"documents_retrieved": 5,
"documents_used": 3,
"retrieval_time_ms": 45,
"model_used": "mistral-7b",
"lora_used": "legal-qa",
"cache_hit": false,
"finish_reason": "stop"
}
POST /api/v1/llm/inference (with stream: true)
Stream tokens as they are generated.
Request Body:
{
"prompt": "Write a story about...",
"model": "mistral-7b",
"stream": true,
"max_tokens": 1024
}
Response (Server-Sent Events):
data: {"token": "Once", "index": 0}
data: {"token": " upon", "index": 1}
data: {"token": " a", "index": 2}
data: {"token": " time", "index": 3}
...
data: {"done": true, "tokens_generated": 245, "inference_time_ms": 2150}
Headers:
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive
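A minimal sketch of consuming this stream from Python, assuming the SSE framing shown above; the `requests` usage and parsing logic are illustrative, not part of the specification:

```python
import json
import requests

# Hypothetical base URL and token.
resp = requests.post(
    "http://localhost:8080/api/v1/llm/inference",
    json={"prompt": "Write a story about...", "model": "mistral-7b",
          "stream": True, "max_tokens": 1024},
    headers={"Authorization": "Bearer <token>"},
    stream=True,
)
resp.raise_for_status()

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue  # skip blank keep-alive lines
    event = json.loads(line[len("data: "):])
    if event.get("done"):
        print(f"\n[{event['tokens_generated']} tokens in {event['inference_time_ms']} ms]")
        break
    print(event["token"], end="", flush=True)
```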
POST /api/v1/llm/embed
Generate embeddings for text.
Request Body:
{
"text": "Sample text for embedding generation",
"model": "mistral-7b",
"normalize": true
}
Response:
{
"embedding": [0.123, -0.456, 0.789, ...],
"dimension": 4096,
"model_used": "mistral-7b",
"inference_time_ms": 25
}
GET /api/v1/llm/models
List all available models.
Response:
{
"models": [
{
"model_id": "mistral-7b",
"path": "/models/mistral-7b.gguf",
"status": "loaded",
"size_bytes": 6400000000,
"format": "GGUF",
"n_layers": 32,
"loaded_timestamp": "2024-01-15T10:30:00Z",
"last_used": "2024-01-15T12:45:30Z",
"usage_count": 1247
},
{
"model_id": "llama-3-8b",
"status": "available",
"size_bytes": 8500000000,
"format": "GGUF"
}
]
}
POST /api/v1/llm/models/load
Load a model into memory.
Request Body:
{
"model_id": "mistral-7b",
"path": "/models/mistral-7b.gguf",
"options": {
"n_gpu_layers": 32,
"n_ctx": 4096,
"n_batch": 512,
"n_threads": 8,
"use_mmap": true,
"use_mlock": false
},
"pin": false
}
Response:
{
"model_id": "mistral-7b",
"status": "loaded",
"load_time_ms": 2850,
"memory_used_mb": 6200
}
POST /api/v1/llm/models/unload
Unload a model from memory.
Request Body:
{
"model_id": "mistral-7b"
}
Response:
{
"model_id": "mistral-7b",
"status": "unloaded",
"memory_freed_mb": 6200
}
GET /api/v1/llm/models/{model_id}
Get detailed information about a specific model.
Response:
{
"model_id": "mistral-7b",
"path": "/models/mistral-7b.gguf",
"status": "loaded",
"size_bytes": 6400000000,
"format": "GGUF",
"version": "v0.3",
"architecture": "llama",
"n_layers": 32,
"n_heads": 32,
"n_embd": 4096,
"n_vocab": 32000,
"context_length": 8192,
"loaded_timestamp": "2024-01-15T10:30:00Z",
"last_used": "2024-01-15T12:45:30Z",
"usage_count": 1247,
"memory_usage_mb": 6200,
"gpu_layers": 32,
"pinned": false
}
POST /api/v1/llm/models/ingest
Upload and ingest a model into ThemisDB blob storage.
Request (multipart/form-data):
POST /api/v1/llm/models/ingest
Content-Type: multipart/form-data
--boundary
Content-Disposition: form-data; name="model_id"
llama-3-8b
--boundary
Content-Disposition: form-data; name="file"; filename="llama-3-8b.gguf"
Content-Type: application/octet-stream
[binary data]
--boundary
Content-Disposition: form-data; name="metadata"
Content-Type: application/json
{
"version": "v1.0",
"description": "Llama 3 8B quantized Q4",
"shard_affinity": "legal",
"replicate": true
}
--boundary--
Response:
{
"model_id": "llama-3-8b",
"version": "v1.0",
"urn": "urn:themis:model:llama-3-8b:v1",
"size_bytes": 8500000000,
"checksum": "sha256:abc123...",
"upload_time_ms": 45000,
"replication_status": "pending",
"shards_replicated": 0,
"total_shards": 4
}
Note: For large models, use chunked upload with Content-Range headers.
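A minimal Python sketch of a single-request (non-chunked) ingest; the form field names follow the multipart example above, while the client library, file path, and timeout are illustrative assumptions:

```python
import json
import requests

# Hypothetical path and token; form fields mirror the spec above.
metadata = {"version": "v1.0", "description": "Llama 3 8B quantized Q4", "replicate": True}

with open("/path/to/llama-3-8b.gguf", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/api/v1/llm/models/ingest",
        headers={"Authorization": "Bearer <token>"},
        data={"model_id": "llama-3-8b", "metadata": json.dumps(metadata)},
        files={"file": ("llama-3-8b.gguf", f, "application/octet-stream")},
        timeout=600,  # large uploads can take minutes
    )
resp.raise_for_status()
result = resp.json()
print(result["urn"], result["replication_status"])
```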
GET /api/v1/llm/loras
List all available LoRA adapters.
Query Parameters:
- model: Filter by base model
- status: Filter by status (loaded, available)
Response:
{
"loras": [
{
"lora_id": "legal-qa",
"base_model": "mistral-7b",
"path": "/loras/legal-qa.bin",
"status": "loaded",
"size_bytes": 20971520,
"rank": 8,
"alpha": 16,
"loaded_timestamp": "2024-01-15T11:00:00Z",
"usage_count": 523
},
{
"lora_id": "medical-qa",
"base_model": "mistral-7b",
"status": "available",
"size_bytes": 20971520
}
]
}
POST /api/v1/llm/loras/load
Load a LoRA adapter.
Request Body:
{
"lora_id": "legal-qa",
"base_model": "mistral-7b",
"path": "/loras/legal-qa.bin",
"scale": 1.0
}
Response:
{
"lora_id": "legal-qa",
"base_model": "mistral-7b",
"status": "loaded",
"load_time_ms": 150,
"slot": 3
}
POST /api/v1/llm/loras/unload
Unload a LoRA adapter.
Request Body:
{
"lora_id": "legal-qa",
"base_model": "mistral-7b"
}
Response:
{
"lora_id": "legal-qa",
"status": "unloaded",
"slot_freed": 3
}
GET /api/v1/llm/loras/{lora_id}
Get detailed information about a specific LoRA.
Query Parameters:
- base_model: Base model ID
Response:
{
"lora_id": "legal-qa",
"base_model": "mistral-7b",
"path": "/loras/legal-qa.bin",
"status": "loaded",
"size_bytes": 20971520,
"rank": 8,
"alpha": 16,
"target_modules": ["q_proj", "v_proj"],
"loaded_timestamp": "2024-01-15T11:00:00Z",
"last_used": "2024-01-15T12:40:15Z",
"usage_count": 523,
"slot": 3
}
GET /api/v1/llm/stats
Get comprehensive LLM system statistics.
Response:
{
"uptime_seconds": 86400,
"total_requests": 15234,
"successful_requests": 15102,
"failed_requests": 132,
"active_requests": 8,
"queued_requests": 23,
"throughput": {
"requests_per_second": 128.5,
"tokens_per_second": 3456.2
},
"latency": {
"p50_ms": 24,
"p95_ms": 65,
"p99_ms": 180,
"avg_ms": 28
},
"models": {
"loaded": 2,
"total_available": 5,
"memory_used_mb": 12400
},
"loras": {
"loaded": 8,
"total_available": 24,
"memory_used_mb": 160
},
"workers": {
"total": 4,
"busy": 3,
"idle": 1,
"utilization": 0.75
},
"gpu": {
"utilization": 0.89,
"memory_used_mb": 18456,
"memory_total_mb": 24576
}
}
GET /api/v1/llm/cache/stats
Get cache performance statistics.
Response:
{
"response_cache": {
"hits": 12456,
"misses": 2778,
"hit_rate": 0.818,
"total_entries": 5432,
"memory_used_mb": 890,
"avg_lookup_time_ms": 1.8
},
"prefix_cache": {
"hits": 8934,
"misses": 4823,
"hit_rate": 0.649,
"total_entries": 2145,
"memory_used_mb": 125,
"avg_tokens_saved": 45.3
},
"model_metadata_cache": {
"hits": 45678,
"misses": 123,
"hit_rate": 0.997,
"total_entries": 5
},
"lora_metadata_cache": {
"hits": 23456,
"misses": 245,
"hit_rate": 0.990,
"total_entries": 24
},
"kv_cache_buffer_pool": {
"total_buffers": 8,
"active_buffers": 4,
"buffer_reuse_count": 12456
}
}
GET /api/v1/llm/workers
Get per-worker statistics.
Response:
{
"workers": [
{
"worker_id": 0,
"status": "busy",
"current_request_id": "req_abc123",
"requests_processed": 3821,
"total_processing_time_ms": 456782,
"avg_processing_time_ms": 119.5,
"utilization": 0.92
},
{
"worker_id": 1,
"status": "idle",
"requests_processed": 3756,
"total_processing_time_ms": 441234,
"avg_processing_time_ms": 117.5,
"utilization": 0.88
}
]
}
GET /api/v1/llm/health
Check LLM service health.
Response:
{
"status": "healthy",
"timestamp": "2024-01-15T12:45:30Z",
"checks": {
"models_loaded": true,
"workers_active": true,
"gpu_available": true,
"queue_ok": true
}
}
Status Codes:
- 200 OK: Healthy
- 503 Service Unavailable: Unhealthy
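A minimal liveness-probe sketch against this endpoint; the polling pattern, URL, and function name are assumptions for illustration:

```python
import requests

# Hypothetical base URL; a 503 indicates one of the checks above failed.
def llm_service_healthy(base_url: str = "http://localhost:8080/api/v1/llm") -> bool:
    try:
        resp = requests.get(f"{base_url}/health", timeout=5)
    except requests.RequestException:
        return False
    return resp.status_code == 200 and resp.json().get("status") == "healthy"
```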
POST /api/v1/llm/cache/clear
Clear caches (response, prefix, or all).
Request Body:
{
"cache_type": "response"
}
Cache Types:
- response: Response cache only
- prefix: Prefix cache only
- all: All caches
Response:
{
"cleared": "response",
"entries_removed": 5432,
"memory_freed_mb": 890
}
All error responses follow this format:
{
"error": {
"code": "MODEL_NOT_FOUND",
"message": "Model 'invalid-model' not found",
"details": {
"available_models": ["mistral-7b", "llama-3-8b"]
}
}
}
Error Codes:
- INVALID_REQUEST: Malformed request
- MODEL_NOT_FOUND: Requested model not found
- LORA_NOT_FOUND: Requested LoRA not found
- MODEL_LOAD_FAILED: Failed to load model
- INFERENCE_FAILED: Inference error
- QUEUE_FULL: Request queue full
- INSUFFICIENT_MEMORY: Not enough memory
- INVALID_PARAMETERS: Invalid inference parameters
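A hedged sketch of how a client might map this error format to application behavior; the dispatch choices and function name are illustrative, not prescribed by the API:

```python
import requests

def run_inference(payload: dict, base_url: str, token: str) -> dict:
    resp = requests.post(
        f"{base_url}/inference",
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    if resp.ok:
        return resp.json()

    error = resp.json().get("error", {})
    code = error.get("code")
    if code == "MODEL_NOT_FOUND":
        # The details block lists models the caller could fall back to.
        available = error.get("details", {}).get("available_models", [])
        raise ValueError(f"Unknown model; available: {available}")
    if code == "QUEUE_FULL":
        raise RuntimeError("Backpressure: retry later with exponential backoff")
    raise RuntimeError(f"{code}: {error.get('message')}")
```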
API endpoints are rate-limited per API key:
Headers:
- X-RateLimit-Limit: Maximum requests per minute
- X-RateLimit-Remaining: Remaining requests in current window
- X-RateLimit-Reset: Timestamp when limit resets
Response (429 Too Many Requests):
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "Rate limit exceeded. Retry after 42 seconds.",
"retry_after_seconds": 42
}
}
All API requests require Bearer Token authentication.
Header:
Authorization: Bearer <token>
Example:
curl -X POST http://localhost:8080/api/v1/llm/inference \
-H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..." \
-H "Content-Type: application/json" \
-d '{"prompt": "What is ThemisDB?", "model": "mistral-7b"}'Token Format: JWT (JSON Web Token)
Token Acquisition: Obtain from ThemisDB authentication endpoint:
curl -X POST http://localhost:8080/api/v1/auth/login \
-H "Content-Type: application/json" \
-d '{"username": "user", "password": "pass"}' \
| jq -r '.token'
Token Expiration: Configurable (default: 24 hours)
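A Python equivalent of the token flow above, as a sketch; the login endpoint and field names are taken from the curl example, and error handling is omitted:

```python
import requests

# Hypothetical credentials; the login endpoint mirrors the curl example above.
auth = requests.post(
    "http://localhost:8080/api/v1/auth/login",
    json={"username": "user", "password": "pass"},
)
token = auth.json()["token"]

# Use the token for subsequent LLM API calls until it expires.
headers = {"Authorization": f"Bearer {token}"}
resp = requests.get("http://localhost:8080/api/v1/llm/models", headers=headers)
print(resp.json())
```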
Unauthorized Response (401):
{
"error": {
"code": "UNAUTHORIZED",
"message": "Invalid or expired token"
}
}
The API version is included in the URL: /api/v1/llm/*
Future versions will use /api/v2/llm/*, etc.
Simple Inference:
curl -X POST http://localhost:8080/api/v1/llm/inference \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{
"prompt": "What is ThemisDB?",
"model": "mistral-7b",
"max_tokens": 100
}'
RAG Query:
curl -X POST http://localhost:8080/api/v1/llm/rag \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{
"query": "Contract provisions in clause 3.4",
"collection": "legal_docs",
"top_k": 5,
"lora_adapter": "legal-qa"
}'
Model Upload:
curl -X POST http://localhost:8080/api/v1/llm/models/ingest \
-H "Authorization: Bearer <token>" \
-F "model_id=llama-3-8b" \
-F "file=@/path/to/llama-3-8b.gguf" \
-F 'metadata={"version":"v1.0","replicate":true}'
- Use streaming for long responses to improve perceived latency
- Leverage caching by structuring similar prompts consistently
- Pre-load frequently used models and LoRAs to avoid cold starts
- Monitor cache hit rates and adjust similarity thresholds
- Use batch inference via multiple concurrent requests for throughput
- Set appropriate timeouts (recommend 30s for inference, 5min for model loading)
- Handle 429 errors with exponential backoff (see the sketch at the end of this section)
- Use RAG endpoint instead of manual vector search + inference
- Response caching: 75x speedup for cache hits
- Prefix caching: 65% hit rate, ~45 tokens saved per hit
- Concurrent requests: 128 req/s with 4 workers
- Model loading: ~3s cold start, ~0ms cached
- LoRA switching: ~5ms per switch
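To illustrate the backoff recommendation above, a hedged Python sketch that honors retry_after_seconds from the 429 response when present; the retry count, sleep bound, and function name are arbitrary assumptions:

```python
import time
import requests

def post_with_backoff(url: str, payload: dict, token: str, max_retries: int = 5) -> dict:
    delay = 1.0
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload,
                             headers={"Authorization": f"Bearer {token}"}, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Prefer the server-provided hint, otherwise back off exponentially.
        retry_after = resp.json().get("error", {}).get("retry_after_seconds", delay)
        time.sleep(min(retry_after, 60))
        delay *= 2
    raise RuntimeError("Rate limit still exceeded after retries")
```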