Note
Thank you for visiting! This project is currently a work in progress. Features, documentation, and deployment configurations are actively being developed and may change frequently.
OmniPDF is a PDF analyzer capable of translation, summarization, and captioning.
OmniPDF follows a microservices architecture with centralized orchestration:
- pdf-processor-service: Main hub that coordinates all processing workflows
- Processing services: Specialized services for extraction, translation, rendering, and embedding
- Data layer: Redis (sessions), ChromaDB (vectors), MinIO (files)
- AI/ML layer: vLLM text and vision-language models
- Service mesh layer: Istio for mTLS, traffic management, and observability (prestaging/staging/production)
OmniPDF supports multiple deployment environments, using Docker Compose for local development and Kubernetes/OpenShift with Helm everywhere else:
- Development: Docker Compose for local development
- Pre-staging: CodeReady Containers (CRC) with Istio Service Mesh + Helm charts
- Staging: Offline OpenShift Container Platform (OCP) with organization's Istio + Helm
- Production: Offline OpenShift Container Platform (OCP) with organization's Istio + Helm
Container Registry Patterns:
- Development: Local Docker images
- Pre-staging: default-route-openshift-image-registry.apps-crc.testing/omnipdf/SERVICE_NAME
- Staging/Production: Internal/disconnected registries (images must be pre-mirrored)
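The registry patterns above can be captured in a small helper that maps an environment and a service name to an image reference. This is only a sketch: `image_ref` and `OMNIPDF_REGISTRY` are illustrative names that do not appear in the project, the bare service name used for development is an assumption about how local images are tagged, and the `omnipdf/` path on the disconnected registry is assumed to mirror the CRC layout.

```shell
#!/bin/sh
# Sketch: build an image reference for a service in a given environment.
# image_ref and OMNIPDF_REGISTRY are illustrative names, not project APIs.

CRC_REGISTRY="default-route-openshift-image-registry.apps-crc.testing"

image_ref() (
  env="$1"
  service="$2"
  case "$env" in
    development)
      # Assumption: local images are tagged with the bare service name.
      printf '%s\n' "$service"
      ;;
    prestaging)
      printf '%s/omnipdf/%s\n' "$CRC_REGISTRY" "$service"
      ;;
    staging|production)
      # Disconnected registries differ per site, so the host must be supplied.
      if [ -z "${OMNIPDF_REGISTRY:-}" ]; then
        echo "OMNIPDF_REGISTRY must be set for $env" >&2
        return 1
      fi
      # Assumption: the mirrored images keep the omnipdf/ project path.
      printf '%s/omnipdf/%s\n' "$OMNIPDF_REGISTRY" "$service"
      ;;
    *)
      echo "unknown environment: $env" >&2
      return 1
      ;;
  esac
)

# Example: print the prestaging reference for one service.
image_ref prestaging pdf-extraction-service
```

Failing fast when no registry host is supplied keeps a staging or production deploy from silently falling back to a registry it cannot reach.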
# Start all services
docker compose up --build
# Start with GPU support (for LLM services)
docker compose -f docker-compose.gpu.yml up --build
# Deploy individual service with explicit environment
helm install pdf-extraction-service ./helm/pdf-extraction-service \
  --values ./helm/pdf-extraction-service/values-prestaging.yaml \
  --namespace omnipdf
# Deploy all services using deployment script
./scripts/deploy-helm-charts.sh --all --env prestaging
# Deploy RBAC only (13 individual service roles - should be deployed first)
./scripts/deploy-helm-charts.sh --service rbac --env prestaging
For the prestaging environment in CRC with full service mesh capabilities:
# 1. Install Istio control plane
./istio-1.27.1/bin/istioctl install --set values.defaultRevision=default -y
# 2. Create namespace with sidecar injection
oc create namespace omnipdf-prestaging
oc label namespace omnipdf-prestaging istio-injection=enabled
# 3. Deploy Istio Gateway and routing
helm install istio-gateway ./helm/istio-gateway \
  --namespace omnipdf-prestaging \
  --values ./helm/istio-gateway/values-prestaging.yaml
# 4. Deploy RBAC first (individual service roles)
helm install rbac ./helm/rbac \
  --namespace omnipdf-prestaging
# 5. Deploy services with Istio sidecars
for service in frontend pdf-processor-service embedder-service chromadb redis minio cleaner pdf-extraction-service docling-translation-service pdf-renderer-service image-captioner-service metadata-service; do
  helm install "$service" "./helm/$service" \
    --namespace omnipdf-prestaging \
    --values "./helm/$service/values-prestaging.yaml"
done
Istio Features Enabled:
- mTLS: Automatic mutual TLS between all services
- Traffic Management: Intelligent routing and load balancing
- Observability: Distributed tracing and metrics
- Security Policies: Fine-grained access control
See helm/istio-gateway/INSTALL.md for detailed setup instructions.
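Before running the rollout, it can help to preview the Helm commands in the order the steps describe (gateway, then RBAC, then the services). The sketch below only prints the commands and never calls `helm`; `plan_deploy` is an illustrative name, while the chart paths, namespace, and service list are copied from the prestaging steps.

```shell
#!/bin/sh
# Sketch: print the Helm commands for a prestaging rollout in order,
# without executing them. plan_deploy is an illustrative name.

NAMESPACE="omnipdf-prestaging"
SERVICES="frontend pdf-processor-service embedder-service chromadb redis minio cleaner pdf-extraction-service docling-translation-service pdf-renderer-service image-captioner-service metadata-service"

plan_deploy() (
  # Gateway and routing first, with the prestaging values file.
  echo "helm install istio-gateway ./helm/istio-gateway --namespace $NAMESPACE --values ./helm/istio-gateway/values-prestaging.yaml"
  # RBAC before any workload so the per-service roles already exist.
  echo "helm install rbac ./helm/rbac --namespace $NAMESPACE"
  # One release per service, each with its own prestaging values file.
  for service in $SERVICES; do
    echo "helm install $service ./helm/$service --namespace $NAMESPACE --values ./helm/$service/values-prestaging.yaml"
  done
)

plan_deploy
```

Reviewing the printed plan first makes it easier to spot a missing chart or a wrong namespace before anything reaches the cluster.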
OmniPDF implements defense-in-depth security with multiple layers:
- Individual service accounts for each service with per-service secret isolation
- 13 individual RBAC roles - one role per service aligned with C4 architecture:
  pdf-processor-service-role, pdf-extraction-service-role, docling-translation-service-role, embedder-service-role, pdf-renderer-service-role, image-captioner-service-role, metadata-service-role, minio-role, chromadb-role, redis-role, frontend-role, nginx-gateway-role, cleaner-role
- Zero-trust security - each service accesses only required services per C4 diagram
- Complete audit trail for inter-service communication
OmniPDF implements comprehensive zero-trust network policies with explicit service-to-service communication rules:
| Service | Ingress (Who can call this service) | Egress (What this service can call) |
|---|---|---|
| nginx | • External traffic (users) | • istio-gateway:80/443 • DNS resolution |
| istio-gateway | • nginx | • frontend:8501 • pdf-processor-service:8000 • DNS resolution |
| frontend | • istio-gateway | • pdf-processor-service:8000 • DNS resolution |
| pdf-processor-service | • istio-gateway • frontend | • pdf-extraction-service:8000 • docling-translation-service:8000 • pdf-renderer-service:8000 • embedder-service:8000 • metadata-service:8000 • minio:9000 • redis:6379 • DNS resolution |
| pdf-extraction-service | • pdf-processor-service | • image-captioner-service:8000 • minio:9000 • redis:6379 • DNS resolution |
| docling-translation-service | • pdf-processor-service | • minio:9000 • redis:6379 • DNS resolution • HTTP/HTTPS (external vLLM text model) |
| pdf-renderer-service | • pdf-processor-service | • minio:9000 • redis:6379 • DNS resolution |
| embedder-service | • pdf-processor-service | • chromadb:8000 • minio:9000 • redis:6379 • DNS resolution |
| image-captioner-service | • pdf-extraction-service | • DNS resolution • HTTP/HTTPS (external vLLM vision model) |
| metadata-service | • pdf-processor-service | • chromadb:8000 • minio:9000 • redis:6379 • DNS resolution • HTTP/HTTPS (external vLLM text model) |
| cleaner | No ingress (background service) | • minio:9000 • chromadb:8000 • redis:6379 • DNS resolution |
| chromadb | • embedder-service • metadata-service • cleaner | • DNS resolution only (no outbound calls) |
| redis | • pdf-processor-service • pdf-extraction-service • docling-translation-service • embedder-service • pdf-renderer-service • metadata-service • cleaner | • DNS resolution only (no outbound calls) |
| minio | • pdf-processor-service • pdf-extraction-service • docling-translation-service • pdf-renderer-service • embedder-service • metadata-service • cleaner | • DNS resolution only (no outbound calls) |
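The egress column of the table above can be read as an allow-list: a call is permitted only if the destination appears in the caller's row. The sketch below encodes the in-cluster egress targets from the table (ports, DNS, and external vLLM endpoints are left out) and answers whether one service may call another. The function names `egress_targets` and `is_allowed` are illustrative; the actual enforcement lives in the NetworkPolicy resources, not in a script like this.

```shell
#!/bin/sh
# Sketch: the egress column of the policy table as a lookup.
# egress_targets and is_allowed are illustrative names.

egress_targets() (
  case "$1" in
    nginx) echo "istio-gateway" ;;
    istio-gateway) echo "frontend pdf-processor-service" ;;
    frontend) echo "pdf-processor-service" ;;
    pdf-processor-service) echo "pdf-extraction-service docling-translation-service pdf-renderer-service embedder-service metadata-service minio redis" ;;
    pdf-extraction-service) echo "image-captioner-service minio redis" ;;
    docling-translation-service|pdf-renderer-service) echo "minio redis" ;;
    embedder-service|metadata-service) echo "chromadb minio redis" ;;
    cleaner) echo "minio chromadb redis" ;;
    # These services make no in-cluster calls according to the table.
    image-captioner-service|chromadb|redis|minio) echo "" ;;
    *) return 1 ;;
  esac
)

is_allowed() (
  src="$1"
  dst="$2"
  # Unknown callers are denied, matching a default-deny posture.
  targets=$(egress_targets "$src") || return 1
  for t in $targets; do
    [ "$t" = "$dst" ] && return 0
  done
  return 1
)

report() {
  if is_allowed "$1" "$2"; then
    echo "$1 -> $2: allowed"
  else
    echo "$1 -> $2: denied"
  fi
}

report pdf-processor-service redis
report frontend redis
```

A lookup like this is handy for reviewing a proposed policy change: if a new call path is not in the caller's row, the table and the NetworkPolicy both need updating.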
| Environment | NetworkPolicy | Service Mesh | Description |
|---|---|---|---|
| Development | Disabled | None | Docker Compose - no network restrictions for local dev |
| Prestaging | Enabled | Own Istio | Zero-trust + mTLS within service mesh |
| Staging | Enabled | Org Istio | Zero-trust policies + organization's service mesh |
| Production | Enabled | Org Istio | Strict segmentation + organization's service mesh |
- Service Mesh Gateway: Istio Gateway handles external traffic in prestaging/staging/production
- API Gateway: nginx provides application-level routing (development) or internal routing (with Istio)
- Orchestration Hub: pdf-processor-service coordinates workflows across processing services
- Data Layer Security: Restricted access to chromadb (vectors), redis (sessions), and minio (files)
- mTLS Communication: Automatic mutual TLS between all services in service mesh environments
- Background Services: cleaner operates with minimal network permissions for cleanup tasks
- External Connectivity: Managed external vLLM/AI API access through ServiceEntry (Istio) or HTTPS egress
- 8 services with auto-scaling enabled across 3 tiers:
  - Tier 1 (Critical): nginx, pdf-processor-service - aggressive scaling (60-70% thresholds)
  - Tier 2 (Processing): pdf-extraction-service, docling-translation-service, pdf-renderer-service - standard scaling (70% thresholds)
  - Tier 3 (Burst): embedder-service, image-captioner-service, metadata-service - conservative scaling (70% thresholds)
- High availability: Minimum 1-2 replicas with scaling up to 5-15 replicas based on service tier
- Resource optimization: Proactive scaling for user-facing services, workload-responsive for processing services
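To see how a utilization threshold turns into a replica count, the sketch below applies the standard Kubernetes HPA formula, desired = ceil(current × usage / target), and clamps the result to a minimum and maximum. `desired_replicas` is an illustrative name, the replica bounds in the example calls are assumptions picked from the 1-2 and 5-15 ranges mentioned above, and the real controller also applies a tolerance band and stabilization windows that are not modeled here.

```shell
#!/bin/sh
# Sketch: HPA replica calculation, ceil(current * usage / target),
# clamped to [min, max]. desired_replicas is an illustrative name.

desired_replicas() (
  current="$1"   # replicas running now
  usage="$2"     # observed average utilization, percent
  target="$3"    # target utilization, percent (e.g. 60 or 70)
  min="$4"
  max="$5"
  # Integer ceiling division without floating point.
  desired=$(( (current * usage + target - 1) / target ))
  [ "$desired" -lt "$min" ] && desired="$min"
  [ "$desired" -gt "$max" ] && desired="$max"
  echo "$desired"
)

# Assumed bounds for illustration only: a Tier 1 service at a 60% target
# and a Tier 2 service at a 70% target.
echo "tier 1 example: $(desired_replicas 4 95 60 2 15) replicas"
echo "tier 2 example: $(desired_replicas 2 90 70 1 5) replicas"
```

A lower target makes the same load produce more replicas, which is why the 60% threshold behaves as the more aggressive setting.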
# Enable NetworkPolicy for production
helm upgrade pdf-extraction-service ./helm/pdf-extraction-service \
  --set networkPolicy.enabled=true \
  --namespace omnipdf
# Check service account permissions
kubectl auth can-i get secrets \
  --as=system:serviceaccount:omnipdf:pdf-extraction-service \
  -n omnipdf
# Monitor HPA status
kubectl get hpa -n omnipdf
OmniPDF uses Red Hat CodeReady Containers (CRC) for local OpenShift development. Due to the resource-intensive nature of running 8+ microservices, CRC requires significant CPU and memory allocation.
# Run the automated setup script
./config/crc/setup-crc.sh
# Start CRC with configured settings
crc start
# Set up oc environment
eval $(crc oc-env)
# Get login credentials and login
crc console --credentials
oc login -u kubeadmin -p <password> https://api.crc.testing:6443 --insecure-skip-tls-verify
Alternatively, configure CRC manually:
# Stop CRC if running
crc stop
# Configure CRC resources (adjust based on your system)
crc config set memory 32768 # 32GB RAM (adjust based on your system)
crc config set cpus 12 # 12 CPU cores (adjust based on your system)
crc config set disk-size 120 # 120GB disk (increased for ML workloads)
# Start CRC with new configuration
crc start
- Memory: 256GB recommended for running all microservices without constraints
- CPU: 32 cores provides abundant processing power for OpenShift + services
- Disk: 120GB recommended for container images, ML models, and persistent data
- Configuration saved: Current settings stored in config/crc/crc-config.txt
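Since `crc start` can fail or thrash when the host has less memory or fewer cores than the configured values, a quick preflight comparison may save a restart. The sketch below compares host resources against the values from the manual configuration (32768 MiB, 12 CPUs). `check_resources` is an illustrative name, and the host probe assumes a Linux machine with `/proc/meminfo` and `nproc`; on other systems the probe is simply skipped.

```shell
#!/bin/sh
# Sketch: compare host resources with the CRC settings before starting.
# check_resources is an illustrative name, not part of the project.

check_resources() (
  have_mem_mib="$1"
  have_cpus="$2"
  want_mem_mib="$3"
  want_cpus="$4"
  status=0
  if [ "$have_mem_mib" -lt "$want_mem_mib" ]; then
    echo "memory: host has $have_mem_mib MiB, CRC is set to $want_mem_mib MiB" >&2
    status=1
  fi
  if [ "$have_cpus" -lt "$want_cpus" ]; then
    echo "cpus: host has $have_cpus, CRC is set to $want_cpus" >&2
    status=1
  fi
  return "$status"
)

# Assumption: Linux host. MemTotal is reported in KiB, so divide by 1024.
if [ -r /proc/meminfo ] && command -v nproc >/dev/null 2>&1; then
  host_mem_mib=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
  host_cpus=$(nproc)
  if check_resources "$host_mem_mib" "$host_cpus" 32768 12; then
    echo "host can satisfy the manual CRC settings"
  else
    echo "host is below the manual CRC settings; lower them before crc start"
  fi
fi
```

The comparison only checks totals; it does not account for memory already in use by other workloads on the host.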
# Check CRC status
crc status
# Check node resources
oc describe node crc | grep -A 10 "Allocated resources"
# View current configuration
crc config view
# Run all service unit tests (180+ tests across 6 services)
./scripts/test-all-services.sh
# Run tests for individual service
./scripts/test-single-service.sh pdf-extraction-service
# Security scanning with Trivy
./scripts/scan_with_trivy.sh
# Lint all Helm charts
find helm -maxdepth 1 -type d ! -name 'assets' ! -name 'helm' -exec helm lint {} \;
This project uses a Makefile to simplify common Helm and Kubernetes operations.
To get started, run:
make help