@sigridjineth commented on Dec 4, 2025

Summary

This PR integrates XProvence (naver/xprovence-reranker-bgem3-v1), a zero-cost context pruning model for RAG. The model scores sentences by query relevance and removes irrelevant ones, returning both reranking scores and pruned_text (the pruned context).
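
Outside of TEI, the model's pruning interface looks roughly like the following. This is a minimal sketch assuming XProvence exposes a Provence-style process(query, context) method and that the output keys (reranking_score, pruned_context) match the related Provence model card; the exact signature may differ.

# Minimal sketch: calling XProvence directly through transformers (output keys assumed).
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "naver/xprovence-reranker-bgem3-v1", trust_remote_code=True
)

out = model.process(
    "What is deep learning?",
    "Deep learning uses neural networks. The weather is nice. I like pizza.",
)
print(out["reranking_score"], out["pruned_context"])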

Motivation

In RAG pipelines, retrieved documents often include distracting content that confuses LLMs and wastes tokens. XProvence mitigates this by:

  • Providing sentence-level relevance scoring
  • Pruning irrelevant sentences while preserving key content
  • Reducing token usage without sacrificing answer quality

Changes

Python Backend (backends/python/)

  • Add XProvenceModel class with process() for sentence-level pruning
  • Add pruned_text field to Score type
  • Make flash_attn imports optional for environments without flash attention
  • Handle bfloat16 → float32 conversion (XProvence's process() requires float32; see the sketch below)
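
A minimal, hypothetical sketch of that dtype guard (the helper name is illustrative, not the actual code in this PR):

import torch

def ensure_float32(model: torch.nn.Module) -> torch.nn.Module:
    # process() runs sentence selection in float32; cast bfloat16 weights up front.
    if next(model.parameters()).dtype == torch.bfloat16:
        model = model.float()
    return model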

Core (core/)

  • Pass raw_query and raw_text through the tokenization pipeline for pruning
  • Include pruned_text in inference results

Router (router/)

  • Detect XProvence architecture
  • Include pruned_text in HTTP rerank response

gRPC (backends/grpc-client/, backends/proto/)

  • Add pruned_text field to protobuf definitions
  • Update gRPC client to handle pruned text

Files Changed

  • backends/python/.../xprovence_model.py: New XProvence model implementation
  • backends/python/.../models/__init__.py: Model detection and optional flash_attn import
  • backends/python/.../models/types.py: Add pruned_text to Score
  • backends/proto/embed.proto: Add pruned_text to protobuf
  • core/src/tokenization.rs: Pass raw text for pruning
  • core/src/infer.rs: Handle pruned_text in results
  • core/src/queue.rs: Store raw text in queue entries
  • router/src/http/types.rs: Add pruned_text to response type
  • router/src/http/server.rs: Include pruned_text in rerank response

Configuration

  • XPROVENCE_THRESHOLD: Pruning threshold 0.0–1.0 (default: 0.3)
    • Lower = more conservative (keeps more sentences)
    • Higher = more aggressive (removes more sentences)
  • XPROVENCE_ALWAYS_SELECT_TITLE: Keep first sentence as title (default: true)

Usage

XPROVENCE_THRESHOLD=0.3 \
XPROVENCE_ALWAYS_SELECT_TITLE=true \
text-embeddings-router --model-id naver/xprovence-reranker-bgem3-v1 --port 8080

API Example

Request

curl http://localhost:8080/rerank -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "What is deep learning?",
    "texts": [
      "Deep learning uses neural networks. The weather is nice. I like pizza."
    ],
    "return_text": true
  }'

Response

[
  {
    "index": 0,
    "text": "Deep learning uses neural networks. The weather is nice. I like pizza.",
    "score": 0.9997,
    "pruned_text": "Deep learning uses neural networks."
  }
]

Test Plan

  • Server starts successfully with the XProvence model
  • Rerank endpoint returns correct scores
  • pruned_text contains only relevant sentences
  • Irrelevant sentences are removed
  • Works with Korean/multilingual text
  • Graceful fallback when pruning fails

@sigridjineth force-pushed the provenance branch 3 times, most recently from 5631b2e to 89441fe on December 5, 2025 at 10:22

Add XProvence model integration for zero-cost context pruning in reranking.
XProvence removes irrelevant sentences from passages based on query relevance,
returning both reranking scores and pruned context.

Changes:
- Add XProvenceModel class with process() method for sentence-level pruning
- Add pruned_text field to Score type and HTTP response
- Pass raw_query/raw_text through tokenization pipeline for pruning
- Make flash_attn imports optional for XProvence compatibility
- Add XProvence architecture detection in router and Python backend
- Handle bfloat16 to float32 conversion for XProvence process() method

Configuration:
- XPROVENCE_THRESHOLD: Pruning threshold 0.0-1.0 (default: 0.3)
- XPROVENCE_ALWAYS_SELECT_TITLE: Keep first sentence as title (default: true)

Usage:
  XPROVENCE_THRESHOLD=0.3 text-embeddings-router \
    --model-id naver/xprovence-reranker-bgem3-v1 --port 8080

@sigridjineth changed the title from "feat: xprovenance" to "feat: Add XProvence Context Pruning Support" on Dec 5, 2025
@sigridjineth changed the title from "feat: Add XProvence Context Pruning Support" to "Add Support for XProvence Sentence-Level Context Pruning (naver/xprovence-reranker-bgem3-v1)" on Dec 5, 2025
@sigridjineth marked this pull request as ready for review on December 5, 2025 at 10:32

Sigrid Jin and others added 15 commits on December 6, 2025

Add XProvence model integration for zero-cost context pruning in reranking.
XProvence removes irrelevant sentences from passages based on query relevance,
returning both reranking scores and pruned context.

Changes:
- Add XProvenceModel class with process() method for sentence-level pruning
- Add pruned_text field to Score/Prediction types and HTTP response
- Pass raw_query/raw_text through tokenization pipeline for pruning
- Make flash_attn imports optional for XProvence compatibility
- Add XProvence architecture detection in router and Python backend
- Handle bfloat16 to float32 conversion for XProvence process() method
- Update candle, ort backends to support Prediction with pruned_text
- Add Dockerfile-cuda-python for Python backend with CUDA support

Configuration:
- XPROVENCE_THRESHOLD: Pruning threshold 0.0-1.0 (default: 0.3)
- XPROVENCE_ALWAYS_SELECT_TITLE: Keep first sentence as title (default: true)

Usage:
  XPROVENCE_THRESHOLD=0.3 text-embeddings-router \
    --model-id naver/xprovence-reranker-bgem3-v1 --port 8080

Docker build:
  docker build -f Dockerfile-cuda-python -t tei-python-cuda .

The previous fix (7ff382c) incorrectly passed config from AutoConfig.from_pretrained
to AutoModel.from_pretrained. Since XProvence's config.json lacks auto_map for
AutoConfig, it returned XLMRobertaConfig while the model expected XProvenceConfig.

New approach (sketched below):
- Extract model_id from cache path (e.g., naver/xprovence-reranker-bgem3-v1)
- Use model_id directly with AutoModel.from_pretrained(model_id, trust_remote_code=True)
- Let AutoModel handle config internally via model class's config_class attribute
- Remove explicit config passing and snapshot_download (AutoModel handles downloads)
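
A rough sketch of the loading path described above (the cache-path parsing, the example path, and the models--<org>--<name> layout are assumptions about the Hugging Face hub cache, not code copied from this PR):

from pathlib import Path
from transformers import AutoModel

def model_id_from_cache_path(model_path: str) -> str:
    # Hub cache directories look like .../models--naver--xprovence-reranker-bgem3-v1/snapshots/<rev>
    for part in Path(model_path).parts:
        if part.startswith("models--"):
            return part[len("models--"):].replace("--", "/")
    return model_path  # already a plain model id or a local directory

model_id = model_id_from_cache_path(
    "/data/models--naver--xprovence-reranker-bgem3-v1/snapshots/abc123"
)
# Let AutoModel resolve the custom config via the model class's own config_class.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)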
The previous fix still failed because __init__.py called AutoConfig.from_pretrained
before XProvenceModel was created. This polluted transformers' internal config
registry with XLMRobertaConfig, causing conflicts when XProvenceModel tried to
load the custom XProvenceConfig.

Solution (sketched below):
- Add _is_xprovence_model() helper that reads config.json directly
- Check for XProvence BEFORE calling AutoConfig.from_pretrained
- This prevents transformers from caching the wrong config class
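
A sketch of what such a helper might look like (checking the architectures field is an assumption; the real helper may inspect auto_map or other keys):

import json
from pathlib import Path

def _is_xprovence_model(model_path: str) -> bool:
    # Read config.json directly so AutoConfig is never called for XProvence models,
    # keeping transformers' config registry free of the wrong XLMRobertaConfig entry.
    config_file = Path(model_path) / "config.json"
    if not config_file.exists():
        return False
    config = json.loads(config_file.read_text())
    return any("XProvence" in arch for arch in config.get("architectures", []))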
AutoModel.from_pretrained internally calls AutoConfig which returns
XLMRobertaConfig, causing a conflict with the model's XProvenceConfig.

Solution: Use transformers.dynamic_module_utils.get_class_from_dynamic_module
to directly import the custom XProvenceForSequenceClassification class,
then call from_pretrained on the custom class which uses its own config_class.
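
A sketch of that direct import (assuming a recent transformers version and that the repo's remote code lives in a module such as modeling_xprovence.py; the module name is a guess):

from transformers.dynamic_module_utils import get_class_from_dynamic_module

model_id = "naver/xprovence-reranker-bgem3-v1"
# Resolve the custom class from the repo's remote code instead of going through AutoModel.
XProvenceForSequenceClassification = get_class_from_dynamic_module(
    "modeling_xprovence.XProvenceForSequenceClassification",
    model_id,
)
# from_pretrained on the custom class picks up its own config_class (XProvenceConfig).
model = XProvenceForSequenceClassification.from_pretrained(model_id)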
Previously, only the first raw_query/raw_text was sent to Python backend,
so process() was only called when batch_size == 1. Now all pairs are sent.

Changes:
- embed.proto: change to repeated string raw_queries/raw_texts
- grpc-client: accept Vec<String> instead of Option<String>
- backends/python/src/lib.rs: send all raw_queries/texts from batch
- types.py: extract lists from proto repeated fields
- xprovence_model.py: iterate batch and call process() for each pair

Now /rerank with multiple texts returns pruned_text for each result.

- Add broadcasting support: 1 query → N texts (common reranking pattern); see the sketch below
- Replace silent fallback with explicit warning on dimension mismatch
- Use torch.inference_mode() around entire batch for better performance
- Reduce per-item overhead by batching dtype handling and TQDM_DISABLE
- Add per-item error handling with graceful fallback to 0.0 score

Performance improvements:
- Single dtype context switch instead of per-item
- Single inference_mode context for entire batch
- Reduced logging overhead with debug level for per-item details
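
A condensed sketch of the resulting batch loop (names are illustrative, and the threshold keyword on process() is assumed from the Provence API; the actual backend code is structured differently):

import logging
import torch

logger = logging.getLogger(__name__)

def score_batch(model, queries, texts, threshold=0.3):
    # Broadcast a single query across N texts (the common reranking pattern).
    if len(queries) == 1 and len(texts) > 1:
        queries = queries * len(texts)
    elif len(queries) != len(texts):
        logger.warning("query/text count mismatch: %d vs %d", len(queries), len(texts))

    results = []
    # One inference_mode context for the whole batch instead of per item.
    with torch.inference_mode():
        for query, text in zip(queries, texts):
            try:
                out = model.process(query, text, threshold=threshold)
                results.append((out["reranking_score"], out["pruned_context"]))
            except Exception:
                # Graceful fallback: keep the original text with a 0.0 score.
                logger.debug("pruning failed for one pair", exc_info=True)
                results.append((0.0, text))
    return results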