@sigridjineth commented on Dec 4, 2025

Summary

This PR integrates XProvence (naver/xprovence-reranker-bgem3-v1), a zero-cost context pruning model for RAG. The model scores sentences by query relevance and removes irrelevant ones, returning both reranking scores and pruned_text (the pruned context).
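
Outside of TEI, the model's pruning interface looks roughly like the following. This is a minimal sketch assuming XProvence exposes a Provence-style process(query, context) method and that the output keys (reranking_score, pruned_context) match the related Provence model card; the exact signature may differ.

# Minimal sketch: calling XProvence directly through transformers (output keys assumed).
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "naver/xprovence-reranker-bgem3-v1", trust_remote_code=True
)

out = model.process(
    "What is deep learning?",
    "Deep learning uses neural networks. The weather is nice. I like pizza.",
)
print(out["reranking_score"], out["pruned_context"])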

Motivation

In RAG pipelines, retrieved documents often include distracting content that confuses LLMs and wastes tokens. XProvence mitigates this by:

  • Providing sentence-level relevance scoring
  • Pruning irrelevant sentences while preserving key content
  • Reducing token usage without sacrificing answer quality

Changes

Python Backend (backends/python/)

  • Add XProvenceModel class with process() for sentence-level pruning
  • Add pruned_text field to Score type
  • Make flash_attn imports optional for environments without flash attention
  • Handle bfloat16 → float32 conversion (XProvence's process() requires float32; see the sketch below)
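
A minimal, hypothetical sketch of that dtype guard (the helper name is illustrative, not the actual code in this PR):

import torch

def ensure_float32(model: torch.nn.Module) -> torch.nn.Module:
    # process() runs sentence selection in float32; cast bfloat16 weights up front.
    if next(model.parameters()).dtype == torch.bfloat16:
        model = model.float()
    return model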

Core (core/)

  • Pass raw_query and raw_text through the tokenization pipeline for pruning
  • Include pruned_text in inference results

Router (router/)

  • Detect XProvence architecture
  • Include pruned_text in HTTP rerank response

gRPC (backends/grpc-client/, backends/proto/)

  • Add pruned_text field to protobuf definitions
  • Update gRPC client to handle pruned text

Files Changed

  • backends/python/.../xprovence_model.py: New XProvence model implementation
  • backends/python/.../models/__init__.py: Model detection and optional flash_attn import
  • backends/python/.../models/types.py: Add pruned_text to Score
  • backends/proto/embed.proto: Add pruned_text to protobuf
  • core/src/tokenization.rs: Pass raw text for pruning
  • core/src/infer.rs: Handle pruned_text in results
  • core/src/queue.rs: Store raw text in queue entries
  • router/src/http/types.rs: Add pruned_text to response type
  • router/src/http/server.rs: Include pruned_text in rerank response

Configuration

  • XPROVENCE_THRESHOLD: Pruning threshold 0.0–1.0 (default: 0.3)
    • Lower = more conservative (keeps more sentences)
    • Higher = more aggressive (removes more sentences)
  • XPROVENCE_ALWAYS_SELECT_TITLE: Keep first sentence as title (default: true)

Usage

XPROVENCE_THRESHOLD=0.3 \
XPROVENCE_ALWAYS_SELECT_TITLE=true \
text-embeddings-router --model-id naver/xprovence-reranker-bgem3-v1 --port 8080

API Example

Request

curl http://localhost:8080/rerank -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "What is deep learning?",
    "texts": [
      "Deep learning uses neural networks. The weather is nice. I like pizza."
    ],
    "return_text": true
  }'

Response

[
  {
    "index": 0,
    "text": "Deep learning uses neural networks. The weather is nice. I like pizza.",
    "score": 0.9997,
    "pruned_text": "Deep learning uses neural networks."
  }
]

Test Plan

  • Server starts successfully with the XProvence model
  • Rerank endpoint returns correct scores
  • pruned_text contains only relevant sentences
  • Irrelevant sentences are removed
  • Works with Korean/multilingual text
  • Graceful fallback when pruning fails

@sigridjineth force-pushed the provenance branch 3 times, most recently from 5631b2e to 89441fe on December 5, 2025 at 10:22

Add XProvence model integration for zero-cost context pruning in reranking.
XProvence removes irrelevant sentences from passages based on query relevance,
returning both reranking scores and pruned context.

Changes:
- Add XProvenceModel class with process() method for sentence-level pruning
- Add pruned_text field to Score type and HTTP response
- Pass raw_query/raw_text through tokenization pipeline for pruning
- Make flash_attn imports optional for XProvence compatibility
- Add XProvence architecture detection in router and Python backend
- Handle bfloat16 to float32 conversion for XProvence process() method

Configuration:
- XPROVENCE_THRESHOLD: Pruning threshold 0.0-1.0 (default: 0.3)
- XPROVENCE_ALWAYS_SELECT_TITLE: Keep first sentence as title (default: true)

Usage:
  XPROVENCE_THRESHOLD=0.3 text-embeddings-router \
    --model-id naver/xprovence-reranker-bgem3-v1 --port 8080

@sigridjineth changed the title from "feat: xprovenance" to "feat: Add XProvence Context Pruning Support" on Dec 5, 2025
@sigridjineth changed the title from "feat: Add XProvence Context Pruning Support" to "Add Support for XProvence Sentence-Level Context Pruning (naver/xprovence-reranker-bgem3-v1)" on Dec 5, 2025
@sigridjineth marked this pull request as ready for review on December 5, 2025 at 10:32

Sigrid Jin and others added 15 commits on December 6, 2025

Add XProvence model integration for zero-cost context pruning in reranking.
XProvence removes irrelevant sentences from passages based on query relevance,
returning both reranking scores and pruned context.

Changes:
- Add XProvenceModel class with process() method for sentence-level pruning
- Add pruned_text field to Score/Prediction types and HTTP response
- Pass raw_query/raw_text through tokenization pipeline for pruning
- Make flash_attn imports optional for XProvence compatibility
- Add XProvence architecture detection in router and Python backend
- Handle bfloat16 to float32 conversion for XProvence process() method
- Update candle, ort backends to support Prediction with pruned_text
- Add Dockerfile-cuda-python for Python backend with CUDA support

Configuration:
- XPROVENCE_THRESHOLD: Pruning threshold 0.0-1.0 (default: 0.3)
- XPROVENCE_ALWAYS_SELECT_TITLE: Keep first sentence as title (default: true)

Usage:
  XPROVENCE_THRESHOLD=0.3 text-embeddings-router \
    --model-id naver/xprovence-reranker-bgem3-v1 --port 8080

Docker build:
  docker build -f Dockerfile-cuda-python -t tei-python-cuda .

The previous fix (7ff382c) incorrectly passed config from AutoConfig.from_pretrained
to AutoModel.from_pretrained. Since XProvence's config.json lacks auto_map for
AutoConfig, it returned XLMRobertaConfig while the model expected XProvenceConfig.

New approach (sketched below):
- Extract model_id from cache path (e.g., naver/xprovence-reranker-bgem3-v1)
- Use model_id directly with AutoModel.from_pretrained(model_id, trust_remote_code=True)
- Let AutoModel handle config internally via model class's config_class attribute
- Remove explicit config passing and snapshot_download (AutoModel handles downloads)
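
A rough sketch of the loading path described above (the cache-path parsing, the example path, and the models--<org>--<name> layout are assumptions about the Hugging Face hub cache, not code copied from this PR):

from pathlib import Path
from transformers import AutoModel

def model_id_from_cache_path(model_path: str) -> str:
    # Hub cache directories look like .../models--naver--xprovence-reranker-bgem3-v1/snapshots/<rev>
    for part in Path(model_path).parts:
        if part.startswith("models--"):
            return part[len("models--"):].replace("--", "/")
    return model_path  # already a plain model id or a local directory

model_id = model_id_from_cache_path(
    "/data/models--naver--xprovence-reranker-bgem3-v1/snapshots/abc123"
)
# Let AutoModel resolve the custom config via the model class's own config_class.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)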
The previous fix still failed because __init__.py called AutoConfig.from_pretrained
before XProvenceModel was created. This polluted transformers' internal config
registry with XLMRobertaConfig, causing conflicts when XProvenceModel tried to
load the custom XProvenceConfig.

Solution (sketched below):
- Add _is_xprovence_model() helper that reads config.json directly
- Check for XProvence BEFORE calling AutoConfig.from_pretrained
- This prevents transformers from caching the wrong config class
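
A sketch of what such a helper might look like (checking the architectures field is an assumption; the real helper may inspect auto_map or other keys):

import json
from pathlib import Path

def _is_xprovence_model(model_path: str) -> bool:
    # Read config.json directly so AutoConfig is never called for XProvence models,
    # keeping transformers' config registry free of the wrong XLMRobertaConfig entry.
    config_file = Path(model_path) / "config.json"
    if not config_file.exists():
        return False
    config = json.loads(config_file.read_text())
    return any("XProvence" in arch for arch in config.get("architectures", []))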
AutoModel.from_pretrained internally calls AutoConfig which returns
XLMRobertaConfig, causing a conflict with the model's XProvenceConfig.

Solution: Use transformers.dynamic_module_utils.get_class_from_dynamic_module
to directly import the custom XProvenceForSequenceClassification class,
then call from_pretrained on the custom class which uses its own config_class.
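
A sketch of that direct import (assuming a recent transformers version and that the repo's remote code lives in a module such as modeling_xprovence.py; the module name is a guess):

from transformers.dynamic_module_utils import get_class_from_dynamic_module

model_id = "naver/xprovence-reranker-bgem3-v1"
# Resolve the custom class from the repo's remote code instead of going through AutoModel.
XProvenceForSequenceClassification = get_class_from_dynamic_module(
    "modeling_xprovence.XProvenceForSequenceClassification",
    model_id,
)
# from_pretrained on the custom class picks up its own config_class (XProvenceConfig).
model = XProvenceForSequenceClassification.from_pretrained(model_id)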
Previously, only the first raw_query/raw_text was sent to Python backend,
so process() was only called when batch_size == 1. Now all pairs are sent.

Changes:
- embed.proto: change to repeated string raw_queries/raw_texts
- grpc-client: accept Vec<String> instead of Option<String>
- backends/python/src/lib.rs: send all raw_queries/texts from batch
- types.py: extract lists from proto repeated fields
- xprovence_model.py: iterate batch and call process() for each pair

Now /rerank with multiple texts returns pruned_text for each result.

- Add broadcasting support: 1 query → N texts (common reranking pattern); see the sketch below
- Replace silent fallback with explicit warning on dimension mismatch
- Use torch.inference_mode() around entire batch for better performance
- Reduce per-item overhead by batching dtype handling and TQDM_DISABLE
- Add per-item error handling with graceful fallback to 0.0 score

Performance improvements:
- Single dtype context switch instead of per-item
- Single inference_mode context for entire batch
- Reduced logging overhead with debug level for per-item details
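
A condensed sketch of the resulting batch loop (names are illustrative, and the threshold keyword on process() is assumed from the Provence API; the actual backend code is structured differently):

import logging
import torch

logger = logging.getLogger(__name__)

def score_batch(model, queries, texts, threshold=0.3):
    # Broadcast a single query across N texts (the common reranking pattern).
    if len(queries) == 1 and len(texts) > 1:
        queries = queries * len(texts)
    elif len(queries) != len(texts):
        logger.warning("query/text count mismatch: %d vs %d", len(queries), len(texts))

    results = []
    # One inference_mode context for the whole batch instead of per item.
    with torch.inference_mode():
        for query, text in zip(queries, texts):
            try:
                out = model.process(query, text, threshold=threshold)
                results.append((out["reranking_score"], out["pruned_context"]))
            except Exception:
                # Graceful fallback: keep the original text with a 0.0 score.
                logger.debug("pruning failed for one pair", exc_info=True)
                results.append((0.0, text))
    return results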