Add Support for XProvence Sentence-Level Context Pruning (naver/xprovence-reranker-bgem3-v1) #770
Open
sigridjineth wants to merge 16 commits into huggingface:main from sigridjineth:provenance
Conversation
Add XProvence model integration for zero-cost context pruning in reranking.
XProvence removes irrelevant sentences from passages based on query relevance,
returning both reranking scores and pruned context.
Changes:
- Add XProvenceModel class with process() method for sentence-level pruning
- Add pruned_text field to Score/Prediction types and HTTP response
- Pass raw_query/raw_text through tokenization pipeline for pruning
- Make flash_attn imports optional for XProvence compatibility
- Add XProvence architecture detection in router and Python backend
- Handle bfloat16 to float32 conversion for XProvence process() method
- Update candle, ort backends to support Prediction with pruned_text
- Add Dockerfile-cuda-python for Python backend with CUDA support
Configuration:
- XPROVENCE_THRESHOLD: Pruning threshold 0.0-1.0 (default: 0.3)
- XPROVENCE_ALWAYS_SELECT_TITLE: Keep first sentence as title (default: true)
Usage:
XPROVENCE_THRESHOLD=0.3 text-embeddings-router \
--model-id naver/xprovence-reranker-bgem3-v1 --port 8080
Docker build:
docker build -f Dockerfile-cuda-python -t tei-python-cuda .
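A possible way to run the resulting image, assuming it keeps text-embeddings-router as its entrypoint like TEI's other CUDA images (the flags below are illustrative):

docker run --gpus all -p 8080:80 -e XPROVENCE_THRESHOLD=0.3 tei-python-cuda \
    --model-id naver/xprovence-reranker-bgem3-v1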
The previous fix (7ff382c) incorrectly passed the config from AutoConfig.from_pretrained to AutoModel.from_pretrained. Since XProvence's config.json lacks an auto_map entry for AutoConfig, it returned XLMRobertaConfig while the model expected XProvenceConfig.
New approach:
- Extract model_id from the cache path (e.g., naver/xprovence-reranker-bgem3-v1)
- Use model_id directly with AutoModel.from_pretrained(model_id, trust_remote_code=True)
- Let AutoModel handle the config internally via the model class's config_class attribute
- Remove explicit config passing and snapshot_download (AutoModel handles downloads)
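A minimal sketch of this loading path; the cache-directory parsing is illustrative, not the PR's exact code:

from transformers import AutoModel

def load_xprovence(cache_path: str):
    # HF hub cache dirs are named like "models--naver--xprovence-reranker-bgem3-v1";
    # recover the hub id and let AutoModel handle config and downloads itself.
    name = cache_path.rstrip("/").split("/")[-1]
    model_id = name.removeprefix("models--").replace("--", "/")
    return AutoModel.from_pretrained(model_id, trust_remote_code=True)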
The previous fix still failed because __init__.py called AutoConfig.from_pretrained before XProvenceModel was created. This polluted transformers' internal config registry with XLMRobertaConfig, causing conflicts when XProvenceModel tried to load the custom XProvenceConfig.
Solution:
- Add an _is_xprovence_model() helper that reads config.json directly
- Check for XProvence BEFORE calling AutoConfig.from_pretrained
- This prevents transformers from caching the wrong config class
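A sketch of the helper described above; the detection rule (an architecture name containing "XProvence") is an assumption:

import json
from pathlib import Path

def _is_xprovence_model(model_path: str) -> bool:
    # Read config.json directly, without AutoConfig, so transformers'
    # config registry is never populated with the wrong class.
    config_file = Path(model_path) / "config.json"
    if not config_file.exists():
        return False
    config = json.loads(config_file.read_text())
    return any("XProvence" in arch for arch in config.get("architectures", []))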
AutoModel.from_pretrained internally calls AutoConfig, which returns XLMRobertaConfig, conflicting with the model's XProvenceConfig.
Solution: use transformers.dynamic_module_utils.get_class_from_dynamic_module to import the custom XProvenceForSequenceClassification class directly, then call from_pretrained on that class, which uses its own config_class.
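A sketch of that approach; the "module.class" reference string is an assumption (the repo's auto_map entry names the real pair):

from transformers.dynamic_module_utils import get_class_from_dynamic_module

model_id = "naver/xprovence-reranker-bgem3-v1"
cls = get_class_from_dynamic_module(
    "modeling_xprovence.XProvenceForSequenceClassification", model_id
)
model = cls.from_pretrained(model_id)  # uses cls.config_class, not AutoConfig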
Previously, only the first raw_query/raw_text pair was sent to the Python backend, so process() was only called when batch_size == 1. Now all pairs are sent.
Changes:
- embed.proto: change to repeated string raw_queries/raw_texts
- grpc-client: accept Vec<String> instead of Option<String>
- backends/python/src/lib.rs: send all raw_queries/raw_texts from the batch
- types.py: extract lists from the proto repeated fields
- xprovence_model.py: iterate over the batch and call process() for each pair
Now /rerank with multiple texts returns pruned_text for each result.
- Add broadcasting support: 1 query → N texts (common reranking pattern)
- Replace the silent fallback with an explicit warning on dimension mismatch
- Use torch.inference_mode() around the entire batch for better performance
- Reduce per-item overhead by batching dtype handling and TQDM_DISABLE
- Add per-item error handling with a graceful fallback to a 0.0 score
Performance improvements:
- Single dtype context switch instead of per-item
- Single inference_mode context for the entire batch
- Reduced logging overhead (debug level for per-item details)
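A sketch of the batching behavior described above; the process() output keys follow the Provence model family and are assumptions:

import logging
import torch

logger = logging.getLogger(__name__)

def prune_batch(model, raw_queries, raw_texts, threshold=0.3):
    # Broadcast one query across N texts (the common reranking pattern).
    if len(raw_queries) == 1 and len(raw_texts) > 1:
        raw_queries = raw_queries * len(raw_texts)
    if len(raw_queries) != len(raw_texts):
        logger.warning("query/text count mismatch: %d vs %d",
                       len(raw_queries), len(raw_texts))
    results = []
    with torch.inference_mode():  # one context for the whole batch
        for query, text in zip(raw_queries, raw_texts):
            try:
                out = model.process(query, text, threshold=threshold)
                results.append((out["reranking_score"], out["pruned_context"]))
            except Exception:
                logger.debug("pruning failed for one pair", exc_info=True)
                results.append((0.0, text))  # graceful per-item fallback
    return results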
Summary
This PR integrates XProvence (naver/xprovence-reranker-bgem3-v1), a zero-cost context pruning model for RAG. The model scores sentences by query relevance and removes irrelevant ones, returning both reranking scores and pruned_text (the pruned context).
Motivation
In RAG pipelines, retrieved documents often include distracting content that confuses LLMs and wastes tokens. XProvence mitigates this by scoring each sentence against the query and removing irrelevant ones during reranking, at no additional inference cost.
Changes
Python Backend (backends/python/)
- XProvenceModel class with process() for sentence-level pruning
- pruned_text field added to the Score type
- flash_attn imports made optional for environments without flash attention
- bfloat16 → float32 conversion, since XProvence's process() requires float32 (see the sketch below)
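A minimal sketch of that dtype handling; the function name and call shape are illustrative:

import torch

def call_process_float32(model, query, text, threshold=0.3):
    # XProvence's process() requires float32; if the model was loaded in
    # bfloat16 for serving, cast once around the call and restore afterward.
    orig_dtype = next(model.parameters()).dtype
    if orig_dtype == torch.bfloat16:
        model.float()
    try:
        return model.process(query, text, threshold=threshold)
    finally:
        if orig_dtype == torch.bfloat16:
            model.to(orig_dtype)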
Core (core/)
- Pass raw_query and raw_text through the tokenization pipeline for pruning
- Handle pruned_text in inference results
Router (router/)
- Include pruned_text in the HTTP rerank response
gRPC (backends/grpc-client/, backends/proto/)
- Add pruned_text field to the protobuf definitions
Files Changed
- backends/python/.../xprovence_model.py: new XProvence model implementation
- backends/python/.../models/__init__.py: model detection and optional flash_attn import
- backends/python/.../models/types.py: add pruned_text to Score
- backends/proto/embed.proto: add pruned_text to protobuf
- core/src/tokenization.rs: pass raw text for pruning
- core/src/infer.rs: handle pruned_text in results
- core/src/queue.rs: store raw text in queue entries
- router/src/http/types.rs: add pruned_text to response type
- router/src/http/server.rs: include pruned_text in rerank response
Configuration
- XPROVENCE_THRESHOLD: pruning threshold 0.0–1.0 (default: 0.3)
- XPROVENCE_ALWAYS_SELECT_TITLE: keep first sentence as title (default: true)
Usage
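Start the router with the pruning threshold set in the environment:

XPROVENCE_THRESHOLD=0.3 text-embeddings-router \
    --model-id naver/xprovence-reranker-bgem3-v1 --port 8080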
API Example
Request
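An illustrative request (the query string is an assumption; return_text asks the server to echo each input text back alongside its score):

curl 127.0.0.1:8080/rerank \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"query": "What is deep learning?", "texts": ["Deep learning uses neural networks. The weather is nice. I like pizza."], "return_text": true}'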
Response
[ { "index": 0, "text": "Deep learning uses neural networks. The weather is nice. I like pizza.", "score": 0.9997, "pruned_text": "Deep learning uses neural networks." } ]Test Plan
- pruned_text contains only relevant sentences
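A minimal end-to-end check, assuming the server and request shape from the API example above:

import requests

resp = requests.post(
    "http://127.0.0.1:8080/rerank",
    json={
        "query": "What is deep learning?",
        "texts": ["Deep learning uses neural networks. The weather is nice. I like pizza."],
        "return_text": True,
    },
)
result = resp.json()[0]
# Irrelevant sentences should be pruned away.
assert result["pruned_text"] == "Deep learning uses neural networks."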