Feature/tvd mi metric #1080
base: main
Conversation
Review suggestion on the line `def process_judge_response_tvdmi(response: str) -> int:`:

```diff
-def process_judge_response_tvdmi(response: str) -> int:
+def process_judge_response_tvdmi(response: str | None) -> int:
```
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
hey @zrobertson466920 thanks for the PR!! Do you have an example of a benchmark that would use this / a use case? Also, in an effort of standardisation we are trying to have
@NathanHB Thanks a lot for the review and the suggestions!
Yes, the main use case is label-free, information-oriented evaluation on paired responses, especially when we have multiple model outputs per item but no ground-truth reference. At a high level, TVD-MI provides a symmetry-aware way to measure the reliability and self-consistency of the responses with respect to a judge: if responses contain stable, item-specific information, the judge should be able to reliably tell whether two responses come from the same underlying item. Concretely, the typical pattern is to pair responses both within the same item and across different items, ask a judge whether each pair comes from the same task/source, and aggregate how well it can tell the two cases apart.

This is the setting used in Let’s Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes, where TVD-MI is used both to reason about agent information content and to evaluate judges themselves (e.g. robustness to gaming attacks). The intent here is to make this reliability style of evaluation easy to plug into lighteval benchmarks that already have multiple responses per item. If it would be helpful, I’m happy to follow up with a small example benchmark config (e.g. a synthetic or toy dataset) that uses `tvd_mi`.
That makes sense, and I’ve now added one to this PR and updated the description. I defined this in
Looking great 😎
Add `tvd_mi` metric (LLM-as-a-judge, corpus-level)

Summary

Introduce a new corpus-level metric `tvd_mi` into lighteval. This metric implements the TVD-MI approach from the paper Let’s Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes. It estimates a lower bound on the total variation mutual information between model responses by using paired responses and an LLM critic.

Implementation
Sample-level judge

- `JudgeLLMTVDMI` (subclass of `JudgeLLM`), configured with `gpt-4o-2024-08-06` via the `openai` backend.
- `get_judge_prompt_tvdmi(response_a, response_b, ...)`, which asks the judge to distinguish "A: SAME TASK/SOURCE" from "B: DIFFERENT TASK/SOURCE".
- `process_judge_response_tvdmi(...)`, which maps judge responses to binary predictions: `A` → `1`, `B` → `0`; case/whitespace normalized; unknown responses fall back to `0` with a warning. A sketch of this parsing follows below.
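For illustration, here is a minimal sketch of the prompt construction and response parsing described above, assuming the `response: str | None` signature suggested in review. The exact judge prompt wording and matching rules in the actual PR may differ.

```python
import logging

logger = logging.getLogger(__name__)


def get_judge_prompt_tvdmi(response_a: str, response_b: str, **kwargs) -> list[dict]:
    """Build a single-turn judge prompt asking whether two responses share a task/source."""
    # Prompt wording is illustrative; the PR's prompt may be phrased differently.
    content = (
        "You are given two responses.\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Do they come from the same underlying task/source?\n"
        "Answer with exactly one letter: A (SAME TASK/SOURCE) or B (DIFFERENT TASK/SOURCE)."
    )
    return [{"role": "user", "content": content}]


def process_judge_response_tvdmi(response: str | None) -> int:
    """Map the judge's answer to a binary prediction: A -> 1, B -> 0, unknown -> 0."""
    normalized = (response or "").strip().upper()
    if normalized.startswith("A"):
        return 1
    if normalized.startswith("B"):
        return 0
    logger.warning("Unrecognized judge response %r, falling back to 0", response)
    return 0
```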
Corpus-level aggregation

- `CorpusLevelTVDMI`, which accepts sample dicts of the form `{"label": 0 or 1, "pred": 0 or 1, …}`.
- Aggregates them into `TPR + TNR - 1`, where TPR = P(pred=1 | label=1) and TNR = P(pred=0 | label=0).
- Degenerate inputs (e.g. only one label class present) return `NaN`. A computation sketch follows below.
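To make the aggregation concrete, here is a standalone illustration of the `TPR + TNR - 1` computation. The helper name `tvd_mi_lower_bound` is hypothetical; the PR's `CorpusLevelTVDMI` class may structure this differently.

```python
import math


def tvd_mi_lower_bound(items: list[dict]) -> float:
    """Estimate TPR + TNR - 1 from {"label": 0/1, "pred": 0/1} sample dicts."""
    positives = [it for it in items if it["label"] == 1]
    negatives = [it for it in items if it["label"] == 0]
    if not positives or not negatives:
        # Without both pair types the bound is undefined.
        return math.nan
    tpr = sum(it["pred"] == 1 for it in positives) / len(positives)
    tnr = sum(it["pred"] == 0 for it in negatives) / len(negatives)
    return tpr + tnr - 1


# A judge that always separates same-item from different-item pairs scores 1.0;
# a judge that guesses at random scores around 0.0.
pairs = [{"label": 1, "pred": 1}, {"label": 0, "pred": 0}]
print(tvd_mi_lower_bound(pairs))  # 1.0
```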
Metric registration

Registered in the `Metrics` enum with:

- `metric_name = "tvd_mi"`
- `sample_level_fn = JudgeLLMTVDMI()`
- `corpus_level_fn = CorpusLevelTVDMI()`
- `category = SamplingMethod.GENERATIVE`
- `higher_is_better = True`
Inspect-AI compatible scorer

To make this usable in Inspect-AI workflows, this PR also adds an Inspect-compatible scorer:

- Expects the model completion to be a single-letter answer (`"A"` or `"B"`).
- Reuses the same response parsing (`process_judge_response_tvdmi`) to map the completion into a binary label `0`/`1`.
- Compares against `target.text` (e.g. `"A"`, `"B"`, `"same"`, `"different"`, `"1"`, `"0"`).
- Returns `Score(value="C" | "I")` so Inspect can aggregate with `accuracy()` and `stderr()`.

This gives a consistent TVD-MI-style classification task that can be run both through lighteval (with a fixed judge) and through Inspect (any model as judge). A hedged scorer sketch follows below.
Tests
New file: `tests/unit/metrics/test_tvd_mi.py`, covering (among other cases) the `NaN` edge case.

Documentation
Updates the metric list (LLM-as-Judge section) with the new `tvd_mi` entry.
Usage
To enable the metric in a task config, add `Metrics.tvd_mi` to the task's metric list (see the sketch below).

Assumes the task formatter yields docs with:

- `response_a: str`
- `response_b: str`
- `pair_label: int` (`1` or `0`)

Validation
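A hedged sketch of wiring this into a community task. The dataset repo, task name, prompt function, and where the judge reads the paired responses from (`doc.specific` here) are assumptions, and the `LightevalTaskConfig` field names follow the community-task template, which may differ across lighteval versions.

```python
# Illustrative only; not part of this PR.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def tvd_mi_prompt(line: dict, task_name: str) -> Doc:
    # Expose the paired responses and the pair label that the metric expects.
    return Doc(
        task_name=task_name,
        query=line["response_a"],
        choices=[line["response_b"]],
        gold_index=0,
        specific={
            "response_a": line["response_a"],
            "response_b": line["response_b"],
            "pair_label": line["pair_label"],  # 1 = same item/source, 0 = different
        },
    )


tvd_mi_task = LightevalTaskConfig(
    name="tvd_mi_pairs",                      # illustrative task name
    prompt_function=tvd_mi_prompt,
    suite=["community"],
    hf_repo="your-org/paired-responses",      # placeholder dataset
    hf_subset="default",
    evaluation_splits=["test"],
    metric=[Metrics.tvd_mi],                  # the metric added by this PR
)
```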
```bash
pytest tests/unit/metrics/test_tvd_mi.py -q
pytest tests/unit/metrics -q \
  --ignore=tests/unit/metrics/test_metric_requests.py \
  -k "not extractiveness"
```

Ready for review.
Feedback welcome on naming, default judge model, and placement within the metrics taxonomy.