
Conversation

@zrobertson466920 commented Nov 22, 2025

Add tvd_mi metric (LLM-as-a-judge, corpus-level)

Summary
Introduce a new corpus-level metric tvd_mi into lighteval. This metric implements the TVD-MI approach from the paper Let’s Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes. It estimates a lower bound on total variation mutual information between model responses by using paired responses and an LLM critic.

Implementation

Sample-level judge

  • Adds JudgeLLMTVDMI (subclass of JudgeLLM) configured with gpt-4o-2024-08-06 via the openai backend.
  • Implements prompt generation via get_judge_prompt_tvdmi(response_a, response_b, ...) which asks the judge to distinguish A: SAME TASK/SOURCE vs B: DIFFERENT TASK/SOURCE.
  • Adds process_judge_response_tvdmi(...) to map judge responses to binary predictions: A → 1, B → 0; case/whitespace normalized; unknown → fallback 0 with a warning. Both helpers are sketched below.
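For concreteness, here is a minimal sketch of the two helpers under the behaviour described above; the prompt wording and logging setup are illustrative, not the PR's verbatim code:

import logging

logger = logging.getLogger(__name__)


def get_judge_prompt_tvdmi(response_a: str, response_b: str, **kwargs) -> list[dict]:
    # Build a single user message asking the judge to classify the pair.
    content = (
        "You will see two responses.\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Do these responses come from the same task/source?\n"
        "Answer with a single letter: A (SAME TASK/SOURCE) or B (DIFFERENT TASK/SOURCE)."
    )
    return [{"role": "user", "content": content}]


def process_judge_response_tvdmi(response: str | None) -> int:
    # A -> 1 (same task/source), B -> 0 (different); case/whitespace normalized.
    normalized = (response or "").strip().upper()
    if normalized.startswith("A"):
        return 1
    if normalized.startswith("B"):
        return 0
    logger.warning("Unrecognized judge response %r; falling back to 0", response)
    return 0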

Corpus-level aggregation

  • Adds CorpusLevelTVDMI, which accepts sample-dicts of the form { "label": 0 or 1, "pred": 0 or 1, … }.
  • Computes:

TVD_MI = TPR + TNR − 1

where TPR = P(pred=1 | label=1) and TNR = P(pred=0 | label=0).

  • If either class is missing (no label=1 or no label=0) → returns NaN. A minimal aggregation sketch follows below.
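A minimal sketch of the aggregation, assuming sample dicts with integer "label" and "pred" fields as above (the exact call signature lighteval expects for corpus-level functions is an assumption here, not copied from the PR):

import numpy as np


class CorpusLevelTVDMI:
    def __call__(self, items: list[dict]) -> float:
        labels = np.array([item["label"] for item in items])
        preds = np.array([item["pred"] for item in items])
        pos, neg = labels == 1, labels == 0
        if not pos.any() or not neg.any():
            return float("nan")  # one class missing: TPR or TNR is undefined
        tpr = float((preds[pos] == 1).mean())  # P(pred=1 | label=1)
        tnr = float((preds[neg] == 0).mean())  # P(pred=0 | label=0)
        return tpr + tnr - 1.0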

Metric registration

  • Extends Metrics enum with:
  • metric_name = "tvd_mi"
  • sample_level_fn = JudgeLLMTVDMI()
  • corpus_level_fn = CorpusLevelTVDMI()
  • category = SamplingMethod.GENERATIVE
  • higher_is_better = True

Inspect-AI compatible scorer

To make this usable in Inspect-AI workflows, this PR also adds an Inspect-compatible scorer:

@scorer(metrics=[accuracy(), stderr()])
def tvd_mi_scorer():
    ...
  • The scorer expects the model under evaluation to receive a TVD-MI-style prompt (two responses A/B and instructions to answer with "A" or "B").
  • It uses the same normalization logic as the lighteval judge (process_judge_response_tvdmi) to map the completion into a binary label 0/1.
  • It compares this prediction against the gold label carried in target.text (e.g. "A", "B", "same", "different", "1", "0").
  • It returns Score(value="C") or Score(value="I") so Inspect can aggregate with accuracy() and stderr(). A self-contained sketch of the scorer is shown below.
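A self-contained sketch of such a scorer using the Inspect-AI scorer API; the response parser is re-declared here in simplified form so the sketch stands alone, and gold_to_binary is a hypothetical helper name, not the PR's code:

from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer, stderr
from inspect_ai.solver import TaskState


def process_judge_response_tvdmi(response: str | None) -> int:
    # Simplified re-declaration; the PR reuses the lighteval helper instead.
    normalized = (response or "").strip().upper()
    return 1 if normalized.startswith("A") else 0


def gold_to_binary(text: str) -> int:
    # Accepts the gold spellings mentioned above ("A", "same", "1" -> 1; everything else -> 0).
    return 1 if text.strip().lower() in {"a", "same", "1"} else 0


@scorer(metrics=[accuracy(), stderr()])
def tvd_mi_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        pred = process_judge_response_tvdmi(state.output.completion)
        gold = gold_to_binary(target.text)
        return Score(value=CORRECT if pred == gold else INCORRECT)

    return score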

This gives a consistent TVD-MI-style classification task that can be run both through lighteval (with a fixed judge) and through Inspect (any model as judge).

Tests

New file: tests/unit/metrics/test_tvd_mi.py, covering:

  • Prompt construction checks: responses correctly injected into the judge prompt, expected message structure
  • Response parser normalization and mapping tests
  • Corpus-level correctness: perfect critic → ~1.0, random critic → ~0.0, missing-class → NaN
  • Judge computation wiring via monkey-patching (no actual API calls) verifying keys & labels
  • Additional tests check that the Inspect scorer produces the correct score for matching and mismatched prediction/gold labels

Documentation

Updates metric list (LLM-as-Judge section) with:

tvd_mi: Corpus-level LLM-as-a-judge metric that estimates a lower bound on total variation mutual information using paired responses. Assumes each example has two responses and a binary label (1 = same item, 0 = different), and computes TPR + TNR − 1.

Usage

To enable the metric in a task config:

metrics:
- name: tvd_mi

Assumes the task formatter yields docs with the following fields (an illustrative formatter is sketched after this list):

  • response_a: str
  • response_b: str
  • pair_label: int (1 or 0)
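Purely as an illustration, a formatter could attach those fields via the doc's specific dict; the Doc construction below follows lighteval's usual prompt-function pattern, but the exact field placement is an assumption, not the PR's code:

from lighteval.tasks.requests import Doc


def tvd_mi_prompt_fn(line: dict, task_name: str = "") -> Doc:
    # Carry the paired responses and the binary pair label on the doc.
    return Doc(
        task_name=task_name,
        query="",  # no generation prompt is needed; the judge only sees the paired responses
        choices=[""],
        gold_index=0,
        specific={
            "response_a": line["response_a"],
            "response_b": line["response_b"],
            "pair_label": int(line["pair_label"]),  # 1 = same item/source, 0 = different
        },
    )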

Validation

  1. Unit tests passed:
pytest tests/unit/metrics/test_tvd_mi.py -q
pytest tests/unit/metrics -q \
    --ignore=tests/unit/metrics/test_metric_requests.py \
    -k "not extractiveness"
  2. Manual smoke test performed locally with synthetic pairs and a live judge: sample outputs and the corpus result behaved as expected.

Ready for review.
Feedback welcome on naming, default judge model, and placement within the metrics taxonomy.

return [{"role": "user", "content": content}]


def process_judge_response_tvdmi(response: str) -> int:
A reviewer (Member) commented with the following suggestion:

Suggested change
-def process_judge_response_tvdmi(response: str) -> int:
+def process_judge_response_tvdmi(response: str | None) -> int:

@HuggingFaceDocBuilderDev (Collaborator) commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NathanHB (Member) commented:

Hey @zrobertson466920, thanks for the PR!! Do you have an example of a benchmark that would use this / a use case?

Also, in an effort of standardisation we are trying to have inspect-ai compatible metrics and scorers. Would you mind also adding an inspect-ai compatible metric? That way we will be able to use it across many more tasks, evaluation tools and codebases!

@zrobertson466920 (Author) commented:

@NathanHB Thanks a lot for the review and the suggestions!

Do you have an example of a benchmark that would use this / a use case?

Yes, the main use case is label-free, information-oriented evaluation on paired responses, especially when we have multiple model outputs per item but no ground-truth reference. At a high level, TVD-MI provides a symmetry-aware way to measure the reliability and self-consistency of the responses with respect to a judge: if responses contain stable, item-specific information, the judge should be able to reliably tell whether two responses come from the same underlying item.

Concretely, a typical pattern is:

  • start from an existing generative benchmark (e.g. summarization, long-form QA, or a reasoning/analysis task),
  • collect multiple responses per item (from different models, or different decoding strategies / agents),
  • build pairs (response_a, response_b) and label them as:
    • 1 if they come from the same underlying item/task (or the same evidence),
    • 0 if they come from different items/tasks,
  • run tvd_mi to measure how well a critic can distinguish the induced response distributions, which in turn gives a lower bound on the total variation mutual information (a toy sketch of this pairing step follows after this list).
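Purely as an illustration, the pairing step could look roughly like this (build_pairs and its arguments are hypothetical, not part of the PR):

import random


def build_pairs(responses: list[list[str]], n_negative: int = 100, seed: int = 0) -> list[dict]:
    """responses[i] holds two or more model responses to item i."""
    rng = random.Random(seed)
    pairs = []
    # Positive pairs: two different responses to the same item (pair_label = 1).
    for item_responses in responses:
        a, b = rng.sample(item_responses, 2)
        pairs.append({"response_a": a, "response_b": b, "pair_label": 1})
    # Negative pairs: responses drawn from two different items (pair_label = 0).
    for _ in range(n_negative):
        i, j = rng.sample(range(len(responses)), 2)
        pairs.append({
            "response_a": rng.choice(responses[i]),
            "response_b": rng.choice(responses[j]),
            "pair_label": 0,
        })
    return pairs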

This is the setting used in Let’s Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes, where TVD-MI is used both to reason about agent information content and to evaluate judges themselves (e.g. robustness to gaming attacks). The intent here is to make this reliability style of evaluation easy to plug into lighteval benchmarks that already have multiple responses per item.

If it would be helpful, I’m happy to follow up with a small example benchmark config (e.g. a synthetic or toy dataset) that uses tvd_mi end-to-end.

Also, in an effort of standardisation we are trying to have inspect-ai compatible metrics and scorers. Would you mind also adding an inspect-ai compatible metric?

That makes sense, and I’ve now added one to this PR and updated the description. I defined it in metrics.py and reuse the same normalization logic as the lighteval TVD-MI judge to interpret the judge model’s output and compare it against the gold label. Happy to adjust the naming or placement of the scorer if you have a preferred pattern.

@NathanHB (Member) commented Dec 4, 2025:

Looking great 😎
Would love to have a benchmark that goes with it though! As it seems a very interesting metric, having an example that the community can look at / build upon would be nice.
Also, you will probably need to run ruff format with the correct version for the tests to pass :)
