
Conversation

@zrobertson466920 commented Nov 22, 2025

Add tvd_mi metric (LLM-as-a-judge, corpus-level)

Summary
Introduce a new corpus-level metric tvd_mi into lighteval. This metric implements the TVD-MI approach from the paper Let’s Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes. It estimates a lower bound on total variation mutual information between model responses by using paired responses and an LLM critic.

Implementation

Sample-level judge

  • Adds JudgeLLMTVDMI (subclass of JudgeLLM) configured with gpt-4o-2024-08-06 via the openai backend.
  • Implements prompt generation via get_judge_prompt_tvdmi(response_a, response_b, ...) which asks the judge to distinguish A: SAME TASK/SOURCE vs B: DIFFERENT TASK/SOURCE.
  • Adds process_judge_response_tvdmi(...) to map judge responses to binary predictions: A → 1, B → 0; case/whitespace normalized; unknown → fallback 0 with a warning. Both helpers are sketched below.
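For concreteness, here is a minimal sketch of the two helpers under the behaviour described above; the prompt wording and logging setup are illustrative, not the PR's verbatim code:

import logging

logger = logging.getLogger(__name__)


def get_judge_prompt_tvdmi(response_a: str, response_b: str, **kwargs) -> list[dict]:
    # Build a single user message asking the judge to classify the pair.
    content = (
        "You will see two responses.\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Do these responses come from the same task/source?\n"
        "Answer with a single letter: A (SAME TASK/SOURCE) or B (DIFFERENT TASK/SOURCE)."
    )
    return [{"role": "user", "content": content}]


def process_judge_response_tvdmi(response: str | None) -> int:
    # A -> 1 (same task/source), B -> 0 (different); case/whitespace normalized.
    normalized = (response or "").strip().upper()
    if normalized.startswith("A"):
        return 1
    if normalized.startswith("B"):
        return 0
    logger.warning("Unrecognized judge response %r; falling back to 0", response)
    return 0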

Corpus-level aggregation

  • Adds CorpusLevelTVDMI, which accepts sample-dicts of the form { "label": 0 or 1, "pred": 0 or 1, … }.
  • Computes:

TVD_MI = TPR + TNR − 1

where TPR = P(pred=1 | label=1) and TNR = P(pred=0 | label=0).

  • If either class is missing (no label=1 or no label=0) → returns NaN. A minimal aggregation sketch follows below.
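A minimal sketch of the aggregation, assuming sample dicts with integer "label" and "pred" fields as above (the exact call signature lighteval expects for corpus-level functions is an assumption here, not copied from the PR):

import numpy as np


class CorpusLevelTVDMI:
    def __call__(self, items: list[dict]) -> float:
        labels = np.array([item["label"] for item in items])
        preds = np.array([item["pred"] for item in items])
        pos, neg = labels == 1, labels == 0
        if not pos.any() or not neg.any():
            return float("nan")  # one class missing: TPR or TNR is undefined
        tpr = float((preds[pos] == 1).mean())  # P(pred=1 | label=1)
        tnr = float((preds[neg] == 0).mean())  # P(pred=0 | label=0)
        return tpr + tnr - 1.0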

Metric registration

  • Extends Metrics enum with:
  • metric_name = "tvd_mi"
  • sample_level_fn = JudgeLLMTVDMI()
  • corpus_level_fn = CorpusLevelTVDMI()
  • category = SamplingMethod.GENERATIVE
  • higher_is_better = True

Inspect-AI compatible scorer

To make this usable in Inspect-AI workflows, this PR also adds an Inspect-compatible scorer:

@scorer(metrics=[accuracy(), stderr()])
def tvd_mi_scorer():
    ...
  • The scorer expects the model under evaluation to receive a TVD-MI-style prompt (two responses A/B and instructions to answer with "A" or "B").
  • It uses the same normalization logic as the lighteval judge (process_judge_response_tvdmi) to map the completion into a binary label 0/1.
  • It compares this prediction against the gold label carried in target.text (e.g. "A", "B", "same", "different", "1", "0").
  • It returns Score(value="C") or Score(value="I") so Inspect can aggregate with accuracy() and stderr(). A self-contained sketch of the scorer is shown below.
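A self-contained sketch of such a scorer using the Inspect-AI scorer API; the response parser is re-declared here in simplified form so the sketch stands alone, and gold_to_binary is a hypothetical helper name, not the PR's code:

from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer, stderr
from inspect_ai.solver import TaskState


def process_judge_response_tvdmi(response: str | None) -> int:
    # Simplified re-declaration; the PR reuses the lighteval helper instead.
    normalized = (response or "").strip().upper()
    return 1 if normalized.startswith("A") else 0


def gold_to_binary(text: str) -> int:
    # Accepts the gold spellings mentioned above ("A", "same", "1" -> 1; everything else -> 0).
    return 1 if text.strip().lower() in {"a", "same", "1"} else 0


@scorer(metrics=[accuracy(), stderr()])
def tvd_mi_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        pred = process_judge_response_tvdmi(state.output.completion)
        gold = gold_to_binary(target.text)
        return Score(value=CORRECT if pred == gold else INCORRECT)

    return score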

This gives a consistent TVD-MI-style classification task that can be run both through lighteval (with a fixed judge) and through Inspect (any model as judge).

Tests

New file: tests/unit/metrics/test_tvd_mi.py, covering:

  • Prompt construction checks: responses correctly injected into the judge prompt, expected message structure
  • Response parser normalization and mapping tests
  • Corpus-level correctness: perfect critic → ~1.0, random critic → ~0.0, missing-class → NaN
  • Judge computation wiring via monkey-patching (no actual API calls) verifying keys & labels
  • Additional tests check that the Inspect scorer produces the correct score for matching and mismatched prediction/gold labels

Documentation

Updates metric list (LLM-as-Judge section) with:

tvd_mi: Corpus-level LLM-as-a-judge metric that estimates a lower bound on total variation mutual information using paired responses. Assumes each example has two responses and a binary label (1 = same item, 0 = different), and computes TPR + TNR − 1.

Usage

To enable the metric in a task config:

metrics:
- name: tvd_mi

Assumes the task formatter yields docs with the following fields (an illustrative formatter is sketched after this list):

  • response_a: str
  • response_b: str
  • pair_label: int (1 or 0)
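Purely as an illustration, a formatter could attach those fields via the doc's specific dict; the Doc construction below follows lighteval's usual prompt-function pattern, but the exact field placement is an assumption, not the PR's code:

from lighteval.tasks.requests import Doc


def tvd_mi_prompt_fn(line: dict, task_name: str = "") -> Doc:
    # Carry the paired responses and the binary pair label on the doc.
    return Doc(
        task_name=task_name,
        query="",  # no generation prompt is needed; the judge only sees the paired responses
        choices=[""],
        gold_index=0,
        specific={
            "response_a": line["response_a"],
            "response_b": line["response_b"],
            "pair_label": int(line["pair_label"]),  # 1 = same item/source, 0 = different
        },
    )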

Validation

  1. Unit tests passed:
pytest tests/unit/metrics/test_tvd_mi.py -q
pytest tests/unit/metrics -q \
    --ignore=tests/unit/metrics/test_metric_requests.py \
    -k "not extractiveness"
  2. Manual smoke test performed locally with synthetic pairs and a live judge: sample outputs and the corpus result behaved as expected.

Ready for review.
Feedback welcome on naming, default judge model, and placement within the metrics taxonomy.

return [{"role": "user", "content": content}]


def process_judge_response_tvdmi(response: str) -> int:
A reviewer (Member) commented with the following suggestion:

Suggested change
-def process_judge_response_tvdmi(response: str) -> int:
+def process_judge_response_tvdmi(response: str | None) -> int:

@HuggingFaceDocBuilderDev (Collaborator) commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NathanHB (Member) commented:

Hey @zrobertson466920, thanks for the PR!! Do you have an example of a benchmark that would use this / a use case?

Also, in an effort of standardisation we are trying to have inspect-ai compatible metrics and scorers. Would you mind also adding an inspect-ai compatible metric? That way we will be able to use it across many more tasks, evaluation tools and codebases!

@zrobertson466920 (Author) commented:

@NathanHB Thanks a lot for the review and the suggestions!

Do you have an example of a benchmark that would use this / a use case?

Yes, the main use case is label-free, information-oriented evaluation on paired responses, especially when we have multiple model outputs per item but no ground-truth reference. At a high level, TVD-MI provides a symmetry-aware way to measure the reliability and self-consistency of the responses with respect to a judge: if responses contain stable, item-specific information, the judge should be able to reliably tell whether two responses come from the same underlying item.

Concretely, a typical pattern is:

  • start from an existing generative benchmark (e.g. summarization, long-form QA, or a reasoning/analysis task),
  • collect multiple responses per item (from different models, or different decoding strategies / agents),
  • build pairs (response_a, response_b) and label them as:
    • 1 if they come from the same underlying item/task (or the same evidence),
    • 0 if they come from different items/tasks,
  • run tvd_mi to measure how well a critic can distinguish the induced response distributions, which in turn gives a lower bound on the total variation mutual information (a toy sketch of this pairing step follows after this list).
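Purely as an illustration, the pairing step could look roughly like this (build_pairs and its arguments are hypothetical, not part of the PR):

import random


def build_pairs(responses: list[list[str]], n_negative: int = 100, seed: int = 0) -> list[dict]:
    """responses[i] holds two or more model responses to item i."""
    rng = random.Random(seed)
    pairs = []
    # Positive pairs: two different responses to the same item (pair_label = 1).
    for item_responses in responses:
        a, b = rng.sample(item_responses, 2)
        pairs.append({"response_a": a, "response_b": b, "pair_label": 1})
    # Negative pairs: responses drawn from two different items (pair_label = 0).
    for _ in range(n_negative):
        i, j = rng.sample(range(len(responses)), 2)
        pairs.append({
            "response_a": rng.choice(responses[i]),
            "response_b": rng.choice(responses[j]),
            "pair_label": 0,
        })
    return pairs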

This is the setting used in Let’s Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes, where TVD-MI is used both to reason about agent information content and to evaluate judges themselves (e.g. robustness to gaming attacks). The intent here is to make this reliability style of evaluation easy to plug into lighteval benchmarks that already have multiple responses per item.

If it would be helpful, I’m happy to follow up with a small example benchmark config (e.g. a synthetic or toy dataset) that uses tvd_mi end-to-end.

Also, in an effort of standardisation we are trying to have inspect-ai compatible metrics and scorers. Would you mind also adding an inspect-ai compatible metric?

That makes sense, and I’ve now added one to this PR and updated the description. I defined it in metrics.py and reuse the same normalization logic as the lighteval TVD-MI judge to interpret the judge model’s output and compare it against the gold label. Happy to adjust the naming or placement of the scorer if you have a preferred pattern.

@NathanHB (Member) commented Dec 4, 2025:

Looking great 😎
Would love to have a benchmark that goes with it though! As it seems a very interesting metric, having an example that the community can look at / build upon would be nice.
Also, you will probably need to run ruff format with the correct version for the tests to pass :)
