AvgMinMax median approximation is inconsistent #360

@fsiino-nvidia

Describe the bug

The median value in dataset metrics (train_data_utils.py) can differ from run to run, even with identical input data. This causes validation failures when comparing metrics files: the _validate_aggregate_metrics function detects the difference in the median field and raises a ValueError about conflicting aggregate metrics.
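
To illustrate the failure mode, here is a minimal sketch of a strict numeric comparison between two metrics dictionaries. It is not the actual _validate_aggregate_metrics code: the function name find_numeric_mismatches, the rel_tol parameter, and the "Average" value are hypothetical, and only the two median values are taken from the error output in this report.

```python
import math


def find_numeric_mismatches(old, new, rel_tol=0.0):
    """Return a message for each shared numeric field whose values differ.

    Hypothetical helper for illustration; with rel_tol=0.0 the check is
    an exact equality test, so any drift in an approximate median fails.
    """
    mismatches = []
    for key in sorted(old.keys() & new.keys()):
        a, b = old[key], new[key]
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0):
            mismatches.append(f"Numeric mismatch at {key}: {a} != {b}")
    return mismatches


# "Average" is a made-up value; the medians come from the reported error.
previous = {"Average": 81.2, "Median": 80.33}
current = {"Average": 81.2, "Median": 80.44}

strict = find_numeric_mismatches(previous, current)
tolerant = find_numeric_mismatches(previous, current, rel_tol=0.01)

print(strict)    # the median differs, so the strict check reports it
print(tolerant)  # a 1% relative tolerance absorbs the drift
```

A tolerance would only hide the drift. The expected behavior described in this issue is a median that is identical across runs.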

Steps/Code to reproduce bug

Run data preparation on any dataset, e.g.:

config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/library_judge_math/configs/bytedtsinghua_dapo17k.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" \
    +output_dirpath=data/bytedtsinghua_dapo17k \
    +mode=train_preparation +should_download=true

This intermittently raises a ValueError about conflicting aggregate metrics:

Differences found in aggregate metrics:
[
    'Numeric mismatch at {field_name}.Median: 80.33 != 80.44'
]

...

Found conflicting aggregate metrics that need to be corrected:
- resources_servers/math_with_judge/data/dapo17k_train_metrics_conflict.json
- resources_servers/math_with_judge/data/dapo17k_validation_metrics_conflict.json

This could be due to a change in how metrics are calculated, leading to outdated metrics. Try deleting the below file(s) and rerunning data preparation:
- resources_servers/math_with_judge/data/dapo17k_train_metrics.json
- resources_servers/math_with_judge/data/dapo17k_validation_metrics.json

Expected behavior

Metrics should be deterministic. Running data preparation multiple times on the same dataset should produce identical metrics, including the median. The validation check should pass when re-running with unchanged data.

Configs
Any dataset configuration.

Environment details

N/A

Additional context

The AvgMinMax class uses TDigest for median estimation. TDigest returns an approximation of the median, and the estimate is not guaranteed to be identical across runs.
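
For contrast, an exact median computed from the full list of values is deterministic and does not depend on the order in which values arrive. The sketch below uses only the Python standard library. The helper name exact_median and the sample data are assumptions made for illustration, not part of AvgMinMax, and this approach holds all values in memory, which a streaming digest avoids.

```python
import random
import statistics


def exact_median(values):
    """Return the exact median of the values.

    statistics.median sorts a copy of the input, so the result is the
    same for any ordering of the same values.
    """
    return statistics.median(values)


# Made-up sample data standing in for a per-example numeric metric.
rng = random.Random(0)
sample = [rng.uniform(0, 200) for _ in range(1001)]

# Same values in a different arrival order.
shuffled = sample[:]
random.Random(1).shuffle(shuffled)

# Both orderings give exactly the same median.
assert exact_median(sample) == exact_median(shuffled)
```

Whether the extra memory is acceptable depends on dataset size. This sketch only shows that an exact computation removes the run-to-run variation.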

Labels

core-infra: Helpful infrastructure