A lightweight library for validating lerobot dataset metadata and annotations, and computing GCP upload paths.
> 📋 **Design Documentation:** For complete specifications and design details, see the Custom Data Schema Design Doc.
## Installation

```bash
pip install -r requirements.txt
```

To work with datasets stored in GCS, authenticate with gcloud:

```bash
gcloud auth login
gcloud auth application-default login
```

## Quick Start

```bash
# Validate training/teleop data
python validate.py validate \
    --dataset-path ./my-dataset \
    --data-type teleop

# Validate evaluation data
python validate.py validate \
    --dataset-path ./my-dataset \
    --data-type eval
```

Compute the GCP upload path for a dataset:

```bash
python validate.py compute-path \
    --dataset-path ./my-dataset \
    --dataset-name my-robot-data \
    --bucket-name my-gcs-bucket \
    --data-type eval
```

## Features

- ✅ Validates `custom_metadata.csv` with required columns
- ✅ Validates `custom_annotation.json` structure (optional file)
- ✅ Checks the lerobot dataset for `fps`
- ✅ Cross-validates: interventions only for eval episodes, time boundaries
- ✅ Computes GCP upload paths with custom prefixes
- ✅ Supports both local paths and GCP URIs (`gs://`)
- ✅ Two separate CLI commands: `validate` and `compute-path`
- ✅ Type-safe CLI using tyro
## Dataset Structure

Your dataset should have:

```
my-dataset/
└── meta/
    ├── info.json               # Must contain "fps"
    ├── custom_metadata.csv     # Episode metadata (required)
    └── custom_annotation.json  # Episode annotations (optional)
```
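Before running the validator, a quick pre-flight check like the following can confirm the expected files are in place (a stdlib sketch based on the layout above; it is not part of the library):

```python
from pathlib import Path

meta = Path("./my-dataset") / "meta"

# info.json and custom_metadata.csv are required; custom_annotation.json is optional
for required in ("info.json", "custom_metadata.csv"):
    if not (meta / required).exists():
        raise FileNotFoundError(f"Missing required file: {meta / required}")

print(f"Annotations present: {(meta / 'custom_annotation.json').exists()}")
```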
### info.json

Must contain:

- `fps`: Data collection frequency (frames per second)

Example:

```json
{
  "fps": 30,
  "robot_type": "manipulator"
}
```

Note: In lerobot datasets, `info.json` is stored in the `meta/` folder.
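A minimal stdlib check for the `fps` field might look like this (an illustrative sketch, not the library's own check):

```python
import json
from pathlib import Path

info = json.loads(Path("./my-dataset/meta/info.json").read_text())
fps = info.get("fps")
if not isinstance(fps, (int, float)) or fps <= 0:
    raise ValueError(f"info.json must contain a positive 'fps' field, got {fps!r}")
print(f"fps = {fps}")
```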
### custom_metadata.csv

Must have exactly these columns:

| Column | Type | Description |
|---|---|---|
| `episode_index` | int | Episode number |
| `operator_id` | string | Operator identifier |
| `is_eval_episode` | boolean | `True` for eval, `False` for training |
| `episode_id` | string | Unique episode identifier |
| `start_timestamp` | float | UTC seconds (Unix epoch time) |
| `checkpoint_path` | string | GCS URI (only for eval episodes) |
| `success` | boolean | Whether the episode was successful |
| `station_id` | string | Station/scene identifier |
| `robot_id` | string | Robot hardware identifier |

Example:

```csv
episode_index,operator_id,is_eval_episode,episode_id,start_timestamp,checkpoint_path,success,station_id,robot_id
0,operator_alice,True,ep_001,1730455200,gs://my-bucket/checkpoints/policy_v1.0.pth,True,station_01,robot_alpha
1,operator_bob,False,ep_002,1730458800,,True,station_01,robot_alpha
```

Important Notes:

- `start_timestamp` must be UTC seconds (Unix epoch time), not ISO format
  - ✅ Valid: `1730455200`
  - ❌ Invalid: `2024-11-01T10:00:00`
- `checkpoint_path` should only be set for eval episodes (`is_eval_episode=True`)
- `checkpoint_path` must be a valid GCS URI: `gs://bucket/path/to/checkpoint`

See `examples/example_dataset/meta/custom_metadata.csv` for a complete example.
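For a quick local sanity check before running the full validator, a pandas sketch along these lines can catch column problems early (pandas is assumed here for illustration; it is not a stated dependency):

```python
import pandas as pd

REQUIRED = [
    "episode_index", "operator_id", "is_eval_episode", "episode_id",
    "start_timestamp", "checkpoint_path", "success", "station_id", "robot_id",
]

df = pd.read_csv("./my-dataset/meta/custom_metadata.csv")

missing = set(REQUIRED) - set(df.columns)
extra = set(df.columns) - set(REQUIRED)
assert not missing, f"Missing required columns: {missing}"
assert not extra, f"Unexpected columns: {extra}"
assert df["episode_id"].is_unique, "episode_id values must be unique"
```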
### custom_annotation.json

Must follow this structure:

```json
{
  "episodes": [
    {
      "episode_id": "ep_001",
      "spans": [
        {"start_time": 0.0, "end_time": 5.0, "label": "grasp"},
        {"start_time": 2.0, "end_time": 3.0, "label": "human_intervention"}
      ],
      "extras": {"notes": "optional metadata"}
    }
  ]
}
```

- `spans`: List of time-based annotations with `start_time`, `end_time`, and `label`
- `start_time` and `end_time` are relative seconds from episode start (just like timestamps in LeRobot data)
- Use the label `"human_intervention"` for human interventions during policy rollout
- `extras`: Free-form metadata (optional)

Note: This file is optional. If missing, validation will still pass.

See `examples/example_dataset/meta/custom_annotation.json` for a complete example.
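A small structural check over the spans, following the schema above (stdlib-only sketch):

```python
import json
from pathlib import Path

annotations = json.loads(Path("./my-dataset/meta/custom_annotation.json").read_text())

for episode in annotations["episodes"]:
    for span in episode.get("spans", []):
        start, end = span["start_time"], span["end_time"]
        # Span times are relative seconds from episode start:
        # they must be non-negative and properly ordered.
        if start < 0 or end < 0 or not start < end:
            raise ValueError(f"Bad span in {episode['episode_id']}: {span}")
```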
## CLI Commands

The validator provides two separate commands.

### validate

Validates dataset metadata and annotations without computing upload paths.

```bash
python validate.py validate \
    --dataset-path PATH \
    --data-type TYPE
```

Arguments:

- `--dataset-path`: Path to the dataset directory (local or GCP URI like `gs://bucket/path`)
- `--data-type`: Either `teleop` (training) or `eval` (evaluation)

Examples:

```bash
# Validate local training data
python validate.py validate --dataset-path ./my-dataset --data-type teleop

# Validate GCP evaluation data
python validate.py validate --dataset-path gs://my-bucket/datasets/my-dataset --data-type eval
```

### compute-path

Computes the GCP upload path for a dataset (validates by default).
```bash
python validate.py compute-path \
    --dataset-path PATH \
    --dataset-name NAME \
    --bucket-name BUCKET \
    --data-type TYPE \
    [--dataset-version VERSION] \
    [--custom-folder-prefix PREFIX] \
    [--skip-validation]
```

Arguments:

- `--dataset-path`: Path to the dataset directory (required)
- `--dataset-name`: Dataset name for the GCP path (required)
- `--bucket-name`: GCS bucket name (required)
- `--data-type`: Either `teleop` or `eval` (required)
- `--dataset-version`: Version string (optional, default: timestamp)
- `--custom-folder-prefix`: Custom folder prefix (optional, e.g., `experiments/phase-1`)
- `--skip-validation`: Skip validation and only compute the path (optional)

Examples:

```bash
# Compute path with validation
python validate.py compute-path \
    --dataset-path ./my-dataset \
    --dataset-name robot-manipulation \
    --bucket-name production-data \
    --data-type eval

# With custom version and prefix
python validate.py compute-path \
    --dataset-path gs://source-bucket/datasets/my-dataset \
    --dataset-name robot-manipulation \
    --bucket-name target-bucket \
    --data-type teleop \
    --dataset-version v2.1.0 \
    --custom-folder-prefix experiments/phase-1

# Skip validation (faster, but not recommended)
python validate.py compute-path \
    --dataset-path ./my-dataset \
    --dataset-name my-data \
    --bucket-name my-bucket \
    --data-type eval \
    --skip-validation
```

Built-in help is available for each command:

```bash
python validate.py --help
python validate.py validate --help
python validate.py compute-path --help
```

## Data Types

The validator uses two data types:
| Data Type | Description | `is_eval_episode` |
|---|---|---|
| `teleop` | Training/teleoperation data | `False` |
| `eval` | Evaluation/policy rollout data | `True` |

Important: All episodes in a dataset must have `is_eval_episode` values consistent with the specified data type; the sketch below shows how to spot mismatches.
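To locate offending rows before validation fails, something like this pandas sketch works (pandas assumed for illustration, not a listed dependency):

```python
import pandas as pd

data_type = "eval"  # or "teleop"
df = pd.read_csv("./my-dataset/meta/custom_metadata.csv")

expected = data_type == "eval"
mismatched = df[df["is_eval_episode"] != expected]
if not mismatched.empty:
    print(f"Episodes whose is_eval_episode is not {expected}:")
    print(mismatched[["episode_index", "episode_id", "is_eval_episode"]])
```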
## Validation Rules

### custom_metadata.csv

- All required columns must be present
- No extra columns allowed
- `episode_id` must be unique
- `is_eval_episode` and `success` must be boolean
- `start_timestamp` must be UTC seconds (Unix epoch time) in the range of years 2000-2100
- `checkpoint_path` must be a valid GCS URI (`gs://bucket/path`) when specified
- `checkpoint_path` should only be set for eval episodes (`is_eval_episode=True`)

### custom_annotation.json

- Must follow the required schema structure
- `spans` must have `start_time < end_time` (timestamps are relative seconds from episode start)
- No negative time values allowed
- Proper JSON structure required

### lerobot dataset

- Must contain an `fps` field in `info.json`

### Cross-validation

- Human intervention constraint: spans with the label `"human_intervention"` are only allowed for eval episodes (`is_eval_episode=True`); see the sketch after this list
- Time boundary constraint: all span times must be ≤ the episode duration
- Data type consistency:
  - `--data-type teleop`: all episodes must have `is_eval_episode=False`
  - `--data-type eval`: all episodes must have `is_eval_episode=True`
- Checkpoint path constraint: `checkpoint_path` must not be specified for non-eval episodes
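The human-intervention rule can be checked standalone by joining the two files; this is an illustrative sketch, not the validator's internal implementation:

```python
import json

import pandas as pd

df = pd.read_csv("./my-dataset/meta/custom_metadata.csv")
eval_ids = set(df.loc[df["is_eval_episode"], "episode_id"])

with open("./my-dataset/meta/custom_annotation.json") as f:
    annotations = json.load(f)

for episode in annotations["episodes"]:
    labels = {span["label"] for span in episode.get("spans", [])}
    if "human_intervention" in labels and episode["episode_id"] not in eval_ids:
        print(f"{episode['episode_id']}: human_intervention span on a non-eval episode")
```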
## GCP Upload Path Format

The computed GCP path follows this format:

```
gs://bucket/[custom_prefix/]dataset/version/data_type/
```

Examples:

- Eval data: `gs://my-bucket/dataset/v1.0/eval/`
- Teleop data: `gs://my-bucket/dataset/v1.0/teleop/`
- With prefix: `gs://my-bucket/experiments/batch-1/dataset/v1.0/eval/`
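The template is simple enough to reproduce in a few lines; this sketch mirrors the format above (`build_upload_path` is a hypothetical helper, not the library's `compute_gcp_path`, whose signature appears in the Python API section below):

```python
def build_upload_path(bucket: str, dataset: str, version: str,
                      data_type: str, prefix: str | None = None) -> str:
    # gs://bucket/[custom_prefix/]dataset/version/data_type/
    parts = [bucket] + ([prefix] if prefix else []) + [dataset, version, data_type]
    return "gs://" + "/".join(parts) + "/"

print(build_upload_path("my-bucket", "dataset", "v1.0", "eval"))
# gs://my-bucket/dataset/v1.0/eval/
print(build_upload_path("my-bucket", "dataset", "v1.0", "eval", prefix="experiments/batch-1"))
# gs://my-bucket/experiments/batch-1/dataset/v1.0/eval/
```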
## Local and GCP Paths

The validator supports both local filesystem paths and GCP URIs:

```bash
# Local path
python validate.py validate --dataset-path ./my-dataset --data-type eval

# GCP URI
python validate.py validate --dataset-path gs://my-bucket/datasets/my-dataset --data-type eval
```

All file operations work transparently with both path types.
## Python API

```python
from cloudpathlib import AnyPath

from lerobot_validator import LerobotDatasetValidator, compute_gcp_path

# Validate (expects files in the dataset's meta/ folder).
# AnyPath resolves to a local Path or a CloudPath as appropriate.
dataset_path = AnyPath("./dataset")  # or gs://bucket/dataset

validator = LerobotDatasetValidator(
    dataset_path=dataset_path,
    is_eval_data=True,  # True for eval, False for teleop
)

if validator.validate():
    print("✓ Validation passed!")

    # Compute the GCP upload path
    gcp_path = compute_gcp_path(
        dataset_name="my-dataset",
        bucket_name="my-gcs-bucket",
        data_type="eval",  # "teleop" or "eval"
        custom_folder_prefix="experiments/run-1",  # Optional
    )
    print(f"Upload to: {gcp_path}")
else:
    for error in validator.get_errors():
        print(f"Error: {error}")
```

## Troubleshooting

**"Missing required columns in metadata CSV"**
- Add all required columns: `episode_index`, `operator_id`, `is_eval_episode`, `episode_id`, `start_timestamp`, `checkpoint_path`, `success`, `station_id`, `robot_id`

**"Unexpected columns found"**

- Remove extra columns that are not in the required list

**"Column 'start_timestamp' must contain valid UTC timestamps in seconds"**

- Use Unix epoch time (e.g., `1730455200`), not ISO format (`2024-11-01T10:00:00`)
- Valid range: year 2000 to 2100

**"Episode has human_intervention span but is_eval_episode=False"**

- Human interventions are only allowed for eval episodes
- Either set `is_eval_episode=True` or remove the intervention spans

**"checkpoint_path should not be specified for non-eval episodes"**

- Only set `checkpoint_path` for eval episodes (`is_eval_episode=True`)
- Leave it empty for training/teleop episodes

**"Intervention time exceeds episode duration"**

- Check that all span `end_time` values are within the episode length

**"No task string found in lerobot dataset"**

- Ensure `info.json` contains a `"task"` field

**"Missing 'fps' field in info.json"**

- Add an `"fps"` field to `info.json` to specify the data collection frequency

**"path_to_policy_checkpoint must contain valid GCS URIs"**

- Use the format `gs://bucket/path/to/checkpoint.pth`
- Make sure URIs start with `gs://`
- Include both the bucket name and a path

**"Dataset is marked as eval/teleop data but episodes have mismatched is_eval_episode values"**

- Eval data: all episodes should have `is_eval_episode=True`
- Teleop data: all episodes should have `is_eval_episode=False`
## Examples and Demo

Complete example datasets are provided:

- `demo/sample_dataset/` - Working demo with 3 episodes
- `examples/example_dataset/` - Reference implementation

To run the demo:

```bash
python validate.py validate \
    --dataset-path demo/sample_dataset \
    --data-type eval
```

## Development

Install the development dependencies and run the test suite:

```bash
pip install -r requirements-dev.txt
pytest tests/
```

Format, sort imports, and type-check:

```bash
black lerobot_validator/ tests/
isort lerobot_validator/ tests/
mypy lerobot_validator/
```

## Project Structure

```
pi-data-sharing/
├── validate.py                 # Main entry point
├── lerobot_validator/          # Core library
│   ├── cli.py                  # CLI commands (validate, compute-path)
│   ├── gcp_path.py             # GCP path computation
│   ├── metadata_validator.py
│   ├── annotation_validator.py
│   ├── lerobot_checks.py
│   ├── validator.py
│   └── schemas.py
├── tests/                      # Test suite
├── examples/                   # Example CSV and JSON files
├── demo/                       # Working demo
├── TIMESTAMP_VALIDATION.md     # Timestamp format guide
└── README.md                   # This file
```
## Timestamp Format

The `start_timestamp` field must be in UTC seconds (Unix epoch time):

```python
from datetime import datetime, timezone

# Convert an ISO timestamp (interpreted as UTC) to Unix epoch seconds.
# Without the explicit tzinfo, .timestamp() would use the local timezone.
iso_time = "2024-11-01T14:00:00"
utc_seconds = int(datetime.fromisoformat(iso_time).replace(tzinfo=timezone.utc).timestamp())
print(utc_seconds)  # 1730469600
```

See TIMESTAMP_VALIDATION.md for detailed information and conversion examples.
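Going the other direction is useful for eyeball-checking stored values:

```python
from datetime import datetime, timezone

# Convert an epoch value back to an ISO string in UTC
print(datetime.fromtimestamp(1730469600, tz=timezone.utc).isoformat())
# 2024-11-01T14:00:00+00:00
```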
## License

Apache-2.0

## Support

- Design Documentation: Custom Data Schema Design Doc
- Examples: see the `examples/` and `demo/` directories
- Tests: check `tests/` for usage patterns
- Issues: see the Troubleshooting section above