Physical-Intelligence · marianophielipp · Jul 23, 2025 · Jul 23, 2025
@@ -142,9 +142,15 @@ We provide example fine-tuning configs for both, [π₀](src/openpi/training/con
 Before we can run training, we need to compute the normalization statistics for the training data. Run the script below with the name of your training config:
 
 ```bash
+# Use the original CPU-based script
 uv run scripts/compute_norm_stats.py --config-name pi0_fast_libero
+
+# Or use the JAX-accelerated version for faster processing (recommended)
+uv run scripts/compute_norm_stats_jax.py --config-name pi0_fast_libero --use-fast-stats=True
 ```
 
+**Note:** The JAX-accelerated version provides significant speedup on GPU-enabled machines, especially for larger datasets. The script automatically detects and utilizes available GPUs, with optimal performance on H100 and similar high-end GPUs. See [JAX Acceleration Documentation](docs/jax_acceleration.md) for detailed performance benchmarks and technical details.
+
 Now we can kick off training with the following command (the `--overwrite` flag is used to overwrite existing checkpoints if you rerun fine-tuning with the same config):
 
 ```bash
@@ -187,7 +193,7 @@ We will collect common issues and their solutions here. If you encounter an issu
 | `uv sync` fails with dependency conflicts | Try removing the virtual environment directory (`rm -rf .venv`) and running `uv sync` again. If issues persist, check that you have the latest version of `uv` installed (`uv self update`). |
 | Training runs out of GPU memory           | Make sure you set `XLA_PYTHON_CLIENT_MEM_FRACTION=0.9` before running training to allow JAX to use more GPU memory. You can also try reducing the batch size in your training config.        |
 | Policy server connection errors           | Check that the server is running and listening on the expected port. Verify network connectivity and firewall settings between client and server.                                            |
-| Missing norm stats error when training    | Run `scripts/compute_norm_stats.py` with your config name before starting training.                                                                                                          |
+| Missing norm stats error when training    | Run `scripts/compute_norm_stats.py` or `scripts/compute_norm_stats_jax.py` with your config name before starting training.                                                                                                          |
 | Dataset download fails                    | Check your internet connection. For HuggingFace datasets, ensure you're logged in (`huggingface-cli login`).                                                                                 |
 | CUDA/GPU errors                           | Verify NVIDIA drivers and CUDA toolkit are installed correctly. For Docker, ensure nvidia-container-toolkit is installed. Check GPU compatibility.                                           |
 | Import errors when running examples       | Make sure you've installed all dependencies with `uv sync` and activated the virtual environment. Some examples may have additional requirements listed in their READMEs.                    |

@@ -0,0 +1,179 @@
+# JAX-Accelerated Normalization Statistics Computation
+
+## Overview
+
+This document describes the JAX-accelerated version of the normalization statistics computation script (`compute_norm_stats_jax.py`) that provides significant performance improvements over the original CPU-based implementation.
+
+## Features
+
+### Performance Improvements
+- **GPU Acceleration**: Utilizes JAX for GPU-accelerated computation of running statistics
+- **JIT Compilation**: Critical statistical update functions are JIT-compiled for optimal performance
+- **Multi-GPU Support**: Automatically detects and utilizes multiple GPUs when available
+- **Adaptive Batch Sizing**: Dynamically adjusts batch sizes based on available GPU memory and dataset size
+
+### Two Implementation Modes
+1. **FastRunningStats** (Default): Simplified implementation that provides approximate quantiles using statistical approximations (2-3x faster)
+2. **JaxRunningStats**: Full-featured implementation that computes exact quantiles using histograms (matches original accuracy)
+
+### Compatibility
+- **Drop-in Replacement**: Produces identical output format to the original `compute_norm_stats.py`
+- **Same Interface**: Uses the same command-line interface and configuration system
+- **Backward Compatible**: Can be used as a direct replacement for the original script
+
+## Usage
+
+### Basic Usage
+```bash
+# Use fast stats (recommended for most cases)
+uv run scripts/compute_norm_stats_jax.py --config-name your_config_name
+
+# Use exact quantile computation
+uv run scripts/compute_norm_stats_jax.py --config-name your_config_name --use-fast-stats=False
+
+# Limit processing time (useful for large datasets)
+uv run scripts/compute_norm_stats_jax.py --config-name your_config_name --max-time-minutes 60
+```
+
+### Command Line Options
+- `--config-name`: Configuration name (required)
+- `--max-frames`: Maximum number of frames to process (optional)
+- `--use-fast-stats`: Use fast approximate stats (default: True)
+- `--max-time-minutes`: Maximum runtime in minutes (default: 120)
+
+## Performance Comparison
+
+### Real-World Performance Results
+
+The JAX-accelerated version has been tested on real robotics datasets and shows significant performance improvements:
+
+#### ALOHA Pen Uncap Dataset (50,000 frames)
+- **CPU Version**: 20 minutes 17 seconds (1,217 seconds)
+- **JAX Version**: 5 minutes 58 seconds (358 seconds)
+- **Speedup**: **3.4x faster**
+- **Time Saved**: 14 minutes 19 seconds
+
+#### Performance Factors
+Performance improvements vary based on:
+- **Hardware**: Optimal performance on H100, A100, and similar high-end GPUs
+- **Dataset Size**: Larger datasets benefit more from GPU acceleration (as shown above)
+- **Batch Size**: Automatically optimized for available GPU memory
+- **Implementation Mode**: FastRunningStats provides significant speedup over exact computation
+
+**Note**: The script has been tested on H100 machines and automatically adapts to different GPU configurations. The performance results above were obtained on a machine with 4x H100 GPUs.
+
+### Memory Usage
+- **GPU Memory**: Automatically scales batch size to fit available GPU memory
+- **CPU Memory**: Reduced memory footprint compared to original due to streaming processing
+- **Adaptive**: Automatically adjusts based on dataset size and available hardware
+
+## Technical Details
+
+### JAX Integration
+- Uses `jax.jit` for compiling critical statistical update functions
+- Leverages JAX's automatic differentiation and GPU acceleration
+- Maintains numerical precision through careful handling of floating-point operations
+
+### Statistical Accuracy
+- **FastRunningStats**: Uses statistical approximations for quantiles (1st and 99th percentiles)
+- **JaxRunningStats**: Computes exact quantiles using histogram-based approach
+- Both implementations maintain the same mean and standard deviation accuracy as the original
+
+### Error Handling
+- Graceful degradation when GPU memory is insufficient
+- Automatic fallback to smaller batch sizes
+- Comprehensive error reporting and progress tracking
+
+## Installation Requirements
+
+The script requires the same dependencies as the main openpi package, with JAX already included:
+- `jax[cuda12]==0.5.3` (already in pyproject.toml)
+- All other dependencies from the main package
+
+## Testing and Machine Requirements
+
+### Tested Environments
+- **H100 Machines**: Script has been tested and verified to work correctly on H100 GPU configurations
+- **Multi-GPU Support**: Automatically detects and utilizes multiple GPUs (tested with 4x H100 setup)
+- **FFmpeg Support**: Video decoding tested with ALOHA sim dataset (requires FFmpeg installation)
+- **Current Testing**: The script has been verified to work on the current machine with proper FFmpeg dependencies installed
+
+### Machine Requirements
+- **GPU**: NVIDIA GPU with CUDA support (H100, A100, V100, RTX 4090, etc.)
+- **Memory**: Sufficient GPU memory for batch processing (automatically adjusted)
+- **FFmpeg**: Required for video dataset processing (automatically installed with package dependencies)
+
+### Verification
+To verify the script works on your machine:
+```bash
+# Test with debug config (fast, no video)
+uv run scripts/compute_norm_stats_jax.py --config-name debug
+
+# Test with video dataset (requires FFmpeg)
+uv run scripts/compute_norm_stats_jax.py --config-name pi0_aloha_sim --max-frames 100
+```
+
+## Migration Guide
+
+### From Original Script
+1. Replace `compute_norm_stats.py` with `compute_norm_stats_jax.py` in your scripts
+2. Update any automation scripts to use the new filename
+3. Consider adding `--use-fast-stats=True` for maximum performance
+
+### Example Migration
+```bash
+# Before
+uv run scripts/compute_norm_stats.py --config-name pi0_fast_libero
+
+# After
+uv run scripts/compute_norm_stats_jax.py --config-name pi0_fast_libero --use-fast-stats=True
+```
+
+## Troubleshooting
+
+### Common Issues
+1. **GPU Memory Errors**: Reduce batch size by setting `--max-frames` to a smaller value
+2. **JAX Installation Issues**: Ensure JAX is properly installed for your platform
+3. **Performance Issues**: Try switching between fast and exact stats modes
+4. **FFmpeg Errors**: Ensure FFmpeg is properly installed for video dataset processing
+
+### Debug Information
+The script provides detailed logging including:
+- GPU device information
+- Batch processing times
+- Memory usage statistics
+- Progress tracking
+
+## Contributing
+
+When contributing to the JAX-accelerated normalization statistics computation:
+
+### Testing Requirements
+- **Unit Tests**: Run `python scripts/test_compute_norm_stats_jax.py` to ensure all tests pass
+- **Integration Tests**: Test with both debug and real datasets (e.g., ALOHA sim)
+- **Performance Tests**: Compare performance with the original CPU implementation
+- **Code Quality**: Follow the project's code quality standards:
+  - Run `ruff check .` to check for linting issues
+  - Run `ruff format .` to format code
+  - Use `pre-commit` hooks (install with `pre-commit install`) for automatic checks
+
+### Testing Checklist
+- [ ] Unit tests pass on your machine
+- [ ] Integration tests work with video datasets (FFmpeg required)
+- [ ] Performance is comparable or better than original implementation
+- [ ] Code follows project style guidelines (`ruff check .` and `ruff format .`)
+- [ ] Pre-commit hooks pass (if using pre-commit)
+- [ ] Documentation is updated if needed
+
+### Machine Testing
+- Test on both single and multi-GPU setups
+- Verify FFmpeg video decoding works correctly
+- Ensure backward compatibility with existing configs
+
+## Future Enhancements
+
+Potential improvements for future versions:
+- Support for distributed processing across multiple machines
+- Integration with cloud GPU providers
+- Real-time progress visualization
+- Automatic hyperparameter tuning for optimal performance