ZeroVO: Visual Odometry with Minimal Assumptions

This repository contains the code that accompanies our CVPR 2025 paper ZeroVO: Visual Odometry with Minimal Assumptions. Please visit our project page for more details.

Example 1

Overview

We introduce ZeroVO, a novel visual odometry (VO) algorithm that achieves zero-shot generalization across diverse cameras and environments, overcoming limitations in existing methods that depend on predefined or static camera calibration setups. Our approach incorporates three main innovations. First, we design a calibration-free, geometry-aware network structure capable of handling noise in estimated depth and camera parameters. Second, we introduce a language-based prior that infuses semantic information to enhance robust feature extraction and generalization to previously unseen domains. Third, we develop a flexible, semi-supervised training paradigm that iteratively adapts to new scenes using unlabeled data, further boosting the models' ability to generalize across diverse real-world scenarios. We analyze complex autonomous driving contexts, demonstrating over 30% improvement against prior methods on three standard benchmarks (KITTI, nuScenes, and Argoverse 2), as well as a newly introduced, high-fidelity synthetic dataset derived from Grand Theft Auto (GTA). By not requiring fine-tuning or camera calibration, our work broadens the applicability of VO, providing a versatile solution for real-world deployment at scale.

Datasets

We use the KITTI, Argoverse 2, and nuScenes datasets, along with in-the-wild YouTube videos. Please refer to their websites for dataset setup.

Dataset download links:

KITTI: The KITTI dataset can be downloaded from the official source here. All other datasets, after processing, will adhere to the same directory structure as the KITTI dataset.
Argoverse 2: The Argoverse 2 dataset can be downloaded from the official source here. Once downloaded, the subset corresponding to the VO task can be extracted using the provided script located in the data directory.
nuScenes: The nuScenes dataset can be downloaded from the official source here. Once downloaded, the subset corresponding to the VO task can be extracted using the provided script located in the data directory (an illustrative extraction sketch follows this list).
GTA V: The GTA dataset can be downloaded here. Once downloaded, please extract the contents by running the following command: unzip GTA.zip
YouTube: Approximately 50 hours of driving footage were selected from videos published on the YouTube channel J Utah, featuring a diverse range of driving scenarios. A more comprehensive list of driving videos from YouTube can be found here.
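
As an illustration of the nuScenes extraction step referenced above, the sketch below walks each scene's front-camera frames with the nuscenes-devkit. It is a rough, assumption-laden example only (the dataroot, the output layout, and the frames.txt file are placeholders); the script shipped in the data directory is the authoritative version.

# nusc_extract_sketch.py -- illustrative only; the actual extraction script lives in the data directory.
# The dataroot and the frames.txt output below are assumptions for this sketch.
import os
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version="v1.0-trainval", dataroot="/path/to/nuscenes", verbose=True)

for scene in nusc.scene:
    out_dir = os.path.join("data/nuScenes/sequences", scene["name"])
    os.makedirs(out_dir, exist_ok=True)
    frame_paths = []
    sample_token = scene["first_sample_token"]
    while sample_token:
        sample = nusc.get("sample", sample_token)
        cam = nusc.get("sample_data", sample["data"]["CAM_FRONT"])
        frame_paths.append(os.path.join(nusc.dataroot, cam["filename"]))
        sample_token = sample["next"]
    # record the front-camera image paths for this scene
    with open(os.path.join(out_dir, "frames.txt"), "w") as f:
        f.write("\n".join(frame_paths))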

The directory structure within the data folder is organized as follows:

data/ 
├── KITTI/
│   ├── kitti_est_intrs.json
│   ├── text_feature/ 
│   ├── depth_est_intrs/ 
│   ├── sequences/ 
│   └── poses/
├── Argoverse 2/
│   ├── stereo_front_left_est_intrs.json
│   ├── text_feature/ 
│   ├── depth_est_intrs/ 
│   ├── sequences/ 
│   └── poses/
├── nuScenes/
│   ├── cam_front_est_intrs.json
│   ├── text_feature/ 
│   ├── depth_est_intrs/ 
│   ├── sequences/ 
│   └── poses/
├── GTA/
│   ├── est_intrs.json
│   ├── text_feature/ 
│   ├── GTA_Depth_est_intrs/ 
│   ├── sequences/ 
│   └── poses/

The estimated camera intrinsics, metric depth, and text features are available for download here. Alternatively, users may regenerate these components using WildCamera for intrinsics estimation, Metric3Dv2 for metric depth prediction, and LLaVA-NeXT and SentenceTransformers for text feature extraction.
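
For the text features, a possible two-stage pipeline is to caption each frame with LLaVA-NeXT and then embed the captions with SentenceTransformers. The minimal sketch below covers only the embedding step and assumes the sentence-transformers package is installed; the embedding model name and the captions.txt / text_feature.npy paths are assumptions, not the exact configuration behind the released features.

# text_feature_sketch.py -- a minimal sketch of the embedding step only.
# The embedding model and the captions.txt / text_feature.npy paths are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

# one caption per frame, e.g. generated beforehand with LLaVA-NeXT
with open("captions.txt") as f:
    captions = [line.strip() for line in f]

# encode captions into an (N, D) array of sentence embeddings and save them
features = model.encode(captions, convert_to_numpy=True)
np.save("text_feature.npy", features.astype(np.float32))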

Environment Requirements and Installation

# create a new environment
conda create -n ZVO python=3.9
conda activate ZVO
# install pytorch
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c iopath iopath
# install pytorch3d
wget https://anaconda.org/pytorch3d/pytorch3d/0.7.5/download/linux-64/pytorch3d-0.7.5-py39_cu117_pyt201.tar.bz2
conda install pytorch3d-0.7.5-py39_cu117_pyt201.tar.bz2
rm pytorch3d-0.7.5-py39_cu117_pyt201.tar.bz2
# export CUDA 11.7 
export CUDA_HOME=/usr/local/cuda-11.7
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
pip install PyYAML==6.0.2 timm==1.0.16 matplotlib==3.5.3 pandas==2.3.0 opencv-python==4.11.0.86 a-unet==0.0.16 mmcv-full==1.7.2 numpy==1.26.4 pillow==11.0.0 av2==0.2.1 nuscenes-devkit==1.1.11

# from the repository root, move the customized vision transformer into the ZVO environment's timm package
mv vision_transformer_cross.py $CONDA_PREFIX/lib/python3.9/site-packages/timm/models/
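
After installation, a quick sanity check (a hypothetical snippet, not part of the repository) can confirm that the core dependencies and the relocated timm module import correctly:

# check_env.py -- hypothetical sanity check, not part of the repository
import torch
import pytorch3d
import timm
import cv2

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("pytorch3d:", pytorch3d.__version__)
print("timm:", timm.__version__)
print("opencv:", cv2.__version__)

# importable only after vision_transformer_cross.py has been moved into timm/models
from timm.models import vision_transformer_cross  # noqa: F401
print("vision_transformer_cross imported successfully")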

Training

  1. Install the correlation package
    The correlation package must be installed first:

    cd model/correlation_package
    python setup.py install
    
  2. Preprocess the dataset
    The labels are provided in the poses directory. To regenerate them or review the corresponding implementation details, refer to the code and run the following command (a sketch of the underlying pose format follows this list):

    python3 preprocess.py
    
  3. Download initial weights

    Download the initial weights to the init_weights directory. The initial weights can be found here.

  4. Run training

    Supervised Training on nuScenes OneNorth:

    # update params.py
    self.train_video = {'NUSC': nusc_scene_map['singapore-onenorth'],}
    self.checkpoint_path = 'saved_models/zvo_nusc_sl'
    

    Self-Training on nuScenes OneNorth and YouTube:

    # update params.py
    self.train_video = {
        'NUSC': nusc_scene_map['singapore-onenorth'],
        'YouTube': [str(i).zfill(2) for i in range(49)],
        }
    self.checkpoint_path = 'saved_models/zvo_nusc_ssl'
    

    and run:

    python3 main.py
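
The ground-truth labels follow the KITTI odometry convention, where each line of a pose file is a flattened 3x4 camera-to-world matrix. As a rough, hypothetical illustration of that format (not the repository's preprocess.py implementation), the sketch below converts absolute poses into frame-to-frame relative motions:

# relative_pose_sketch.py -- illustrative only; preprocess.py generates the actual labels.
import numpy as np

def load_kitti_poses(path):
    # each line holds a flattened 3x4 camera-to-world matrix (12 numbers)
    poses = []
    with open(path) as f:
        for line in f:
            mat = np.array(line.split(), dtype=np.float64).reshape(3, 4)
            pose = np.eye(4)
            pose[:3, :4] = mat
            poses.append(pose)
    return poses

def relative_motions(poses):
    # frame-to-frame motion: T_rel = inv(T_i) @ T_{i+1}
    return [np.linalg.inv(a) @ b for a, b in zip(poses[:-1], poses[1:])]

poses = load_kitti_poses("data/KITTI/poses/00.txt")
print(len(relative_motions(poses)), "relative motions")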
    

Test

Download the model checkpoints to the saved_models directory. The checkpoints can be found here.

We test on KITTI, Argoverse 2, the unseen regions of nuScenes, and GTA:

# update test_utils.py
args.model_path = "/saved_models/ZVO"
gta_scenes = sorted(os.listdir(args.data_path['GTA']+'/sequences/'))
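# uncomment one of the following testing sets: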
# args.testing_data = {'ARGO2_Stereo': {'ARGO2_Stereo': [str(i).zfill(3) for i in range(1000) if i not in argo2_stereo_remove]}}
# args.testing_data = {'NUSC_X': {'NUSC_X': nusc_scene_map['boston-seaport']+nusc_scene_map['singapore-queenstown']+nusc_scene_map['singapore-hollandvillage']}}
# args.testing_data = {'GTA': {'GTA': gta_scenes}}
# args.testing_data = {'KITTI': {'KITTI': ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10']}}
fast_test(args)

and run:

python3 test_utils.py

Evaluation

cd odom-eval
# update eval.py
eval_dirs = ['ZVO']

and run:

python3 eval.py

The VO evaluation tool is adapted from https://github.com/Huangying-Zhan/kitti-odom-eval.
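
The tool reports standard KITTI-style relative errors averaged over fixed-length subsequences. As a minimal, hypothetical sketch of the per-pair error it builds on (subsequence selection, optional scale alignment, and averaging are handled by the tool itself):

# pose_error_sketch.py -- hypothetical; the odom-eval tool handles subsequence
# selection, scale alignment, and averaging.
import numpy as np

def pose_error(gt_rel, pred_rel):
    # error transform between two 4x4 relative poses
    err = np.linalg.inv(gt_rel) @ pred_rel
    # rotation angle of the error (radians), via the trace identity
    d = 0.5 * (np.trace(err[:3, :3]) - 1.0)
    rot_err = np.arccos(np.clip(d, -1.0, 1.0))
    # translation magnitude of the error (meters)
    trans_err = np.linalg.norm(err[:3, 3])
    return rot_err, trans_err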

Inference

If you would like to use a trained model to generate predictions on new input data, we provide an inference.py script for this purpose.

Results

We show trajectory prediction results across the four most complex driving sequences (00, 02, 05, and 08) from the KITTI dataset. Each subplot illustrates the trajectories generated by our proposed model and the baseline models alongside the ground truth trajectory. The qualitative results demonstrate that our approach achieves the highest alignment with the ground truth, particularly in challenging turns and extended straight paths. These findings highlight the robustness of our method in handling complex and diverse driving scenarios.
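
To produce a similar plot for your own runs, the hypothetical snippet below draws x-z trajectories from KITTI-format pose files; it is not the figure code used for the paper, and the prediction path is a placeholder.

# plot_traj_sketch.py -- hypothetical plotting snippet; results/00_pred.txt is a placeholder path.
import numpy as np
import matplotlib.pyplot as plt

def load_xz(path):
    # KITTI pose files: flattened 3x4 matrices; translation sits in columns 3, 7, 11
    P = np.loadtxt(path)
    return P[:, 3], P[:, 11]  # x (right) and z (forward)

for name, path in [("Ground truth", "data/KITTI/poses/00.txt"), ("ZeroVO", "results/00_pred.txt")]:
    x, z = load_xz(path)
    plt.plot(x, z, label=name)
plt.xlabel("x (m)"); plt.ylabel("z (m)")
plt.axis("equal"); plt.legend()
plt.savefig("traj_00.png")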

Demo 1

Contact

If you have any questions or comments, please feel free to contact me at leilai@bu.edu.

License

Our work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
