🔥🔥🔥 News!!

  • Dec 26, 2025: 🔥 We have released the Yume-5B model. Additionally, we have released the Yume-1.5 paper, which introduces a new interactive world foundation model.
  • July 23, 2025: 🔥 We released Yume-1.0, the first fully open-source real-world world model (including data, training/inference code, and weights).

Yume: An Interactive World Generation Model

Yume is a long-term project that aims to create an interactive, realistic, and dynamic world through the input of text, images, or videos.

project page | arXiv | YUME-14B-540P | YUME-5B-720P | YouTube

This repository provides:

  • Distillation recipes for video DiT.
  • FramePack-like training code.
  • A long-video generation method with DDP/FSDP sampling support.

🔧 Installation

The code has been tested with Python 3.10.0, CUDA 12.1, and an A100 GPU.

./env_setup.sh fastvideo
pip install -r requirements.txt
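Before running the setup script, it can help to confirm that the tested tool versions are available; a minimal check (standard tooling only, nothing project-specific) might look like:

python --version   # expect Python 3.10.x
nvcc --version     # expect CUDA 12.1
nvidia-smi         # confirm an A100 (or comparable) GPU is visible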

You need to run pip install . after each code modification; alternatively, you can copy the modified files directly into your virtual environment. For example, if you modified wan/image2video.py and your virtual environment is named yume, copy the file to envs/yume/lib/python3.10/site-packages/wan/image2video.py.
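As a concrete sketch of the two options above (the environment name yume and the file wan/image2video.py are the examples used in this README; adjust the site-packages path to your own setup):

# Option 1: reinstall the package after editing source files
pip install .

# Option 2: copy the edited file directly into the installed package
cp wan/image2video.py envs/yume/lib/python3.10/site-packages/wan/image2video.py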

📦 Windows One-Click Install & Run

To facilitate the use and testing of Yume-5B, we provide a one-click solution for Windows to launch the Web Demo. Simply run run_oneclick_debug.bat and open the displayed URL in your browser.

This program has been successfully tested on an RTX 4090 Laptop GPU (16 GB). We recommend using a GPU with at least 16 GB of VRAM. Adjust the sampling steps between 4 and 50 based on your GPU performance; more steps yield better quality but slower generation.

📹 Demo

Watch the video

🚀 Inference

ODE

For image-to-video generation, we use --jpg_dir="./jpg" to specify the input image directory and --caption_path="./caption.txt" to provide the text conditioning, where each line corresponds to one generation instance that produces a 2-second video.

# Download the model weights and place them in Path_To_Yume.
bash scripts/inference/sample_jpg.sh 
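A minimal sketch of what the inputs look like before launching the script (illustrative layout; the --jpg_dir and --caption_path values are consumed by the Python entry point that sample_jpg.sh invokes, so edit the script if you need different paths):

ls ./jpg                              # conditioning images
cat ./caption.txt                     # one caption per line; each line yields one 2-second clip
bash scripts/inference/sample_jpg.sh  # runs with --jpg_dir="./jpg" --caption_path="./caption.txt"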

We also support generating videos from the example data in ./val, where --test_data_dir="./val" specifies its location.

# Download the model weights and place them in Path_To_Yume.
bash scripts/inference/sample.sh 
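As a quick sanity check (a sketch, assuming ./val is laid out as shipped in the repository):

ls ./val                           # inspect the bundled example data
bash scripts/inference/sample.sh   # runs with --test_data_dir="./val"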

SDE

We perform TTS sampling, where args.sde controls whether to use SDE-based sampling.

# Download the model weights and place them in Path_To_Yume.
bash scripts/inference/sample_tts.sh 
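A hedged sketch of switching between the two samplers; whether sample_tts.sh forwards a command-line flag for args.sde or expects the value to be edited inside the script has not been verified here, so treat the flag spelling as an assumption:

bash scripts/inference/sample_tts.sh --sde   # SDE-based sampling (flag name assumed from args.sde)
bash scripts/inference/sample_tts.sh         # default sampling with args.sde disabled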

For optimal results, we recommend keeping Actual distance, Angular change rate (turn speed), and View rotation speed within the range of 0.1 to 10.

Key adjustment guidelines:

  1. When executing Camera remains still (·), reduce the Actual distance value.
  2. When executing Person stands still, decrease both the Angular change rate and View rotation speed values.

Note that these parameters (Actual distance, Angular change rate, and View rotation speed) do affect generation results. Alternatively, you can remove them entirely for simpler operation.

5B

We perform sampling using the 5B model. First, download the weights from Hugging Face and place them in the current directory under ./Yume-5B-720p. args.T2V controls whether the model runs in text-to-video mode, and args.prompt specifies the input caption.

# Download the Yume-5B weights and place them in ./Yume-5B-720p.
bash scripts/inference/sample_tts.sh 
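A hedged sketch of a text-to-video run with the 5B model; the --T2V and --prompt spellings are assumptions based on args.T2V and args.prompt above, so set them inside scripts/inference/sample_tts.sh if the script does not forward flags:

# Flag names assumed from args.T2V / args.prompt; prompt text is illustrative
bash scripts/inference/sample_tts.sh --T2V --prompt "Walking forward along a narrow street at dusk"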

🎯 Training & Distill

For model training, we use args.MVDT to launch the MVDT framework, which requires at least 16 A100 GPUs. Loading T5 onto the CPU may help conserve GPU memory. We employ args.Distil to enable adversarial distillation.

# Download the model weights and place them in Path_To_Yume.
bash scripts/finetune/finetune.sh
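A hedged sketch of the two training modes described above; the --MVDT and --Distil spellings are assumptions based on args.MVDT and args.Distil, so set them inside scripts/finetune/finetune.sh if the script does not forward flags:

# MVDT training (requires at least 16 A100 GPUs; keeping T5 on the CPU saves GPU memory)
bash scripts/finetune/finetune.sh --MVDT

# MVDT training with adversarial distillation enabled
bash scripts/finetune/finetune.sh --MVDT --Distil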

🧱 Dataset Preparation

decode_camera_controls_from_c2w_sequence.py converts camera trajectories into keyboard directional controls. Please refer to https://github.com/Lixsp11/sekai-codebase to download the dataset. For the processed data format, refer to ./test_video.

path_to_processed_dataset_folder/
├── Keys_None_Mouse_Down/
│   ├── video_id.mp4
│   └── video_id.txt
├── Keys_None_Mouse_Up/
├── ...
└── Keys_S_Mouse_·/

Each provided TXT file records either camera motion control parameters or animation keyframe data, with the following field definitions:

Start Frame: 2        # starting frame number (begins at frame 2 of the original video)
End Frame: 50         # ending frame number
Duration: 49 frames   # total duration
Keys: W               # keyboard input
Mouse: ↓              # mouse action

In scripts/finetune/finetune.sh, args.root_dir should be set to path_to_processed_dataset_folder, i.e. the full path to the processed Sekai dataset.
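For example (a sketch only; whether root_dir is forwarded as a flag or edited inside the script is an assumption):

# Point root_dir at the processed dataset produced above (path is illustrative)
bash scripts/finetune/finetune.sh --root_dir="./path_to_processed_dataset_folder"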

📑 Development Plan

  • Dataset processing
    • Providing processed datasets
  • Code update
    • fp8 support
    • Better distillation methods
  • Model update
    • Quantized and distilled models
    • Models for 720p resolution generation

🤝 Contributing

We welcome all contributions.

Acknowledgement

We learned and reused code from the following projects:

Citation

If you use Yume for your research, please cite our paper:

@article{mao2025yume,
  title={Yume: An Interactive World Generation Model},
  author={Mao, Xiaofeng and Lin, Shaoheng and Li, Zhen and Li, Chuanhao and Peng, Wenshuo and He, Tong and Pang, Jiangmiao and Chi, Mingmin and Qiao, Yu and Zhang, Kaipeng},
  journal={arXiv preprint arXiv:2507.17744},
  year={2025}
}
