SEVERE: fix main_media_paths and valid_video_paths out of sync during filtering, causing condition-video misalignment
#54
Sharing the fix for a bug that cost me a lot of time to find during fine-tuning. High severity, since it makes the results of fine-tuning unusable.
Issue

Location: `scripts/preprocess_dataset.py` > `scripts/process_videos.py` > `MediaDataset` > `_filter_valid_videos()`

Relevant for: fine-tuning on any dataset that has conditions and some videos that may get filtered out.

TL;DR: `MediaDataset` filters videos upon instantiation by removing from `self.video_paths` the file paths of videos that are, for instance, too short relative to the bucket size. `MediaDataset` also keeps `self.main_media_paths`, which in many cases is equal to `self.video_paths` and determines the save filepath of the computed VAE embedding. However, filtering only changes `self.video_paths`, NOT `self.main_media_paths`, so the embeddings of kept videos get saved under the names of filtered-out videos. This causes kept videos to be loaded with the conditioning of other videos, and the model may get confused by the lack of correlation between conditioning and video.

Simple example:
Raw videos in FT dataset:

- `a.mp4`
- `b.mp4`
- `c.mp4`

Precomputed text condition embeddings:

- `.precomputed/conditions/a.pt`
- `.precomputed/conditions/b.pt`
- `.precomputed/conditions/c.pt`

Then:

- `self.video_paths` before filtering = [`a.mp4`, `b.mp4`, `c.mp4`]
- `self.main_media_paths` before filtering = [`a.mp4`, `b.mp4`, `c.mp4`]
- `self.video_paths` AFTER filtering (`b.mp4` removed because it is too short) = [`a.mp4`, `c.mp4`]
- `self.main_media_paths` AFTER filtering (doesn't get changed) = [`a.mp4`, `b.mp4`, `c.mp4`]

`MediaDataset.__getitem__(index=1)` returns `{"video": video("c.mp4"), "relative_path": "c.mp4", "main_media_relative_path": "b.mp4"}`.

As a result, the embedding for `c.mp4` gets saved as `.precomputed/latents/b.pt`. This leads to video `c` being loaded with the caption for video `b` (`.precomputed/conditions/b.pt`) in `src/ltxv_trainer/datasets.py:PrecomputedDataset.__getitem__`.
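The misalignment can be reproduced with a few lines of plain Python. The paths and the filtering condition below are illustrative only, not the actual `MediaDataset` logic:

```python
# Standalone illustration of the index misalignment (paths and the filtering
# condition are illustrative; this is not the actual MediaDataset code).
video_paths = ["a.mp4", "b.mp4", "c.mp4"]
main_media_paths = ["a.mp4", "b.mp4", "c.mp4"]

# Filtering removes b.mp4 from video_paths only, mirroring the current behavior.
video_paths = [p for p in video_paths if p != "b.mp4"]

for index, video in enumerate(video_paths):
    # The latent computed for `video` is saved under main_media_paths[index],
    # which still indexes the unfiltered list.
    print(f"{video} -> saved as latent for {main_media_paths[index]}")

# Output:
# a.mp4 -> saved as latent for a.mp4
# c.mp4 -> saved as latent for b.mp4   <- mismatch
```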
Solution

Keep `self.main_media_paths` in sync with `self.video_paths` during filtering in `scripts/process_videos.py:MediaDataset._filter_valid_videos()`.
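A minimal sketch of such a fix is shown below. The method body and the `_is_valid_video()` helper are assumptions based on the description above, not the repository's actual code; the point is only that both lists must be filtered with the same mask:

```python
# Sketch of keeping both path lists in sync inside
# MediaDataset._filter_valid_videos(). The _is_valid_video() helper is
# hypothetical and stands in for the existing length/bucket checks.
def _filter_valid_videos(self) -> None:
    kept_video_paths = []
    kept_main_media_paths = []

    for video_path, main_media_path in zip(self.video_paths, self.main_media_paths):
        if self._is_valid_video(video_path):  # e.g. long enough for the bucket size
            kept_video_paths.append(video_path)
            kept_main_media_paths.append(main_media_path)

    # Filtering both lists together keeps index i pointing at the same sample
    # in video_paths and main_media_paths.
    self.video_paths = kept_video_paths
    self.main_media_paths = kept_main_media_paths
```

With both lists filtered together, `MediaDataset.__getitem__(index=1)` in the example above would return `"main_media_relative_path": "c.mp4"`, so the embedding for `c.mp4` would be saved as `.precomputed/latents/c.pt` and loaded with its own caption.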