SEVERE: fix main_media_paths and valid_video_paths out of sync during filtering, causing condition-video misalignment
#54
Sharing the fix for a bug that cost me a lot of time to find during fine-tuning. High severity, since it makes the results of fine-tuning unusable.
Issue

Location: `scripts/preprocess_dataset.py` > `scripts/process_videos.py` > `MediaDataset` > `_filter_valid_videos()`

Relevant for: fine-tuning on any dataset that has conditions and some videos that may get filtered out.

TL;DR: `MediaDataset` filters videos upon instantiation by removing from `self.video_paths` the file paths of videos that are, for instance, too short relative to the bucket size. `MediaDataset` also keeps `self.main_media_paths`, which in many cases is equal to `self.video_paths` and determines the save filepath of the computed VAE embedding. However, filtering only changes `self.video_paths`, NOT `self.main_media_paths`, so the embeddings of kept videos get saved under the names of filtered-out videos. This causes kept videos to be loaded with the conditioning of other videos, and the model may get confused by the lack of correlation between conditioning and video.

Simple example:
Raw videos in FT dataset:

- `a.mp4`
- `b.mp4`
- `c.mp4`

Precomputed text condition embeddings:

- `.precomputed/conditions/a.pt`
- `.precomputed/conditions/b.pt`
- `.precomputed/conditions/c.pt`

Then:

- `self.video_paths` before filtering = [`a.mp4`, `b.mp4`, `c.mp4`]
- `self.main_media_paths` before filtering = [`a.mp4`, `b.mp4`, `c.mp4`]
- `self.video_paths` AFTER filtering (`b.mp4` removed because it is too short) = [`a.mp4`, `c.mp4`]
- `self.main_media_paths` AFTER filtering (doesn't get changed) = [`a.mp4`, `b.mp4`, `c.mp4`]

`MediaDataset.__getitem__(index=1)` returns `{"video": video("c.mp4"), "relative_path": "c.mp4", "main_media_relative_path": "b.mp4"}`.

As a result, the embedding for `c.mp4` gets saved as `.precomputed/latents/b.pt`. This leads to video `c` being loaded with the caption for video `b` (`.precomputed/conditions/b.pt`) in `src/ltxv_trainer/datasets.py:PrecomputedDataset.__getitem__`.
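The misalignment can be reproduced with a few lines of plain Python. The paths and the filtering condition below are illustrative only, not the actual `MediaDataset` logic:

```python
# Standalone illustration of the index misalignment (paths and the filtering
# condition are illustrative; this is not the actual MediaDataset code).
video_paths = ["a.mp4", "b.mp4", "c.mp4"]
main_media_paths = ["a.mp4", "b.mp4", "c.mp4"]

# Filtering removes b.mp4 from video_paths only, mirroring the current behavior.
video_paths = [p for p in video_paths if p != "b.mp4"]

for index, video in enumerate(video_paths):
    # The latent computed for `video` is saved under main_media_paths[index],
    # which still indexes the unfiltered list.
    print(f"{video} -> saved as latent for {main_media_paths[index]}")

# Output:
# a.mp4 -> saved as latent for a.mp4
# c.mp4 -> saved as latent for b.mp4   <- mismatch
```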
Solution

Keep `self.main_media_paths` in sync with `self.video_paths` during filtering in `scripts/process_videos.py:MediaDataset._filter_valid_videos()`.
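A minimal sketch of such a fix is shown below. The method body and the `_is_valid_video()` helper are assumptions based on the description above, not the repository's actual code; the point is only that both lists must be filtered with the same mask:

```python
# Sketch of keeping both path lists in sync inside
# MediaDataset._filter_valid_videos(). The _is_valid_video() helper is
# hypothetical and stands in for the existing length/bucket checks.
def _filter_valid_videos(self) -> None:
    kept_video_paths = []
    kept_main_media_paths = []

    for video_path, main_media_path in zip(self.video_paths, self.main_media_paths):
        if self._is_valid_video(video_path):  # e.g. long enough for the bucket size
            kept_video_paths.append(video_path)
            kept_main_media_paths.append(main_media_path)

    # Filtering both lists together keeps index i pointing at the same sample
    # in video_paths and main_media_paths.
    self.video_paths = kept_video_paths
    self.main_media_paths = kept_main_media_paths
```

With both lists filtered together, `MediaDataset.__getitem__(index=1)` in the example above would return `"main_media_relative_path": "c.mp4"`, so the embedding for `c.mp4` would be saved as `.precomputed/latents/c.pt` and loaded with its own caption.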