feat: Longlive VACE FFLF pipeline layer support #287
base: main
Conversation
Signed-off-by: RyanOnTheInside <7623207+ryanontheinside@users.noreply.github.com>
Force-pushed from d2ca37b to 34fe179.
```python
# Transpose [B, F, C, H, W] -> [B, C, F, H, W] and concatenate along channel dim
inactive_out = vae.encode_to_latent(inactive_stacked, use_cache=use_cache)
reactive_out = vae.encode_to_latent(reactive_stacked, use_cache=False)
```
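For context on the point under review, here is a minimal stand-in sketch of the cache toggling in the snippet above. `StubVAE` is purely illustrative and only records the flag; the real `vae.encode_to_latent` lives in the scope pipeline and carries temporal state between chunks when `use_cache=True`.

```python
class StubVAE:
    """Illustrative stand-in: records the cache flag per encode call."""

    def __init__(self):
        self.calls = []

    def encode_to_latent(self, frames, use_cache):
        # The real encoder reuses temporal state across chunks when
        # use_cache=True; here we only log which path was taken.
        self.calls.append(use_cache)
        return frames  # identity stand-in for the latent encode


vae = StubVAE()
inactive_out = vae.encode_to_latent("inactive_stacked", use_cache=True)   # cached, per #283
reactive_out = vae.encode_to_latent("reactive_stacked", use_cache=False)  # this PR disables the cache
# vae.calls == [True, False]
```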
use_cache was set to True here in #283 to fix an issue with white flashing when using depth control videos.
- Is this change necessary for FFLF?
- If yes, we need a different solution for the white flashing issue with depth control videos.
Thanks for catching this. It is required for the gradual last frame use case, which I think is an important one. The temporal blending weakens the diffusion-driven transformation such that it looks more like simple latent interpolation than anything else.
output_extension_scale_0.00to1.00_weak_middle_8chunks.mp4
- Are you planning on addressing the white flashing issue with depth maps separately then?
- What are these videos showing? Is it a comparison of using cache and not using cache during encoding?
@ryanontheinside Following up on an offline convo:
To make clear the problem I see with setting use_cache=False for the reactive portion in this change: it looks like it would be a regression if this PR is merged as-is.
I used this as the input control video:
AnimateDiff_00003_scaled_5x.mp4
I applied this diff:
```diff
diff --git a/src/scope/core/pipelines/longlive/test_vace.py b/src/scope/core/pipelines/longlive/test_vace.py
index 2d81c60..a8b9036 100644
--- a/src/scope/core/pipelines/longlive/test_vace.py
+++ b/src/scope/core/pipelines/longlive/test_vace.py
@@ -45,16 +45,16 @@ from .pipeline import LongLivePipeline
 CONFIG = {
     # ===== MODE SELECTION =====
     "use_r2v": False, # Reference-to-Video: condition on reference images
-    "use_depth": False, # Depth guidance: structural control via depth maps
+    "use_depth": True, # Depth guidance: structural control via depth maps
     "use_inpainting": False, # Inpainting: masked video-to-video generation
-    "use_extension": True, # Extension mode: temporal generation (firstframe/lastframe/firstlastframe)
+    "use_extension": False, # Extension mode: temporal generation (firstframe/lastframe/firstlastframe)
     # ===== INPUT PATHS =====
     # R2V: List of reference image paths (condition entire video, don't appear in output)
     "ref_images": [
         "frontend/public/assets/example.png", # path/to/image.png
     ],
     # Depth: Path to depth map video (grayscale or RGB, will be converted)
-    "depth_video": "vace_tests/control_frames_depth.mp4", # path/to/depth_video.mp4
+    "depth_video": "vace_tests/AnimateDiff_00003_scaled_5x.mp4", # path/to/depth_video.mp4
     # Inpainting: Input video and mask video paths
     "input_video": "frontend/public/assets/test.mp4", # path/to/input_video.mp4
     "mask_video": "vace_tests/circle_mask.mp4", # path/to/mask_video.mp4
@@ -65,14 +65,14 @@ CONFIG = {
     # ===== GENERATION PARAMETERS =====
     "prompt": None, # Set to override mode-specific prompts, or None to use defaults
     "prompt_r2v": "", # Default prompt for R2V mode
-    "prompt_depth": "a cat walking towards the camera", # Default prompt for depth mode
+    "prompt_depth": "a woman dancing", # Default prompt for depth mode
     "prompt_inpainting": "a fireball", # Default prompt for inpainting mode
     "prompt_extension": "", # Default prompt for extension mode
-    "num_chunks": 2, # Number of generation chunks
+    "num_chunks": 50, # Number of generation chunks
     "frames_per_chunk": 12, # Frames per chunk (12 = 3 latent * 4 temporal upsample)
-    "height": 512,
-    "width": 512,
-    "vace_context_scale": 1.5, # VACE conditioning strength
+    "height": 480,
+    "width": 832,
+    "vace_context_scale": 1.0, # VACE conditioning strength
     # ===== INPAINTING SPECIFIC =====
     "mask_threshold": 0.5, # Threshold for binarizing mask (0-1)
     "mask_value": 127, # Gray value for masked regions (0-255)
@@ -490,7 +490,7 @@ def main():
     print("Initializing pipeline...")
```
I ran:
```
uv run -m scope.core.pipelines.longlive.test_vace
```
I got:
output_depth.mp4
Observe the white flashing effect throughout this output video.
```python
"extension_mode",
default=None,
type_hint=str,
description="Extension mode for temporal generation: 'firstframe' (ref at start, generate after), 'lastframe' (generate before, ref at end), or 'firstlastframe' (refs at both ends). Applies to specific chunks based on current_start_frame.",
```
Do we need an explicit concept of extension mode? Or can the mode be implicit based on whether first_frame_image and/or last_frame_image is provided?
E.g.
- If `first_frame_image` but no `last_frame_image`, then we start with `first_frame_image` and generate the rest.
- If `last_frame_image` but no `first_frame_image`, then we generate everything with `last_frame_image` at the end.
- If both are present, then start with `first_frame_image`, end with `last_frame_image`, and generate everything else.
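The cases above could be sketched as a small helper that derives the mode from which anchor images are present, instead of taking a separate extension_mode param. `infer_extension_mode` is a hypothetical name; the mode strings follow the parameter description in this PR.

```python
def infer_extension_mode(first_frame_image, last_frame_image):
    """Derive the extension mode implicitly from the provided anchor images."""
    if first_frame_image is not None and last_frame_image is not None:
        return "firstlastframe"  # refs at both ends
    if first_frame_image is not None:
        return "firstframe"      # ref at start, generate after
    if last_frame_image is not None:
        return "lastframe"       # generate before, ref at end
    return None                  # extension mode disabled


infer_extension_mode("first.png", None)        # -> "firstframe"
infer_extension_mode("first.png", "last.png")  # -> "firstlastframe"
```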
@ryanontheinside Following up on offline convo:
I have a preference for making the extension mode implicit based on whether `first_frame_image` and/or `last_frame_image` is provided, because it would simplify pipeline API usage by avoiding the need for an additional `extension_mode` param.
This adds first-frame/last-frame (FFLF) support for Longlive at the pipeline layer. API and UI support to follow.