
Conversation

@ryanontheinside
Collaborator

This adds FFLF (first-frame/last-frame) support for Longlive at the pipeline layer. API and UI support will follow.

Signed-off-by: RyanOnTheInside <7623207+ryanontheinside@users.noreply.github.com>
# Transpose [B, F, C, H, W] -> [B, C, F, H, W] and concatenate along channel dim

inactive_out = vae.encode_to_latent(inactive_stacked, use_cache=use_cache)
reactive_out = vae.encode_to_latent(reactive_stacked, use_cache=False)
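For context on the transpose-and-concatenate comment above, here is a minimal sketch of the shape manipulation, with numpy standing in for the pipeline's tensor library and illustrative dimension sizes (the real `B`, `F`, `C`, `H`, `W` come from the pipeline):

```python
import numpy as np

# Illustrative sizes only: B=1 batch, F=12 frames, C=3 channels, H=W=8.
B, F, C, H, W = 1, 12, 3, 8, 8
inactive = np.zeros((B, F, C, H, W), dtype=np.float32)
reactive = np.ones((B, F, C, H, W), dtype=np.float32)

# Transpose [B, F, C, H, W] -> [B, C, F, H, W] so the VAE sees a
# channels-first video layout.
inactive_stacked = inactive.transpose(0, 2, 1, 3, 4)
reactive_stacked = reactive.transpose(0, 2, 1, 3, 4)

# Concatenate along the channel dim (axis=1 after the transpose).
combined = np.concatenate([inactive_stacked, reactive_stacked], axis=1)

print(combined.shape)  # (1, 6, 12, 8, 8)
```

In torch the equivalent calls would be `tensor.permute(0, 2, 1, 3, 4)` and `torch.cat([...], dim=1)`.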
Contributor


use_cache was set to True here in #283 to fix an issue with white flashing when using depth control videos.

  1. Is this change necessary for FFLF?
  2. If yes, we need a different solution for the white flashing issue with depth control videos.

Collaborator Author


Thanks for catching this. It is required for the gradual last frame use case, which I think is an important one. The temporal blending weakens the diffusion-driven transformation such that it looks more like simple latent interpolation than anything else.

output_extension_scale_0.00to1.00_weak_middle_8chunks.mp4
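To illustrate what degenerating toward "simple latent interpolation" means here, a minimal sketch of naive temporal blending between two endpoint latents; shapes and values are illustrative, with numpy standing in for the pipeline's latent tensors:

```python
import numpy as np

# Hypothetical first/last-frame latents; shapes are illustrative only.
first_latent = np.zeros((4, 8, 8), dtype=np.float32)
last_latent = np.ones((4, 8, 8), dtype=np.float32)

num_chunks = 8
# Linear blend weights from 0.0 (all first frame) to 1.0 (all last frame).
weights = np.linspace(0.0, 1.0, num_chunks)

# Naive temporal blending: each chunk's latent is a lerp between the
# endpoints. Without the diffusion-driven transformation, the output
# looks like this crossfade rather than generated motion.
blended = [(1.0 - w) * first_latent + w * last_latent for w in weights]

print(blended[0].mean(), blended[-1].mean())  # 0.0 1.0
```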

Contributor


  1. Are you planning on addressing the white flashing issue with depth maps separately then?
  2. What are these videos showing? Is it a comparison of using cache and not using cache during encoding?

Contributor

@yondonfu Jan 2, 2026


@ryanontheinside Following up on an offline convo:

I want to make clear the problem I see with setting use_cache=False for the reactive portion in your change: it looks like it would be a regression if this PR is merged as-is.

I used this as the input control video:

AnimateDiff_00003_scaled_5x.mp4

I applied this diff:

diff --git a/src/scope/core/pipelines/longlive/test_vace.py b/src/scope/core/pipelines/longlive/test_vace.py
index 2d81c60..a8b9036 100644
--- a/src/scope/core/pipelines/longlive/test_vace.py
+++ b/src/scope/core/pipelines/longlive/test_vace.py
@@ -45,16 +45,16 @@ from .pipeline import LongLivePipeline
 CONFIG = {
     # ===== MODE SELECTION =====
     "use_r2v": False,  # Reference-to-Video: condition on reference images
-    "use_depth": False,  # Depth guidance: structural control via depth maps
+    "use_depth": True,  # Depth guidance: structural control via depth maps
     "use_inpainting": False,  # Inpainting: masked video-to-video generation
-    "use_extension": True,  # Extension mode: temporal generation (firstframe/lastframe/firstlastframe)
+    "use_extension": False,  # Extension mode: temporal generation (firstframe/lastframe/firstlastframe)
     # ===== INPUT PATHS =====
     # R2V: List of reference image paths (condition entire video, don't appear in output)
     "ref_images": [
         "frontend/public/assets/example.png",  # path/to/image.png
     ],
     # Depth: Path to depth map video (grayscale or RGB, will be converted)
-    "depth_video": "vace_tests/control_frames_depth.mp4",  # path/to/depth_video.mp4
+    "depth_video": "vace_tests/AnimateDiff_00003_scaled_5x.mp4",  # path/to/depth_video.mp4
     # Inpainting: Input video and mask video paths
     "input_video": "frontend/public/assets/test.mp4",  # path/to/input_video.mp4
     "mask_video": "vace_tests/circle_mask.mp4",  # path/to/mask_video.mp4
@@ -65,14 +65,14 @@ CONFIG = {
     # ===== GENERATION PARAMETERS =====
     "prompt": None,  # Set to override mode-specific prompts, or None to use defaults
     "prompt_r2v": "",  # Default prompt for R2V mode
-    "prompt_depth": "a cat walking towards the camera",  # Default prompt for depth mode
+    "prompt_depth": "a woman dancing",  # Default prompt for depth mode
     "prompt_inpainting": "a fireball",  # Default prompt for inpainting mode
     "prompt_extension": "",  # Default prompt for extension mode
-    "num_chunks": 2,  # Number of generation chunks
+    "num_chunks": 50,  # Number of generation chunks
     "frames_per_chunk": 12,  # Frames per chunk (12 = 3 latent * 4 temporal upsample)
-    "height": 512,
-    "width": 512,
-    "vace_context_scale": 1.5,  # VACE conditioning strength
+    "height": 480,
+    "width": 832,
+    "vace_context_scale": 1.0,  # VACE conditioning strength
     # ===== INPAINTING SPECIFIC =====
     "mask_threshold": 0.5,  # Threshold for binarizing mask (0-1)
     "mask_value": 127,  # Gray value for masked regions (0-255)
@@ -490,7 +490,7 @@ def main():
     print("Initializing pipeline...")

I ran:

uv run -m scope.core.pipelines.longlive.test_vace

I got:

output_depth.mp4

Observe the white flashing effect throughout this output video.

"extension_mode",
default=None,
type_hint=str,
description="Extension mode for temporal generation: 'firstframe' (ref at start, generate after), 'lastframe' (generate before, ref at end), or 'firstlastframe' (refs at both ends). Applies to specific chunks based on current_start_frame.",
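A minimal sketch of how an explicit mode might map to per-chunk conditioning based on current_start_frame, as the parameter description suggests; the helper name and the first/last-chunk logic are hypothetical, not the pipeline's actual implementation:

```python
def frames_to_condition(extension_mode, current_start_frame,
                        frames_per_chunk, total_frames):
    """Decide which reference frames apply to the current chunk.
    Hypothetical helper sketching the param description above."""
    is_first_chunk = current_start_frame == 0
    is_last_chunk = current_start_frame + frames_per_chunk >= total_frames

    # 'firstframe'/'firstlastframe' condition the first chunk on the
    # first-frame ref; 'lastframe'/'firstlastframe' condition the last
    # chunk on the last-frame ref.
    condition_first = (extension_mode in ("firstframe", "firstlastframe")
                       and is_first_chunk)
    condition_last = (extension_mode in ("lastframe", "firstlastframe")
                      and is_last_chunk)
    return condition_first, condition_last

print(frames_to_condition("firstlastframe", 0, 12, 96))   # (True, False)
print(frames_to_condition("firstlastframe", 84, 12, 96))  # (False, True)
```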
Contributor


Do we need an explicit concept of extension mode? Or can the mode be implicit based on whether first_frame_image and/or last_frame_image is provided?

E.g.:

  1. If first_frame_image is provided but no last_frame_image, start with first_frame_image and generate the rest.
  2. If last_frame_image is provided but no first_frame_image, generate everything leading up to last_frame_image at the end.
  3. If both are provided, start with first_frame_image, end with last_frame_image, and generate everything in between.
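The implicit rules above can be sketched as a small helper; the function name and return values are hypothetical, mirroring the mode strings from the parameter description rather than the actual pipeline API:

```python
def infer_extension_mode(first_frame_image=None, last_frame_image=None):
    """Derive the extension mode from which reference images are provided.
    Sketch of the reviewer's suggestion, not the pipeline's actual API."""
    if first_frame_image is not None and last_frame_image is not None:
        return "firstlastframe"  # refs at both ends, generate in between
    if first_frame_image is not None:
        return "firstframe"      # ref at start, generate after
    if last_frame_image is not None:
        return "lastframe"       # generate before, ref at end
    return None                  # no extension conditioning requested

print(infer_extension_mode(first_frame_image="first.png"))
# firstframe
print(infer_extension_mode(first_frame_image="first.png",
                           last_frame_image="last.png"))
# firstlastframe
```

This would let the caller drop the extra extension_mode param entirely.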

Contributor

@yondonfu Jan 2, 2026


@ryanontheinside Following up on offline convo:

My preference is to make the extension mode implicit based on whether first_frame_image and/or last_frame_image is provided, because it simplifies pipeline API usage by avoiding an additional extension_mode param.

Signed-off-by: RyanOnTheInside <7623207+ryanontheinside@users.noreply.github.com>
