VideoGen-of-Thought
A novel approach to the generation of long, consistently structured, and visually homogeneous video content.
Generating long, coherent, and visually consistent videos is a complex problem, requiring sophisticated modeling of both content and temporal coherence. Current video generation models often struggle with maintaining consistency across different frames, particularly when generating long videos. To address this challenge, we present VideoGen-of-Thought, a four-module framework designed to generate long videos by dividing the video generation process into manageable stages: script generation, keyframe production, shot-level video creation, and video smoothing. By leveraging large language models (LLMs) and advanced image and video generation techniques, we develop a structured approach to generate high-quality videos with a clear narrative over extended periods.
Our approach is inspired by the "thought process" behind the creation of long videos, as if an AI-driven movie director constructs the scenes step-by-step. Each module operates sequentially, with Module-1 generating textual descriptions of each video shot, Module-2 generating keyframes, Module-3 converting those keyframes into videos, and Module-4 handling transitions and camera adjustments. Each module is modeled mathematically to ensure the consistency of avatars, environments, and transitions across the entire video.
The first module in the VideoGen-of-Thought framework is the Script Generation Module, responsible for converting a high-level story input into fine-grained textual prompts, each corresponding to a short 2-5 second segment of the final video. This module ensures that the video narrative is well-structured, with each scene described in sufficient detail to guide the subsequent keyframe and video generation steps.
The input to this module is a simple text-based story description; from it, the module generates a series of shot scripts, each capturing a different part of the narrative. For each story segment s_i, the large language model (LLM) generates detailed shot descriptions covering five key domains: character, background, relation, camera pose, and HDR description. Each aspect is denoted by the following symbols:
p_i: Character description
b_i: Background description
r_i: Relation between characters and elements in the scene
c_i: Camera pose description
h_i: HDR lighting and tone details
Additionally, the LLM will specify an image path for each character's face ID to ensure consistency in the generated keyframes. The script generation task is modeled as a sequence generation problem using a pre-trained large language model (LLM). Given an input story S, represented as a sequence of story segments S = {s_1, s_2, ..., s_n}, where each s_i corresponds to a specific section of the narrative, the module generates a set of descriptive prompts P = {p_1, p_2, ..., p_m}. Each p_j is a textual description of a video shot, capturing the five domains mentioned above.
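To make the structure of these shot scripts concrete, the following is a minimal sketch of how a single entry of P could be represented in code. The class name, field names, and example content are illustrative assumptions, not part of the framework's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ShotScript:
    """One fine-grained shot description produced by Module 1 (illustrative schema)."""
    character: str        # p_i: who appears in the shot and how they look
    background: str       # b_i: setting / environment description
    relation: str         # r_i: how characters and scene elements relate
    camera_pose: str      # c_i: camera framing and movement
    hdr: str              # h_i: HDR lighting and tone details
    face_id_paths: dict   # character name -> image path used for identity preservation
    duration_s: float     # t_j: length of the shot in seconds (typically 2-5 s)

# Hypothetical example entry:
example = ShotScript(
    character="An elderly fisherman with a weathered face and a yellow raincoat",
    background="A small wooden pier at dawn, fog rolling over the water",
    relation="The fisherman stands at the edge of the pier, facing the sea",
    camera_pose="Slow push-in from a low angle",
    hdr="Soft, cool morning light with muted highlights",
    face_id_paths={"fisherman": "faces/fisherman.png"},
    duration_s=4.0,
)
```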
This process can be formalized as:

(p_j, t_j) = M_{LLM}(s_i, c_i, b_i, r_i, p_i, h_i),

where t_j represents the time interval for the j-th shot, and M_{LLM} is the large language model responsible for generating the prompt based on the story segment s_i and the shot-specific details c_i, b_i, r_i, p_i, and h_i. The generated prompt sequence P serves as the input for the keyframe generation process in Module 2.
The LLM is conditioned on both the current story segment s_i and the previously generated prompt p_{j-1} to maintain narrative coherence and avoid abrupt transitions between scenes. The algorithm for script generation is therefore iterative, producing one prompt per video shot; a sketch of this loop is given below.
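The following is a minimal sketch of this iterative loop, assuming a generic call_llm callable as the LLM interface; the function name, prompt template, and conditioning format are illustrative and do not reproduce the framework's actual implementation.

```python
def generate_shot_prompts(story_segments, call_llm):
    """Iteratively produce one shot prompt per story segment, conditioning each
    call on the previously generated prompt to keep the narrative coherent
    (illustrative sketch, not the framework's actual prompting scheme)."""
    prompts = []
    previous = ""
    for segment in story_segments:
        instruction = (
            "You are an AI movie director. Expand the story segment below into a "
            "single 2-5 second shot description covering five domains: character, "
            "background, relation, camera pose, and HDR lighting. Stay consistent "
            f"with the previous shot.\n\nPrevious shot: {previous}\n"
            f"Story segment: {segment}"
        )
        shot_prompt = call_llm(instruction)  # returns the textual shot description p_j
        prompts.append(shot_prompt)
        previous = shot_prompt               # condition the next call on this prompt
    return prompts

# Usage (hypothetical LLM callable):
# prompts = generate_shot_prompts(["Segment 1 ...", "Segment 2 ..."], call_llm=my_llm)
```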
By breaking the story into these structured time intervals and shot descriptions, the Script Generation Module ensures a detailed and coherent narrative structure, which forms the foundation for generating visually consistent and temporally coherent videos in the subsequent stages.
The second module in the VideoGen-of-Thought framework is the Keyframe Generation Module, responsible for converting the textual descriptions from Module 1 into visual keyframes. These keyframes act as essential anchors for the subsequent video generation process, ensuring that each shot is visually represented with consistency. The main challenge of this module lies in maintaining visual coherence across keyframes, especially in terms of the appearance of characters (avatars), backgrounds, and other significant elements throughout the video.
We address this challenge by combining a text-to-image generation process based on diffusion models with an identity-preserving (IP) embedding strategy inspired by the IP-Adapter model. This ensures that the generated keyframes remain visually consistent across consecutive shots.
Diffusion Models for Keyframe Generation. The keyframe generation process in the VideoGen-of-Thought framework uses denoising diffusion probabilistic models (DDPMs), which are highly effective for image generation tasks. A diffusion model progressively corrupts an image x_0 by adding Gaussian noise over T time steps, resulting in a noisy latent variable x_T. The reverse process then denoises this noisy sample back into the clean image x_0.
In the forward process, Gaussian noise is added to the image x_0 at each step, governed by the transition distribution:

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I),

where \beta_t is the noise variance at step t, and I is the identity matrix. Over T steps, this process transforms the initial image into pure noise.
In the reverse process, the model learns to recover the clean image x_0 by predicting and removing the added noise at each step. The reverse transition is defined as:

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)),

where \mu_\theta and \Sigma_\theta are the predicted mean and variance. During training, the model minimizes the difference between the actual and predicted noise using the following loss:

L_{DDPM} = \mathbb{E}_{x_0, \epsilon, t} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right],

where \epsilon is the Gaussian noise added in the forward process and \epsilon_\theta(x_t, t) is the noise predicted by the model.
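For reference, the sketch below implements the closed-form forward noising step and the noise-prediction loss in plain PyTorch. The tiny placeholder denoiser and the chosen noise schedule are only there to make the objective concrete; they do not reproduce the model used in the framework.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise variances beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t = prod_s (1 - beta_s)

def forward_noise(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form (equivalent to composing the per-step transitions)."""
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return x_t, eps

def ddpm_loss(denoiser, x0):
    """L = E[ || eps - eps_theta(x_t, t) ||^2 ] for a batch of clean images x0."""
    t = torch.randint(0, T, (x0.shape[0],))
    x_t, eps = forward_noise(x0, t)
    eps_pred = denoiser(x_t, t)
    return F.mse_loss(eps_pred, eps)

# Toy usage with a placeholder denoiser that ignores t (illustration only):
denoiser = lambda x_t, t: torch.zeros_like(x_t)
loss = ddpm_loss(denoiser, torch.randn(2, 3, 64, 64))
```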
Given a textual prompt p_j, the keyframe I_j is generated using a pre-trained text-to-image model M_{T2I}, conditioned on the text prompt p_j. This process can be formalized as:
where x_0 is the clean image output from the diffusion process after denoising.
IP Embedding with Decoupled Cross-Attention Mechanism. To maintain visual consistency across keyframes, we incorporate an identity-preserving (IP) embedding strategy, inspired by the IP-Adapter model. This approach ensures that recurring visual elements, such as avatars, retain consistent appearances across multiple shots through a decoupled cross-attention mechanism.
In typical diffusion models, cross-attention layers integrate text features into the image generation process. Let Q, K, and V represent the query, key, and value matrices. The attention mechanism, where Q is derived from the image features and K and V from the text features, is defined as:

Attention(Q, K, V) = softmax(QK^T / \sqrt{d}) V,

where d is the dimensionality of the key vectors.
To enhance visual coherence, we decouple the cross-attention by introducing an additional attention block that considers the visual features of the previous keyframe. This mechanism ensures that each keyframe not only matches the text prompt but also maintains visual consistency with the previous frame. The updated attention can be written as:

Z = Attention(Q, K, V) + \lambda \cdot Attention(Q, K', V'),

where K' and V' represent the key and value matrices derived from the visual features of the previous keyframe, and \lambda balances the influence between the current prompt and the previous frame. This approach ensures smooth transitions in the visual representation of characters, backgrounds, and objects.
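A minimal PyTorch sketch of this decoupled cross-attention is shown below, assuming the text features and the previous keyframe's image features have already been projected to key/value matrices of matching hidden size; the tensor shapes and the value of lambda are illustrative.

```python
import math
import torch

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V"""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)
    return scores.softmax(dim=-1) @ V

def decoupled_cross_attention(Q, K_text, V_text, K_prev, V_prev, lam=0.5):
    """Combine text-conditioned attention with attention over the previous
    keyframe's visual features, weighted by lambda (illustrative sketch)."""
    return attention(Q, K_text, V_text) + lam * attention(Q, K_prev, V_prev)

# Toy shapes: batch=1, 16 image-query tokens, 8 text tokens, 8 previous-frame tokens, dim=64
Q = torch.randn(1, 16, 64)
K_text, V_text = torch.randn(1, 8, 64), torch.randn(1, 8, 64)
K_prev, V_prev = torch.randn(1, 8, 64), torch.randn(1, 8, 64)
out = decoupled_cross_attention(Q, K_text, V_text, K_prev, V_prev, lam=0.3)  # (1, 16, 64)
```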
Keyframe Generation Process. The function of the Keyframe Generation Module is to create visual keyframes that correspond to the textual prompts generated by the script in Module 1. This is achieved by leveraging the text-to-image diffusion model and the IP embedding strategy to ensure consistency across keyframes.
The keyframe generation algorithm processes the sequence of prompts P = {p_1, p_2, ..., p_m}, generating keyframes while ensuring identity-preserving consistency between consecutive shots. Mathematically, let P = {p_1, p_2, ..., p_m} represent the sequence of prompts, and I = {I_1, I_2, ..., I_m} represent the corresponding keyframes. The generation of each keyframe I_j is formalized as:

I_j = M_{T2I}(p_j, I_{j-1}).
The decoupled cross-attention mechanism aligns both the textual description and the visual features from the previous keyframe, ensuring visual coherence across consecutive shots. By combining the text-to-image diffusion model with the identity-preserving embedding strategy and decoupled cross-attention, the Keyframe Generation Module produces high-quality, visually consistent keyframes for the subsequent video creation process.
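Putting the pieces together, a sketch of the keyframe generation loop might look like the following. Here generate_keyframe stands in for the text-to-image diffusion model with IP-style conditioning; its name and signature are assumptions for illustration rather than the framework's actual interface.

```python
def generate_keyframes(prompts, generate_keyframe, face_id_images=None):
    """Produce one keyframe per prompt, conditioning each generation on the
    previous keyframe so avatars and scenes stay consistent (illustrative sketch).

    generate_keyframe(prompt, prev_keyframe, face_id_images) -> image
    """
    keyframes = []
    prev = None
    for prompt in prompts:
        frame = generate_keyframe(prompt, prev_keyframe=prev, face_id_images=face_id_images)
        keyframes.append(frame)
        prev = frame  # the next shot attends to this frame via decoupled cross-attention
    return keyframes
```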
The third module in the VideoGen-of-Thought framework is the Shot-Level Video Generation Module, responsible for converting the keyframes generated in Module 2 into short video segments. Each segment, typically lasting 2-5 seconds, corresponds to a shot in the final long video. The challenge of this module is to ensure temporal consistency, i.e., generating smooth motion between frames within each shot while maintaining coherence established by the keyframes.
To achieve realism and fluency within each shot, we introduce a camera embedding mechanism that integrates camera movement information into video generation. The camera movement is represented by the parameters c_{cam} = [c_x, c_y, c_z], where c_x, c_y, and c_z are the horizontal pan, vertical tilt, and zoom ratios, respectively, as provided by each prompt p_j from Module 1. These parameters are encoded into a camera embedding e_{cam}, which is then incorporated into the video generation process through temporal cross-attention. This embedding ensures that camera movements are smoothly integrated into the shot generation process.
To ensure consistency and smooth transitions between frames, we use a camera-conditioned loss. The training objective is to minimize the difference between the predicted and actual noise, conditioned on both the camera embedding and the frame at time step t. The loss is formulated as:

L_{cam} = \mathbb{E}_{x_0, \epsilon, t} \left[ \| \epsilon - \epsilon_\theta(x_t, t, e_{cam}) \|^2 \right],

where x_t is the noisy frame at time step t, and \epsilon_\theta is the noise predicted by the model at step t, conditioned on the camera embedding e_{cam} derived from the camera movement parameters c_{cam}. This loss encourages smooth motion between consecutive frames by taking camera movement into account.
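A sketch of how the camera parameters could be embedded and fed into the noise-prediction objective is shown below; the small MLP encoder and the way the embedding is passed to the denoiser are assumptions made for illustration, not the framework's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraEncoder(nn.Module):
    """Map c_cam = [c_x, c_y, c_z] (pan, tilt, zoom) to an embedding e_cam (illustrative)."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, c_cam):
        return self.net(c_cam)

def camera_conditioned_loss(denoiser, cam_encoder, x_t, t, eps, c_cam):
    """E[ || eps - eps_theta(x_t, t, e_cam) ||^2 ]: the denoiser additionally
    receives the camera embedding (illustrative signature)."""
    e_cam = cam_encoder(c_cam)
    eps_pred = denoiser(x_t, t, e_cam)
    return F.mse_loss(eps_pred, eps)

# Toy usage with a placeholder denoiser that ignores its conditioning inputs:
cam_encoder = CameraEncoder()
denoiser = lambda x_t, t, e_cam: torch.zeros_like(x_t)
x_t = torch.randn(2, 3, 64, 64)
eps = torch.randn_like(x_t)
t = torch.randint(0, 1000, (2,))
c_cam = torch.tensor([[0.1, 0.0, 1.2], [0.0, -0.2, 1.0]])  # pan, tilt, zoom per sample
loss = camera_conditioned_loss(denoiser, cam_encoder, x_t, t, eps, c_cam)
```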
Finally, we employ a text-and-image-conditioned video generation model that takes both the keyframes and the corresponding textual prompts as input to generate temporally consistent and visually coherent videos. Let p_j represent the textual description of shot j, and I_j denote the keyframe for that shot. The goal is to generate a video segment V_j that is both temporally consistent with the keyframe and coherent with the prompt:

V_j = M_{T2V}(I_j, p_j, e_{cam}),

where V_j is the video sequence generated for shot j and M_{T2V} denotes the text-and-image-conditioned video generation model. By incorporating both the keyframe and the textual description into the video generation process, along with the camera movement parameters, the Shot-Level Video Generation Module ensures that each shot remains temporally coherent and visually consistent. This alignment with the narrative structure provided by the prompt allows for the generation of high-quality, smooth video segments, which are later integrated into the final long video.
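As with the previous modules, shot-level generation can be sketched as a simple loop over shots. Here generate_shot stands in for the text-and-image-conditioned video model; its name and interface are assumed for illustration.

```python
def generate_shot_videos(prompts, keyframes, camera_params, generate_shot):
    """Turn each (prompt, keyframe, camera parameters) triple into a short clip.

    generate_shot(prompt, keyframe, c_cam) -> list of frames (illustrative signature)
    """
    shots = []
    for prompt, keyframe, c_cam in zip(prompts, keyframes, camera_params):
        clip = generate_shot(prompt, keyframe, c_cam)  # 2-5 s of frames for shot j
        shots.append(clip)
    return shots
```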
The final module in the VideoGen-of-Thought framework is the Video Smoothing and Transition Module, responsible for ensuring temporal coherence and smooth transitions between the shot-level video segments generated in Module 3. The primary objective of this module is to create seamless transitions between consecutive shots, adjusting camera poses and blending motion dynamics to produce a cohesive long-form video.
Let V_j(T) represent the final frame of shot j, and V_{j+1}(1) represent the first frame of the subsequent shot j+1. The transition video segment T_j is generated to smoothly blend these two frames. This process can be formalized as a conditional diffusion process:

p_\theta(T_j | V_j(T), V_{j+1}(1)) = \mathcal{N}(T_j; \mu_\theta(V_j(T), V_{j+1}(1)), \Sigma_\theta(V_j(T), V_{j+1}(1))),

where \mu_\theta and \Sigma_\theta are the predicted mean and variance, and T_j represents the generated transition video. The latent representations of V_j(T) and V_{j+1}(1) condition the diffusion process, ensuring that the generated transition aligns with the visual dynamics of the adjacent shots.
To further refine the temporal coherence of transitions, we use Gaussian Process Regression (GPR) to interpolate between the latent embeddings of the final frame of one shot and the first frame of the next. Let z_j represent the latent embedding of the final frame V_j(T), and z_{j+1} represent the latent embedding of the first frame V_{j+1}(1). The latent transition is modeled as:

z(t) \sim GP(\mu(t), k(t, t')), \quad z(0) = z_j, \; z(1) = z_{j+1},

where GP denotes a Gaussian process with mean function \mu and kernel k(·, ·), ensuring smooth transitions between the latent embeddings z_j and z_{j+1} of the final and initial frames. This combined approach ensures that the transitions are temporally smooth, visually coherent, and consistent across the entire video.
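The sketch below shows one way such an interpolation could be realized with scikit-learn's GaussianProcessRegressor: the GP is fitted through the two boundary latents and queried at intermediate times to obtain latents for the transition frames. The number of transition frames, the RBF kernel settings, and the latent dimensionality are illustrative choices, not values from the framework.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def interpolate_latents(z_j, z_next, num_steps=8):
    """Fit a GP through the boundary latents z_j (end of shot j) and z_next
    (start of shot j+1), then predict latents for intermediate transition frames."""
    X = np.array([[0.0], [1.0]])          # normalized time of the two boundary frames
    Y = np.stack([z_j, z_next])           # shape (2, latent_dim)
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5))
    gp.fit(X, Y)
    ts = np.linspace(0.0, 1.0, num_steps)[:, None]
    return gp.predict(ts)                 # shape (num_steps, latent_dim), smooth in t

# Toy usage with random 64-dimensional latents (a real system would decode the
# interpolated latents back into transition frames):
z_j, z_next = np.random.randn(64), np.random.randn(64)
transition_latents = interpolate_latents(z_j, z_next, num_steps=8)
```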
To summarize, we proposed VideoGen-of-Thought, a novel framework for generating long videos through a multi-step, collaborative approach. By decomposing the problem into four distinct modules (script generation, keyframe production, shot-level video generation, and smoothing), we ensure both visual and narrative consistency throughout the generated video. Each module integrates into a cohesive pipeline, producing high-quality long videos with minimal manual intervention.