Hierarchical Spatial-Temporal Video Generation Architecture
Encoding video into multi-scale latent video tokens and decoding video tokens back into the pixel domain.
The VideoAR architecture integrates a video tokenizer that encodes video into multi-scale latent video tokens and decodes video tokens back into the pixel domain. It also includes a transformer-based token generator that receives text guidance to autoregressively generate video tokens. The latent code produced by the VAE is critical for representing video frames at varying levels of detail, capturing both spatial structure within frames and temporal dynamics across frames. These latent tokens are organized into multi-scale representations such as 1 × 1 × 1, 2 × 2 × 2, and 4 × 4 × 4, allowing the model to progressively refine the video content across different resolutions.
In VideoAR, the video tokenizer maps the input video frames into multi-scale latent spaces, and the latent tokens at different scales carry spatiotemporal information at varying levels of granularity. The latent tokens are therefore structured hierarchically, allowing for multi-scale processing in which each scale represents a finer or coarser resolution of both spatial and temporal features.
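The following is a minimal sketch of how such multi-scale tokenization could be realized. The function encode_multiscale, the scale schedule, and the pooling choice are illustrative assumptions, not the exact VideoAR tokenizer:

```python
import torch
import torch.nn.functional as F

# Illustrative scale schedule: (spatial s, temporal t), matching the
# 1 x 1 x 1, 2 x 2 x 2, and 4 x 4 x 4 grids mentioned above.
SCALES = [(1, 1), (2, 2), (4, 4)]

def encode_multiscale(latent: torch.Tensor, scales=SCALES):
    """Pool a dense VAE latent of shape (B, C, T, H, W) into a pyramid of
    coarse-to-fine spatial-temporal token grids, one per (s, t) scale."""
    pyramid = []
    for s, t in scales:
        # Each scale keeps t temporal slots and an s x s spatial grid.
        tokens = F.adaptive_avg_pool3d(latent, output_size=(t, s, s))
        pyramid.append(tokens)  # (B, C, t, s, s)
    return pyramid

# Example: a dense latent for 16 frames at 32 x 32 latent resolution.
latent = torch.randn(1, 8, 16, 32, 32)
for (s, t), z in zip(SCALES, encode_multiscale(latent)):
    print(f"scale {s}x{s}x{t}: {tuple(z.shape)}")
```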
The 3D latent tokens at scale (s, t), i.e., spatial scale s and temporal scale t, are denoted by Z_{s,t}, with a size of s × s × t. These tokens encapsulate both the spatial dimensions (height and width within a frame) and the temporal dimension (time or frames):

Z_{s,t} = {z_1, z_2, ..., z_{N_{s,t}}},

where N_{s,t} is the number of latent tokens at scale (s, t), and each token encodes a specific patch of the video across spatial and temporal dimensions. The hierarchical progression of scales starts from a coarse representation, such as 1 × 1 × 1, and refines the representation through larger latent token structures such as 4 × 4 × 4.
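Assuming a full s × s × t grid of tokens at every scale, the token count follows directly from the grid size:

N_{s,t} = s × s × t, e.g., N_{1,1} = 1, N_{2,2} = 8, and N_{4,4} = 64 for the example schedule above.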
The latent tokens represent the video content at varying levels of detail, both spatially and temporally. Starting at a coarse level, the latent tokens form a 1 × 1 × 1 grid, which represents the entire video at its coarsest spatial and temporal resolution. This captures the global structure of the video content but lacks fine details. The progression to finer scales is achieved by increasing the size of the token grid, such as 2 × 2 × 2, which divides both the spatial and temporal dimensions into finer patches:

1 × 1 × 1  →  2 × 2 × 2  →  4 × 4 × 4  →  ···
At each level of scaling, the spatial and temporal information is progressively refined, capturing both the intra-frame spatial details and the inter-frame temporal transitions.
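One common way to realize this coarse-to-fine refinement (borrowed from next-scale visual autoregressive models; whether VideoAR combines scales exactly this way is an assumption) is to upsample each scale's token grid to the dense latent resolution and sum the contributions, so every finer scale adds detail on top of the coarser ones:

```python
import torch.nn.functional as F

def accumulate_pyramid(pyramid, target_size=(16, 32, 32)):
    """Hypothetical coarse-to-fine accumulation: upsample each scale's token
    grid (B, C, t, s, s) to the dense latent size (T, H, W) and sum, so finer
    scales progressively refine the coarse reconstruction."""
    dense = 0
    for tokens in pyramid:  # ordered coarse -> fine
        dense = dense + F.interpolate(
            tokens, size=target_size, mode="trilinear", align_corners=False
        )
    return dense  # (B, C, T, H, W), ready for the VAE decoder
```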
The transformer-based video generator receives text tokens that serve as semantic guidance during the video generation process. These text tokens, denoted T_text, represent the information contained in the input text, such as the objects, actions, and scenes described. During inference, the text tokens are combined with the latent tokens generated at each iteration to condition the next-scale video token generation process, ensuring that the video content aligns with the semantics of the input text.
The text tokens are represented as:

T_text = {w_1, w_2, ..., w_L},

where L is the number of text tokens.
These text tokens are processed by a transformer, which integrates them with the spatial-temporal latent tokens to generate a video that aligns with the textual description.
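A minimal sketch of this text conditioning follows, assuming the simplest scheme in which the prompt embeddings are prepended to the flattened latent tokens of a scale before a standard transformer encoder mixes them; the class name and layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class TextConditionedScaleBlock(nn.Module):
    """Mix text tokens (T_text) with the latent tokens of one scale."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, latent_tokens, text_tokens):
        # latent_tokens: (B, N_st, dim) -- flattened tokens of the current scale
        # text_tokens:   (B, L, dim)    -- embedded prompt tokens
        x = torch.cat([text_tokens, latent_tokens], dim=1)
        x = self.encoder(x)
        # Keep only the latent positions, now conditioned on the text.
        return x[:, text_tokens.size(1):]

# Usage: condition the 2 x 2 x 2 scale (8 latent tokens) on a 12-token prompt.
block = TextConditionedScaleBlock()
out = block(torch.randn(1, 8, 256), torch.randn(1, 12, 256))
print(out.shape)  # torch.Size([1, 8, 256])
```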
At each spatial-temporal scale, the latent tokens from the VAE and the text tokens from the text tokenizer are combined through a transformer architecture. The latent tokens from both the spatial and temporal dimensions are progressively refined across multiple scales, ensuring that the video captures finer spatial details and more consistent temporal transitions as the model moves to higher scales.
The transformation from one spatial-temporal scale to the next can be represented as:

Z_{s',t'} = Transformer(Z_{s,t}, T_text),

where Z_{s',t'} represents the refined latent tokens at the next spatial-temporal scale (s', t'), and the transformer ensures that the video content adheres to both the spatial-temporal dynamics and the semantics of the text.
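Put together, the next-scale generation loop might look like the sketch below; token_generator and its scale argument are a hypothetical interface standing in for the actual VideoAR generator:

```python
def generate(token_generator, text_tokens, scales=((1, 1), (2, 2), (4, 4))):
    """Illustrative next spatial-temporal scale loop: the text tokens plus all
    previously generated (coarser) scales condition the prediction of the
    next, finer scale's latent tokens."""
    generated = []  # token tensors per scale, coarse -> fine
    for s, t in scales:
        context = [text_tokens] + generated
        # Hypothetical call: predict all s * s * t tokens of this scale at once.
        next_tokens = token_generator(context, scale=(s, t))
        generated.append(next_tokens)
    return generated
```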
The scale upsampler in VideoAR further refines the latent tokens by increasing both the spatial and temporal resolution. As the latent tokens progress through each scale, the upsampler enhances their resolution, ensuring that the final video output is detailed and temporally coherent. The upsampling process continues until the latent tokens reach the target resolution, producing a high-quality video output that is aligned with the input text.
The upsampling process can be modeled as:

Z_target = Upsample(Z_{s,t}),

where Z_target denotes the latent tokens at the target spatial-temporal resolution.
This equation represents the final refinement of the latent tokens through upsampling, ensuring that both spatial and temporal dimensions are sufficiently detailed.
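As a sketch, the upsampler could be as simple as trilinear interpolation of the token grid to the next scale's resolution; a learned upsampler is equally plausible, so this is only an assumption for illustration:

```python
import torch.nn.functional as F

def scale_upsample(tokens, next_scale):
    """Upsample a (B, C, t, s, s) token grid to the next (s', t') scale so it
    can be refined further, e.g. from 2 x 2 x 2 to 4 x 4 x 4."""
    s_next, t_next = next_scale
    return F.interpolate(
        tokens, size=(t_next, s_next, s_next),
        mode="trilinear", align_corners=False,
    )
```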
The Next Spatial-Temporal Scale mechanism in VideoAR leverages the latent tokens generated by a VAE to represent both spatial and temporal dimensions at varying scales. These latent tokens are progressively refined through hierarchical scaling, capturing fine-grained details at higher scales. The integration of text tokens allows for semantic conditioning, ensuring that the generated video content aligns with the textual input. The overall architecture ensures that the video is both spatially detailed and temporally coherent across multiple scales.