Autoregressive Modeling with Vector Quantization
Features and approaches of the first-of-its-kind foundational video AI model powering video agents in the Lyn ecosystem.
Vector quantization (VQ) plays a crucial role in video generation by encoding and decoding continuous feature spaces into discrete latent tokens. In the proposed Video VQ-VAE model, videos are compressed into quantized latent representations using 3D-Convolutional Networks, achieving an impressive compression ratio of 256. However, the quantization process faces two significant challenges: the gradient gap, caused by the non-differentiability of the quantization function during backpropagation, and codebook collapse, where only a small subset of codebook vectors are utilized. These issues are addressed through a distributional alignment strategy, aligning the distribution of latent vectors and codebook vectors using Wasserstein distance, thus minimizing quantization error and improving codebook utilization.
Building on the optimized vector quantization, a novel Token Masking strategy enhances the efficiency of autoregressive video generation by enabling non-sequential token prediction. This approach introduces a mask-and-predict mechanism where tokens are masked randomly across both spatial and temporal dimensions, and the model predicts the masked tokens based on the unmasked context. This strategy allows the model to capture dependencies across entire video sequences, significantly improving both computational efficiency and generation speed. By leveraging parallel token generation, the Token Masking strategy accelerates video synthesis and addresses the inefficiencies of traditional autoregressive models, which require sequential predictions.
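To make the mask-and-predict mechanism concrete, the sketch below shows one way such a training step could look in PyTorch. The tensor shapes, the `mask_token_id`, and the `model` callable are illustrative assumptions, not the actual implementation; in practice the mask token is usually a reserved extra vocabulary index.

```python
import torch
import torch.nn.functional as F

def random_spatiotemporal_mask(tokens, mask_ratio=0.5, mask_token_id=0):
    """Randomly mask discrete video tokens across both time and space.

    tokens: LongTensor of shape (B, T, H, W) holding codebook indices.
    Returns the masked token grid and a boolean mask of masked positions.
    """
    B = tokens.shape[0]
    flat = tokens.reshape(B, -1)                  # (B, T*H*W)
    num_mask = int(mask_ratio * flat.shape[1])

    # Per-sample random permutation; the first num_mask positions get masked.
    noise = torch.rand(B, flat.shape[1], device=tokens.device)
    ids = noise.argsort(dim=1)
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask.scatter_(1, ids[:, :num_mask], True)

    masked = flat.masked_fill(mask, mask_token_id)
    return masked.reshape(tokens.shape), mask.reshape(tokens.shape)

def masked_prediction_loss(model, tokens, mask_ratio=0.5, mask_token_id=0):
    """Cross-entropy on masked positions only. `model` is a placeholder
    bidirectional token predictor returning logits of shape (B, T*H*W, V)."""
    masked, mask = random_spatiotemporal_mask(tokens, mask_ratio, mask_token_id)
    logits = model(masked)
    flat_mask = mask.reshape(mask.shape[0], -1)
    target = tokens.reshape(tokens.shape[0], -1)
    return F.cross_entropy(logits[flat_mask], target[flat_mask])
```

Because every masked position is predicted from the full unmasked context in a single forward pass, many tokens can be generated in parallel at inference time instead of one by one.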
The architecture also incorporates a hierarchical spatial-temporal structure, combined with a hybrid Transformer-Mamba framework in the VMambAR model. This system encodes video content into multi-scale latent tokens, refining both spatial and temporal details at varying scales. The integration of text tokens enables semantic conditioning, allowing for coherent video generation aligned with textual inputs. Additionally, the Mamba framework enhances memory efficiency and supports rapid inference, making VMambAR suitable for handling complex video generation tasks. By combining these advanced strategies—vector quantization, token masking, hierarchical scaling, and the Mamba architecture—the system sets a new standard for efficient and scalable autoregressive video generation.
Vector Quantization (VQ) is a technique for efficient encoding and decoding of continuous feature spaces in video generation, where latent vectors representing image or video tokens are mapped to a finite set of codebook vectors. During training, images or videos are encoded into a discrete latent space by the encoder for the autoregressive generative model; at inference time, the latent vectors are decoded back by the decoder, as shown in Figure 2. The proposed Video VQ-VAE model adopts a 3D-ConvNet to spatio-temporally compress videos into quantized latent representations and can reconstruct them accurately at a 256 (4 × 8 × 8) compression ratio.
The quantization process faces significant challenges: (1) the gradient gap, stemming from the non-differentiability of the quantization function during backpropagation, and (2) codebook collapse, where only a small subset of codebook vectors is utilized, limiting the representational power of the model.
To formally define these issues, consider a VQ model where an encoder E(x) maps input data (e.g., a video) to continuous latent vectors z_e. A quantizer Q(·) then maps each continuous latent vector to its nearest codebook vector:

z_q = Q(z_e) = argmin_{e_k ∈ C} ||z_e − e_k||_2,

where C = {e_1, …, e_K} is the codebook containing K vectors e_k.
During backpropagation, the non-differentiability of Q(·) prevents direct gradient flow from the quantized vectors z_q to the continuous latent vectors z_e, which is typically addressed using a straight-through estimator (STE). However, the STE becomes unstable when the quantization error (i.e., the distance between z_e and z_q) is large, creating a significant gradient gap.
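For concreteness, here is a minimal PyTorch-style sketch of nearest-neighbour quantization with the straight-through estimator described above. The function and tensor names are illustrative placeholders, and the standard VQ-VAE codebook/commitment losses are included only to show where gradients flow; this is not the exact implementation.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    """Nearest-neighbour quantization with a straight-through estimator (STE).

    z_e:      (B, N, d) continuous latents from the encoder E(x).
    codebook: (K, d) codebook vectors e_k.
    """
    B, N, d = z_e.shape
    flat = z_e.reshape(-1, d)                    # (B*N, d)
    dist = torch.cdist(flat, codebook)           # Euclidean distance to every e_k
    indices = dist.argmin(dim=1).reshape(B, N)   # nearest codebook index per latent
    z_q = codebook[indices]                      # (B, N, d) quantized latents

    # Standard VQ-VAE terms: move the codebook toward the encoder outputs
    # and commit the encoder outputs to their chosen codes.
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    commit_loss = F.mse_loss(z_e, z_q.detach())

    # STE: copy gradients from z_q straight through to z_e. The estimator
    # degrades when ||z_e - z_q|| is large, which is the gradient gap above.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices, codebook_loss + beta * commit_loss
```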
The second issue, codebook collapse, occurs when only a small subset of codebook vectors is updated during training, while the rest remain unused. Formally, the codebook utilization is given by:

U(C, Z) = (1/K) Σ_{k=1}^{K} I(∃ z ∈ Z : Q(z) = e_k),

where Z is the set of quantized latent vectors and I(·) is the indicator function. Ideally, U(C, Z) should approach 1, indicating full utilization of the codebook.
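The utilization metric itself is straightforward to compute from the indices emitted by the quantizer; a possible helper (names are illustrative) is shown below.

```python
import torch

def codebook_utilization(indices, codebook_size):
    """U(C, Z): fraction of codebook entries assigned at least once.

    indices: LongTensor of quantized code indices, any shape.
    Values near 1 indicate full utilization; values near 0 signal collapse.
    """
    return torch.unique(indices).numel() / codebook_size
```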
To address both the gradient gap and codebook collapse, we propose aligning the distribution of the continuous latent vectors with the distribution of the codebook vectors. Let the latent vectors z_e follow a distribution P_A, and let the codebook vectors e_k follow a distribution P_B. Our goal is to minimize the discrepancy between these distributions, which we formalize using the Wasserstein distance:

W(P_A, P_B) = inf_{γ ∈ Γ(P_A, P_B)} E_{(z, e) ~ γ}[ ||z − e|| ],

where Γ(P_A, P_B) is the set of all joint distributions whose marginals are P_A and P_B.
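The Wasserstein term is expensive to compute exactly in high dimensions; one common, tractable surrogate is the sliced-Wasserstein distance, sketched below. This is only an illustrative stand-in for the alignment objective, not necessarily the estimator used in the actual model, and the names are placeholders.

```python
import torch

def sliced_wasserstein(z_e, codebook, num_projections=64):
    """Sliced-Wasserstein surrogate for W(P_A, P_B).

    Random 1-D projections of both point sets are sorted and compared,
    which solves optimal transport exactly along each projection.
    """
    d = z_e.shape[-1]
    z = z_e.reshape(-1, d)                            # empirical samples of P_A
    # Resample codebook rows so both sets have the same number of points.
    idx = torch.randint(0, codebook.shape[0], (z.shape[0],), device=z.device)
    e = codebook[idx]                                 # empirical samples of P_B

    theta = torch.randn(num_projections, d, device=z.device)
    theta = theta / theta.norm(dim=1, keepdim=True)   # random unit directions

    proj_z, _ = (z @ theta.t()).sort(dim=0)           # (M, L), sorted per direction
    proj_e, _ = (e @ theta.t()).sort(dim=0)
    return (proj_z - proj_e).pow(2).mean()
```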
For the optimal codebook distribution, we assume the continuous latent vectors are independently sampled from a distribution P_A with support Ω ⊂ R^d. The codebook vectors, drawn from a distribution P_B, should minimize the expected quantization error:

Q = E_{z_e ~ P_A}[ min_{e_k ∈ C} ||z_e − e_k||^2 ], with the K codebook vectors e_k drawn from P_B.
We show that the support of P_B must match that of P_A for the quantization error to asymptotically vanish as the number of codebook vectors K increases.
Theorem 1. Let Ω = supp(P_A) be a bounded open set. The codebook distribution P_B attains full utilization and vanishing quantization error asymptotically if and only if supp(P_B) = supp(P_A).
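The intuition behind Theorem 1 can be checked with a toy Monte Carlo experiment; the distributions and numbers below are illustrative assumptions, not taken from the paper.

```python
import torch

def toy_quantization(K, d=4, n=20000, matched=True, seed=0):
    """Latents z ~ P_A = Uniform(0,1)^d; codebook drawn from the same support
    (matched) or from the wider Uniform(-1,2)^d (mismatched).
    Returns mean nearest-neighbour squared error and codebook utilization."""
    g = torch.Generator().manual_seed(seed)
    z = torch.rand(n, d, generator=g)
    e = torch.rand(K, d, generator=g)
    if not matched:
        e = 3.0 * e - 1.0       # supp(P_B) strictly larger than supp(P_A)

    dist = torch.cdist(z, e)                         # (n, K)
    err = dist.min(dim=1).values.pow(2).mean().item()
    util = torch.unique(dist.argmin(dim=1)).numel() / K
    return err, util

for K in (64, 256, 1024):
    for matched in (True, False):
        err, util = toy_quantization(K, matched=matched)
        print(f"K={K:4d} matched={matched!s:5}  error={err:.4f}  utilization={util:.2f}")
```

By the theorem, only the matched configuration can drive the error toward zero as K grows while keeping utilization high; a mismatched codebook wastes entries outside supp(P_A).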
To validate our approach, we conducted a series of experiments comparing our distributional alignment method to baseline VQ models. We observed significant improvements in both codebook utilization and quantization error.
Figure 3 illustrates the quantization error Q as a function of various parameters, including the mean μ, variance σ^2, codebook size K, latent dimension d, and dataset size N. The results show that larger codebook sizes reduce quantization error across all settings, while higher latent dimensions and larger variances increase the error significantly.
Our method consistently outperforms the baseline, demonstrating lower quantization errors across various settings. Specifically, as shown in Figure 3(a), increasing the codebook size K leads to a significant reduction in quantization error, validating our theoretical findings. Similarly, Figures 3(b) and 3(c) indicate that higher latent dimensions and larger feature sizes increase the quantization error, highlighting the importance of balancing these parameters.
By aligning the distributions of the continuous latent vectors and the codebook vectors, our optimized vector quantization strategy effectively resolves both the gradient gap and codebook collapse issues. This leads to improved performance in video generation tasks, as the model can utilize the full capacity of the codebook and achieve lower quantization errors.