Efficient Autoregressive Video Generation via Token Masking
Efficient modeling of complex spatial-temporal dynamics in video data.
Building upon our optimized vector quantization strategy, we now address the challenge of efficiently modeling the complex spatial-temporal dynamics in video data for autoregressive generation. Traditional autoregressive models predict tokens sequentially, which can be computationally expensive and inefficient for high-dimensional video data. To overcome this limitation, we propose a novel Token Masking strategy that enables non-sequential token prediction by masking and predicting tokens across both spatial and temporal dimensions. This approach leverages the tokenized representation from vector quantization and allows for parallel token generation, significantly improving the speed and efficiency of video synthesis.
The core idea of Token Masking lies in its mask-and-predict strategy. Instead of predicting tokens strictly in a left-to-right or frame-by-frame order, we randomly mask a subset of latent tokens in both spatial and temporal dimensions. The model then predicts the masked tokens based on the unmasked context, effectively capturing dependencies across the entire video sequence.
Let $Z = \{Z_1, Z_2, \ldots, Z_T\}$ denote the sequence of latent tokens obtained from the VQ-VAE encoder for a video of length $T$, where each $Z_t = \{Z_t^{(1)}, Z_t^{(2)}, \ldots, Z_t^{(N)}\}$ represents the $N$ tokens at time $t$. During training, we randomly mask a subset of these tokens at positions $M \subset \{1, \ldots, T\} \times \{1, \ldots, N\}$, resulting in the corrupted sequence:

$$\tilde{Z}_t^{(i)} = \begin{cases} [\mathrm{MASK}], & (t, i) \in M, \\ Z_t^{(i)}, & \text{otherwise.} \end{cases}$$
The model is trained to predict the masked tokens $\hat{Z}_t^{(i)}$ for $(t, i) \in M$ using the unmasked tokens and any available conditional information, such as textual input. This approach allows the model to leverage both spatial and temporal context, improving its ability to model complex dependencies in video data.
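As a concrete illustration, the training step can be sketched as follows. This is a minimal sketch, not our exact implementation: the mask-token id, masking ratio, and function names are hypothetical placeholders, and the model is assumed to return per-position logits over the codebook. Only masked positions contribute to the cross-entropy loss.

```python
import torch
import torch.nn.functional as F

MASK_ID = 8192      # hypothetical id reserved for the [MASK] token
MASK_RATIO = 0.5    # hypothetical fraction of tokens to mask

def masked_prediction_loss(model, tokens):
    """tokens: (B, T, N) integer latent codes from the VQ-VAE encoder."""
    B, T, N = tokens.shape
    flat = tokens.reshape(B, T * N)

    # Sample a random mask over all spatial-temporal positions.
    mask = torch.rand(B, T * N, device=tokens.device) < MASK_RATIO

    # Replace masked positions with the [MASK] token id.
    inputs = flat.masked_fill(mask, MASK_ID)

    # The model is assumed to predict a distribution over codebook
    # entries at every position: logits has shape (B, T*N, vocab_size).
    logits = model(inputs)

    # Cross-entropy only on the masked positions.
    return F.cross_entropy(logits[mask], flat[mask])
```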
By enabling non-sequential token prediction, Token Masking allows for the parallel generation of multiple tokens. This significantly reduces the computational burden associated with traditional autoregressive models, which require sequential prediction and cannot fully exploit parallel processing capabilities. In our approach, the model can predict all masked tokens simultaneously, leveraging efficient parallel computation and accelerating the video generation process.
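One common way to realize this parallel generation is confidence-based iterative unmasking in the style of MaskGIT: start from a fully masked sequence and, at each step, commit the most confident predictions at many positions simultaneously. The sketch below assumes a cosine unmasking schedule and placeholder hyperparameters; it illustrates the general recipe rather than our exact decoding procedure.

```python
import math
import torch

@torch.no_grad()
def parallel_decode(model, seq_len, steps=8, mask_id=8192):
    # Start from an all-[MASK] sequence (batch size 1 for clarity).
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(1, steps + 1):
        logits = model(tokens)                    # (1, seq_len, vocab_size)
        conf, pred = logits.softmax(-1).max(-1)   # confidence per position

        # Only still-masked positions are candidates for this step.
        masked = tokens == mask_id
        conf = conf.masked_fill(~masked, float("-inf"))

        # Cosine schedule: how many tokens should remain masked.
        remain = int(seq_len * math.cos(math.pi / 2 * step / steps))
        n_commit = int(masked.sum()) - remain
        if n_commit <= 0:
            continue

        # Commit the n_commit most confident predictions in parallel.
        idx = conf[0].topk(n_commit).indices
        tokens[0, idx] = pred[0, idx]
    return tokens
```

Each forward pass fills in a batch of tokens rather than a single one, which is where the speedup over strictly sequential decoding comes from.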
To effectively capture both spatial details and temporal dynamics, we employ a combined spatial-temporal masking strategy. At each training iteration, we apply masking in both dimensions:
• Spatial Masking: Randomly mask tokens within individual frames to encourage the model to learn spatial dependencies and generate coherent visual content within frames.
• Temporal Masking: Randomly mask tokens across different frames at the same spatial locations to encourage the model to learn temporal dependencies and maintain consistency over time.
For example, with $T = 3$ frames and $N = 4$ tokens per frame, the masked tokens may look like:

$$\tilde{Z} = \begin{pmatrix} Z_1^{(1)} & [\mathrm{M}] & Z_1^{(3)} & Z_1^{(4)} \\ [\mathrm{M}] & [\mathrm{M}] & Z_2^{(3)} & Z_2^{(4)} \\ Z_3^{(1)} & [\mathrm{M}] & [\mathrm{M}] & Z_3^{(4)} \end{pmatrix},$$

where $Z_t^{(i)}$ represents the token at spatial position $i$ and time $t$, and $[\mathrm{M}]$ marks a masked position: the second spatial position is masked across all frames (temporal masking), while the remaining masks fall within individual frames (spatial masking). This combined masking strategy ensures that the model learns to capture both fine-grained spatial details and smooth temporal transitions; a minimal code sketch of both patterns follows.
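The sketch below builds the two masking patterns over a $(T, N)$ token grid and combines them. The masking ratios and function names are illustrative assumptions.

```python
import torch

def spatial_mask(T, N, ratio=0.3):
    """Mask random tokens independently within each frame (True = masked)."""
    return torch.rand(T, N) < ratio

def temporal_mask(T, N, ratio=0.3):
    """Mask the same spatial locations across all frames."""
    cols = torch.rand(N) < ratio              # choose spatial positions
    return cols.unsqueeze(0).expand(T, N)     # broadcast over time

def combined_mask(T, N, s_ratio=0.3, t_ratio=0.2):
    """Union of spatial and temporal masks, as applied during training."""
    return spatial_mask(T, N, s_ratio) | temporal_mask(T, N, t_ratio)
```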
Our Token Masking strategy is integrated within our autoregressive video generation architecture. The masked tokens are processed by a transformer-based model that predicts the missing tokens conditioned on the unmasked tokens and any additional inputs, such as text embeddings. This setup allows the model to generate video sequences that are coherent both spatially and temporally, while also adhering to any provided conditional information.
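A minimal sketch of this setup is given below: text embeddings are prepended to the partially masked token sequence, and a bidirectional transformer encoder attends over both. All dimensions and layer counts are placeholder assumptions, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class MaskedVideoTransformer(nn.Module):
    def __init__(self, codebook_size=8192, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        # One extra embedding slot for the [MASK] token.
        self.tok_emb = nn.Embedding(codebook_size + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, codebook_size)  # logits over codes

    def forward(self, tokens, text_emb):
        """tokens: (B, L) token ids; text_emb: (B, L_txt, d_model)."""
        x = torch.cat([text_emb, self.tok_emb(tokens)], dim=1)
        h = self.encoder(x)   # full bidirectional attention over text + video
        # Drop the text positions; predict codes at the video positions.
        return self.head(h[:, text_emb.size(1):])
```

Because attention is bidirectional rather than causal, every prediction can draw on unmasked context from anywhere in the video as well as from the text condition.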
In initial experiments, we observed that Token Masking improves the efficiency of video generation and enables the model to produce coherent video sequences. However, we also encountered challenges related to noise and temporal inconsistencies, particularly when masking a large proportion of tokens. We are exploring strategies to mitigate these issues, such as refining the masking ratio, improving the loss function, and incorporating additional regularization techniques.
To summarize, we presented our Token Masking strategy for efficient autoregressive video generation. By enabling non-sequential and parallel token prediction across spatial and temporal dimensions, our approach addresses the computational inefficiencies of traditional autoregressive models. This method leverages the tokenized representations from vector quantization and effectively captures complex spatial-temporal dependencies, paving the way for faster and more scalable video generation.