Autoregressive Text-to-Visual Generation via Hybrid Architecture
A unique hybrid architecture of Mamba and Transformer for visual generation.
Text-to-image (T2I) generation, a technology that converts textual descriptions into visual representations, has garnered significant interest in the fields of computer vision and artificial intelligence. This technique has broad applications across various domains including entertainment, design, and education, making it a crucial area of research. As artificial intelligence continues to evolve, the efficient and accurate generation of images from text represents a pivotal challenge.
Overview of Existing Technologies. Presently, the leading technologies in text-to-image generation predominantly utilize diffusion models. However, recent advancements have highlighted the potential of autoregressive models, such as Parti, VAR, and LlamaGen, in this domain. These models are noted for their rapid generation capabilities and compatibility with optimization strategies used in large language models (LLMs), enhancing the understanding of textual descriptions and ensuring consistency across generated images.
Challenges. Despite these advancements, several challenges persist in the development of high-quality text-generated imagery. Current hurdles include the scarcity of mature autoregressive text-to-image models and the difficulty in handling detailed, context-rich descriptions that ensure the relevance of generated images to their textual counterparts. Moreover, while autoregressive models demonstrate potential, they often fall short of diffusion models in terms of the quality of the generated images and require substantial training time.
Transformer Architecture. The Transformer employs a structured arrangement of self-attention and MLP (multi-layer perceptron) blocks, systematically organized around a residual pathway. This configuration supports smooth data flow and effective learning. A standard pre-norm transformer layer can be written as:

$$\hat{x}_l = x_l + \mathrm{Attn}\big(\mathrm{LN}(x_l)\big), \qquad x_{l+1} = \hat{x}_l + \mathrm{MLP}\big(\mathrm{LN}(\hat{x}_l)\big)$$

In these equations, $x_l$ denotes the activations in the residual pathway at layer $l$, illustrating the layered progression in which each stage incrementally builds upon the previous layer's output.
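For concreteness, the following is a minimal PyTorch sketch of this pre-norm residual layout; the dimensions, module names, and the use of `nn.MultiheadAttention` are illustrative choices, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm transformer layer: attention and MLP sub-blocks on a residual stream."""
    def __init__(self, dim: int = 512, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sub-block added back to the residual pathway.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # MLP sub-block applied to the updated residual pathway.
        x = x + self.mlp(self.norm2(x))
        return x

# Usage: a batch of 2 sequences of 16 tokens, each of width 512.
tokens = torch.randn(2, 16, 512)
out = TransformerBlock()(tokens)
```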
Mamba Architecture. Within the family of state space models (SSMs), Mamba stands out by using a linear dynamical system as its core sequence-mixing mechanism. Its key innovation is to make the control and observation matrices, as well as the discretization timestep, responsive to the input:

$$h_t = \bar{A}\, h_{t-1} + \bar{B}(x_t)\, x_t, \qquad y_t = C(x_t)\, h_t, \qquad \bar{A} = \exp\big(\Delta(x_t)\, A\big)$$

In these formulas, $x_t$ is the input at each position in the sequence, $h_t$ represents the internal state or 'memory' of the system, $y_t$ is the produced output, and $\Delta(x_t)$ is an input-dependent timestep that enhances the model's adaptability. The matrices $B$ and $C$ vary with the input, whereas the matrix $A$ remains constant, independent of input variations.
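As an illustration, below is a naive, sequential PyTorch implementation of this selective recurrence, using a simplified Euler discretization of $B$; the function name `selective_ssm` and the projection layers `B_proj`, `C_proj`, `dt_proj` are assumptions for the sketch, and the real Mamba kernel computes the same recurrence with an efficient parallel scan.

```python
import torch
import torch.nn.functional as F

def selective_ssm(x, A, B_proj, C_proj, dt_proj):
    """Naive selective state-space recurrence.
    x: (batch, seq_len, dim); A: (dim, state) is fixed, while B, C and the
    timestep dt are produced from the input at every step."""
    batch, seq_len, dim = x.shape
    state = A.shape[-1]
    h = torch.zeros(batch, dim, state, device=x.device)   # internal 'memory'
    ys = []
    for t in range(seq_len):
        xt = x[:, t]                                       # (batch, dim)
        dt = F.softplus(dt_proj(xt))                       # input-dependent timestep
        B = B_proj(xt)                                     # input-dependent control
        C = C_proj(xt)                                     # input-dependent observation
        dA = torch.exp(dt.unsqueeze(-1) * A)               # discretized A (ZOH)
        dB = dt.unsqueeze(-1) * B.unsqueeze(1)             # simplified discretized B
        h = dA * h + dB * xt.unsqueeze(-1)                 # state update
        y = (h * C.unsqueeze(1)).sum(-1)                   # readout y_t = C(x_t) . h_t
        ys.append(y)
    return torch.stack(ys, dim=1)                          # (batch, seq_len, dim)

# Usage with illustrative sizes.
dim, state = 64, 16
A = -torch.rand(dim, state)                 # negative entries keep the state stable
B_proj = torch.nn.Linear(dim, state)
C_proj = torch.nn.Linear(dim, state)
dt_proj = torch.nn.Linear(dim, dim)
y = selective_ssm(torch.randn(2, 32, dim), A, B_proj, C_proj, dt_proj)
```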
Mamba also uniquely combines token processing with sequence mixing. The block begins with a linear layer that splits the input into a processing vector and a gating vector. A 1D convolution then performs an initial round of local sequence mixing, setting the stage for the dynamic generation of the input-dependent parameters $B$, $C$, and $\Delta$ used by the SSM:

$$x,\, z = \mathrm{Linear}_{\mathrm{in}}(u), \qquad \tilde{x} = \mathrm{SiLU}\big(\mathrm{Conv1D}(x)\big), \qquad y = \mathrm{SSM}_{B(\tilde{x}),\,C(\tilde{x}),\,\Delta(\tilde{x})}(\tilde{x})$$

Following this, the SSM output is multiplied element-wise by the gating vector before being routed through another linear layer for the final output. This approach not only allows for intricate token processing, similar to the MLPs found in a Transformer, but also incorporates SSM-based sequence mixing. Consequently, while Mamba contains more sequence mixers than a Transformer of equivalent size, each mixer may be slightly less expressive than full self-attention.
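The block structure just described can be sketched as follows; the expansion factor, convolution kernel size, and the `nn.Identity` placeholder standing in for the selective SSM are illustrative assumptions rather than the exact settings used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Input projection splits the stream into a processing path and a gating
    path; a causal depthwise 1D conv mixes locally; an SSM mixes globally;
    the gated result goes through an output projection."""
    def __init__(self, dim: int = 256, expand: int = 2, conv_kernel: int = 4):
        super().__init__()
        inner = expand * dim
        self.in_proj = nn.Linear(dim, 2 * inner)            # -> [x, gate z]
        self.conv1d = nn.Conv1d(inner, inner, conv_kernel,
                                padding=conv_kernel - 1, groups=inner)
        self.ssm = nn.Identity()    # placeholder for the selective SSM sketched earlier
        self.out_proj = nn.Linear(inner, dim)

    def forward(self, u: torch.Tensor) -> torch.Tensor:     # u: (batch, seq, dim)
        x, z = self.in_proj(u).chunk(2, dim=-1)
        # Causal depthwise convolution performs the initial local sequence mixing.
        x = self.conv1d(x.transpose(1, 2))[..., : u.shape[1]].transpose(1, 2)
        x = F.silu(x)
        y = self.ssm(x)                                      # global sequence mixing
        y = y * F.silu(z)                                    # element-wise gating
        return self.out_proj(y)

# Usage: a batch of 2 sequences of 32 tokens of width 256.
out = MambaBlockSketch()(torch.randn(2, 32, 256))
```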
To address these challenges, we present VMambAR, a novel architecture that integrates the visual autoregressive (VAR) approach with a hybrid Transformer-Mamba framework, drawing inspiration from the research on Zamba. The model offers rapid inference and improved memory efficiency, handling extensive contextual information effectively; this integration yields superior performance and heightened operational efficiency.
VMambAR Architecture. The VMambAR model combines the contextual-comprehension strengths of Transformers (particularly noted for their few-shot learning capabilities) with the fast inference speed of Mamba and the robust scalability of VAR.
Data Preparation and Model Training. In terms of data preparation, we utilize the CC3M [51] dataset, which comprises 3 million text-image pairs, for our text-to-image training. The model undergoes training over 15 epochs using all available CC3M data, with a learning rate of 2e-4 and a batch size of 512. Training procedures are executed on 16 NVIDIA H20 GPUs.
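For reference, the stated hyperparameters can be collected into a single configuration sketch; the optimizer, weight decay, and learning-rate schedule entries are assumptions added for completeness, not confirmed settings.

```python
# Training configuration for VMambAR on CC3M.
# Values marked "assumption" are illustrative, not reported in the text.
train_config = dict(
    dataset="CC3M",                      # ~3 million text-image pairs
    epochs=15,
    learning_rate=2e-4,
    global_batch_size=512,
    num_gpus=16,                         # NVIDIA H20
    per_gpu_batch_size=512 // 16,        # 32 samples per GPU
    optimizer="AdamW",                   # assumption
    weight_decay=0.05,                   # assumption
    lr_schedule="cosine",                # assumption
)
```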
In our ongoing research, we aim to develop a VAR-based text-to-image generation model, VMambAR, that incorporates the advantages of Transformers and SSMs. The specific technical contributions include: an innovative model design that integrates multiple architectures to optimize the text-to-image generation process; performance tests conducted on various datasets to verify improvements over existing technologies; exploration of new training methods, such as knowledge distillation from the LLaMA-based model trained in LlamaGen; and the adoption of 2D rotary positional encoding (RoPE) to encode image token positions, preserving the spatial information of images.
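To illustrate the 2D RoPE idea, the sketch below applies standard rotary embeddings separately to the row and column indices of a row-major token grid, splitting the channel dimension in half; the function names and the half-and-half split are assumptions for the sketch, not the model's confirmed implementation.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard 1D rotary embedding. x: (..., seq, d) with d even; pos: (seq,)."""
    d = x.shape[-1]
    freqs = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = pos[:, None].float() * freqs[None, :]          # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                     # rotate each channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """2D RoPE for image tokens laid out row-major as a (height*width) sequence:
    the first half of the channels is rotated by the row index, the second half
    by the column index, so both spatial axes are encoded."""
    rows = torch.arange(height).repeat_interleave(width)     # (H*W,)
    cols = torch.arange(width).repeat(height)                # (H*W,)
    d = x.shape[-1]
    x_row, x_col = x[..., : d // 2], x[..., d // 2:]
    return torch.cat([rope_1d(x_row, rows), rope_1d(x_col, cols)], dim=-1)

# Usage: a 16x16 grid of image tokens with 64-dimensional features.
q = torch.randn(2, 16 * 16, 64)
q_rot = rope_2d(q, height=16, width=16)
```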
Conclusion. In this section, we discussed the innovations in autoregressive modeling employed by Everlyn. A key focus is vector quantization, which optimizes the encoding of video tokens into discrete latent spaces, essential for accurate video generation. We address major challenges such as codebook collapse and the gradient gap, offering solutions that ensure efficient token utilization. In addition, advanced techniques such as spatial-temporal masking and hierarchical scaling further enhance the model's ability to generate coherent, high-quality videos, both spatially and temporally. These innovations pave the way for more complex video generation tasks, including token-based generation with a hybrid architecture.
Central to VMambAR is an autoregressive backbone that sequentially generates image tokens from text, complemented by a VAR decoder that finalizes image generation. Notably, the architecture integrates a global shared self-attention (GSA) block after every $N$ Mamba units to enhance continuity and feature integration:

$$\hat{x}_l = x_l + \mathrm{GSA}\big(\mathrm{LN}(x_l)\big)$$

Thus, every $N$ blocks, the output of the GSA block is fused with the input and then processed through a Mamba block:

$$x_{l+1} = \hat{x}_l + \mathrm{Mamba}\big(\mathrm{LN}(\hat{x}_l)\big)$$
For our implementation, we fix $N$ to a constant value.
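A minimal sketch of this interleaving is given below, assuming a single attention module whose weights are shared across all GSA call sites; the depth, width, interleaving period, and the placeholder standing in for a real Mamba block are illustrative choices, not the confirmed VMambAR configuration.

```python
import torch
import torch.nn as nn

class HybridBackboneSketch(nn.Module):
    """Stack of Mamba blocks with one *shared* self-attention (GSA) block
    applied to the residual stream every `every_n` blocks."""
    def __init__(self, dim: int = 256, depth: int = 12, every_n: int = 6,
                 num_heads: int = 8, mamba_block_cls=None):
        super().__init__()
        if mamba_block_cls is None:
            # Placeholder mixer standing in for a real Mamba block
            # (e.g. the MambaBlockSketch shown earlier).
            mamba_block_cls = lambda d: nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d))
        self.mamba_blocks = nn.ModuleList([mamba_block_cls(dim) for _ in range(depth)])
        # One set of attention weights reused at every GSA call site.
        self.gsa_norm = nn.LayerNorm(dim)
        self.gsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.every_n = every_n

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, mamba in enumerate(self.mamba_blocks):
            if i > 0 and i % self.every_n == 0:
                # Global shared self-attention, fused back into the residual stream.
                h = self.gsa_norm(x)
                attn_out, _ = self.gsa(h, h, h, need_weights=False)
                x = x + attn_out
            x = x + mamba(x)                    # Mamba block on the residual stream
        return x

# Usage: 128 image tokens of width 256.
tokens = torch.randn(2, 128, 256)
out = HybridBackboneSketch()(tokens)
```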