Our Vision for Video AI and Human-Centric Agential Video

Photorealistic, human-centric synchronous video agents that can change human lives for the better.

At Lyn, we believe that virtual AI, also known as AI-powered virtual reality, will drive most technological and scientific breakthroughs, serving as a testbed for the physical world and ultimately advancing humanity and civilization. An open-source and decentralized virtual AI infrastructure and engine will make everyone an equal contributor and beneficiary in this exciting future.

Our Vision. Our vision for the future is one where physical reality becomes indistinguishable from the virtual. In this future, any reality can be generated using an ultra-low-latency generative AI engine capable of conjuring the seemingly impossible for anyone, anywhere. This makes everyone on Earth the creator, director, and owner of their own virtual content, such as a video or narrative that represents their life. Human and societal evolution can unfold in an instant, with virtual experiences helping to drive the advancement of the physical world.

We believe that Lyn will be the first ecosystem to empower billions of people as equal participants in this exciting future. Moreover, we believe that video-generative AI will serve as the first step towards this future. In many ways, everything that we see and experience today is an abstraction of video: a sequence of unfolding moments that constitute what we call life. Humanity and civilization develop from lived experiences. However, our lives are constrained by the laws of physics, the limitations of the physical world, and, of course, the boundaries of human existence. These constraints largely limit the progress of humanity and civilization.

Video Generative AI. Generative AI has recently made significant advancements in the development of large language models (LLMs). LLMs are being widely applied in various business domains, including conversational agents and chatbots (e.g., ChatGPT), code generation and debugging (e.g., Copilot), and creative writing and content generation. However, LLMs are inherently limited to text, which represents only a fraction of the world’s information. Indeed, text is a high-level abstraction of lived experience and lacks the precision and richness, such as subjective and sensory dimensions, of video and audio data.

Due to its inherent limitations, textual data can capture only a small portion of the information in lived experiences such as art, music, nature, and gaming. In fact, text alone captures at best around 20% of the information in experiential data. The video modality (images can be treated as one-frame videos) plays a crucial role in complementing text because it conveys information that cannot be fully expressed through language alone. Non-textual modalities enrich communication, deepen understanding, and provide a more comprehensive representation of the world. For instance, videos are indispensable for depicting movement, change, and interaction over time, creating a sense of immersion. This is particularly important in storytelling, entertainment (e.g., gaming), education, and marketing, where creating an emotional connection or conveying a mood is essential.

Generative AI is increasingly enabling personalized experiences, from immersive entertainment content to customized product designs in manufacturing. In sectors such as education, healthcare, and customer service, AI models are expected to deliver hyper-personalized solutions that adjust in real time to individual needs. Our long-term vision is to provide personalized and interactive entertainment experiences, with a particular focus on gaming, over the next decade. Advances in brain-computer interaction technologies could eventually enable control-free interaction, delivering fully immersive experiences. This development may ultimately pave the way for highly interactive and lifelike virtual worlds, pushing the boundaries of virtual reality and potentially realizing the concept of world engines.

The first step toward achieving this long-term vision is to develop video and multimodal generative foundational models capable of compressing and storing vast amounts of world information. These models will be able to recover and generate most of the world’s information in a coherent and creative manner. In the near term, these generative foundational models could be used to create art, music, stories, videos, interactive films, or even open virtual worlds. By collaborating with human creators, these models have the potential to accelerate content generation and foster new forms of expression.

Empowering Everyone to Be the Owner. Currently, the world’s leading video generation models are all closed-source, controlled by giant technology companies that restrict access to the broader population. While this approach may maximize revenue, it significantly limits the development of collective human intelligence and perpetuates inequality, restricting innovation, data ownership, and freedom.

Lyn believes in an open future for AI. The most powerful video AI foundational model should be accessible to everyone and serve as a platform for creating applications, services, and personalized experiences. Humanity cannot afford to be dependent on a few powerful companies when it comes to life-altering technologies like video AI.

Ultra-realistic, Human-centric Video Agents

As technology increasingly moves towards more immersive and intuitive experiences, human-centric video agents represent the next frontier in user interaction. Unlike text-based AI or even audio-driven models, video agents provide a more dynamic, engaging interface. They offer visual cues such as facial expressions [98], gestures, and lip synchronization [57], creating a more human-like interaction experience. This creates a deeper emotional connection between users and the agent, leading to improved engagement and more productive interactions.

In the near future, we can expect the number of video agents to surpass the number of people, as agents become crucial for managing tasks, interacting with digital environments, and representing users across various applications. Agents will be used to handle the overwhelming number of services, platforms, and data streams users deal with daily, freeing people from mundane tasks. Their ability to act autonomously, learn, and adapt to user preferences makes them an indispensable tool in the rapidly growing digital landscape.

Video agents also have the potential to bridge communication gaps, particularly in industries like customer service, healthcare, education, and entertainment. By providing a familiar, human-like interface, these agents reduce the friction that users might experience with traditional AI models [42], which often lack the nuanced, expressive qualities of human communication. For example, a video agent responding to a customer query with empathetic facial expressions and synchronized speech will likely improve the user's overall experience, leading to greater satisfaction.

Furthermore, human-centric video agents enhance accessibility. Individuals with disabilities, such as hearing or vision impairments, can benefit from agents that provide multimodal communication options, using both visual and auditory cues [68]. This is particularly important as technology becomes increasingly integrated into everyday life and global communication. In sum, human-centric video agents foster more natural, inclusive, and effective interactions, making them a key development in the future of AI interfaces.

Decentralization: Ownership, Privacy, and Control

One of the most transformative aspects of decentralized video agents is the shift in ownership and control from centralized corporations to individual users. Traditional AI models often store user data on centralized servers, raising concerns about privacy, data security, and misuse of personal information. As video agents proliferate, so does the need for data protection, including the de-identification of personal information and the secure handling of user data.

Decentralized video agents, built on blockchain and Web3 technologies [7], give users full control over their data and interactions [63]. By employing data lockers, personal data is stored in encrypted containers, ensuring that it can only be accessed by the agent with user consent. The decentralization of data through on-chain storage further strengthens privacy by enabling secure, auditable, and tamper-proof management of sensitive information.

Video agents will handle an enormous amount of personal data, from everyday preferences to voice patterns, communication history, and even physical appearance. Therefore, techniques like de-identification, where sensitive data points are removed or anonymized, are essential to prevent unauthorized access or misuse. Secure data handling is crucial as these agents will represent their users in digital environments, often interacting with multiple services on their behalf. Users must feel confident that their agents are trustworthy and that their privacy is maintained in these interactions.
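The de-identification step described above can be illustrated with a minimal sketch. It assumes a flat record with hypothetical field names (`name`, `email`, `user_id`, and so on); the source does not specify a schema or a particular technique, so the choice of dropping direct identifiers and replacing linkable ones with keyed hashes (HMAC-SHA256) is an assumption made purely for illustration.

```python
import hashlib
import hmac

# Hypothetical field names; the source does not define a data schema.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}      # removed outright
PSEUDONYMIZED_FIELDS = {"user_id", "voiceprint_id"}  # replaced by keyed hashes


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Return a copy of `record` with identifiers dropped or pseudonymized."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # drop the field entirely
        if field in PSEUDONYMIZED_FIELDS:
            # Keyed hash: stable for one key, not reversible without it.
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            cleaned[field] = digest.hexdigest()[:16]
        else:
            cleaned[field] = value
    return cleaned


record = {"user_id": 42, "name": "Ada", "email": "ada@example.com", "language": "en"}
safe = deidentify(record, secret_key=b"per-user-secret")
print(safe)
```

Because the hash is keyed, the same user maps to the same pseudonym only while the key is unchanged, which lets a service link records for one user without learning who that user is.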

Another key aspect of decentralization is the ability for users to truly own their agents. In a Web3 ecosystem, users can customize and control their agents, creating a personal digital assistant that is not dependent on any single platform or corporation. This level of personalization enhances the value of video agents, as they can evolve alongside the user’s needs. Moreover, the decentralized nature ensures that agents can operate across multiple platforms without being tied to any one ecosystem, providing greater flexibility and autonomy to users.

Integrating Our Technology for Advanced Video Agents

In the previous sections, we outlined our cutting-edge autoregressive video generation technology and the video foundation models. These technologies have formed the backbone of our advanced video agent framework. The autoregressive models we’ve developed are optimized for sequential data generation, making them ideal for generating temporally consistent video frames and high-quality visual outputs [17].

A critical reason video agents will dominate future AI interactions is their ability to communicate fluidly through video. Video allows agents to mirror human communication, expressing emotions and reactions in real-time, a necessity for fostering trust and understanding. Whether providing customer support or guiding a user through a task, real-time video communication enables agents to act and react like a human would, making interactions feel more natural and engaging.

Video agents will also have the ability to learn from minimal data. By capturing just a few samples—such as the user’s voice or a single image or video—these agents can be trained to emulate the owner’s personality, voice, and mannerisms. Over time, they will re-train continuously, adapting to the user's evolving preferences and becoming more accurate in their representation. This will allow agents to mirror the user’s communication style, tone, and even personal quirks, making them a seamless extension of the individual.

For instance, with just one image or video from a user's camera roll, the agent can begin to generate video responses that mimic the user's appearance and body language. With additional voice samples, the agent can replicate the user's voice, allowing it to communicate with other agents and humans alike in a fully personalized manner. As users continue to interact with their agents, the models will re-train on new data, becoming more refined and closely resembling their owner in both speech and behavior.

This integration allows our video agents to be deployed across various decentralized platforms, ensuring that users experience high-quality, interactive video interactions while maintaining control over their data and privacy. The synergy between autoregressive technology and video agents is fundamental in allowing these agents to evolve with human-like fluency and adaptive behavior, making decentralized video agents not only scalable but also deeply personalized.

Lyn AI Research and Focus

At Lyn, our research pushes the boundaries of video-generative AI by tackling fundamental industry challenges, including data engineering pipelines, autoregressive modeling, efficient long-form video generation capable of directly producing high-quality movies, and hallucination reduction, as illustrated in Figure 1.

What comes first is data. To meet the industry’s demand for training hyper-real video generative models, we developed a computationally efficient and high-quality data preprocessing pipeline that can process any video input and produce high-quality, preprocessed data for model training. This pipeline significantly accelerates the video clipping process, making it highly efficient for large-scale use. Recognizing the limitations of both closed-source and open-source models in generating long, temporally consistent videos, we introduce a new autoregressive model that accounts for long-range dependencies and generates video frames in real-time, enabling efficient and temporally coherent long-form video generation.

To enhance the video generation reliability, we reduce hallucinations at the data level by improving the quality of captions associated with the video data. By enriching and refining these textual descriptions, we enable our models to produce content that is accurate and semantically consistent, thereby significantly enhancing overall video generation reliability.

Data Engineering Pipeline

A data engineering pipeline transforms large-scale online-scraped data into high-quality training datasets, which are essential for model training. This process is particularly critical for video generation models, as dataset quality directly impacts model performance. Current data engineering pipelines face several challenges, including prolonged video clipping times and discrepancies between automated scoring metrics and human subjective perception. To address these issues, we developed a robust data preprocessing tool that improves upon existing methods, focusing primarily on the following areas:

  • Scene Detection: To ensure trained models generate coherent videos, it is common practice to train video generative models on videos that consist of a single scene. We developed advanced algorithms to accurately classify video scenes. This classification ensures that the preprocessed data are well-structured, enabling models to generate contextually relevant and consistent content.

  • Video Clipping: Our pipeline leverages advanced scene detection techniques combined with multithreading to optimize video clipping time, reducing clipping time by over 75% compared to existing methods. Additionally, our new video clipping algorithm speeds up dataset preparation without compromising quality.

  • Video Filtering: Not all raw data can be used directly for model training, as some of it is of low quality or highly corrupted and may be detrimental to training. To address this, we introduced a series of video data filtering criteria, such as aesthetic and perceptual measures, that better align with human subjective judgments. This filtering ensures that the preprocessed data exhibit desirable properties such as better aesthetics, enhanced visual quality, and fast motions, all of which contribute to improved model convergence and performance during the training stage.

  • Video Caption: The development of large multimodal models is critical for video generation, especially in text-to-video generative models, as these models need to align multiple modalities. Existing methods struggle with capturing fine-grained details and temporal relationships, and the literature lacks comprehensive evaluation frameworks. To address these challenges, we benchmarked a large-scale dataset of 10 million video clips derived from over 230,000 documentaries and developed a framework for generating high-quality video captions. Specifically, we first introduced a robust evaluation framework that assesses captions based on their ability to capture dynamic sequences, context, and camera techniques. Using the Gemini-1.5-Pro model in conjunction with the evaluation framework, we developed an advanced annotation pipeline that produces descriptive and contextually rich captions, resulting in significant improvements in downstream video understanding and generation tasks.
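The first three stages of the list above (scene detection, clipping, filtering) can be sketched end to end on synthetic data. This is not the pipeline described in the text: the histogram-difference cut detector, the thresholds, the `clip` stand-in, and the length-based filter are all assumptions chosen to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def histogram(frame, bins=8):
    """Normalized intensity histogram of a frame given as a flat list of 0-255 values."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]


def detect_cuts(frames, threshold=0.5):
    """Indices where the histogram changes sharply between consecutive frames."""
    cuts = []
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        distance = sum(abs(a - b) for a, b in zip(prev, cur))  # L1 distance in [0, 2]
        if distance > threshold:
            cuts.append(i)
        prev = cur
    return cuts


def split_scenes(num_frames, cuts):
    """Turn cut indices into (start, end) frame ranges, end exclusive."""
    bounds = [0] + cuts + [num_frames]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


def clip(frames, span):
    """Stand-in for real clip extraction (for example, an encoder call)."""
    start, end = span
    return frames[start:end]


def clip_all(frames, spans, workers=4):
    """Extract every scene concurrently; results keep the order of `spans`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: clip(frames, s), spans))


def keep_clip(clip_frames, min_frames=3):
    """Toy filter: drop very short clips. A real filter would score quality."""
    return len(clip_frames) >= min_frames


# Synthetic video: 4 dark frames, 5 bright frames, 2 mid-gray frames.
video = [[10] * 16] * 4 + [[240] * 16] * 5 + [[128] * 16] * 2
cuts = detect_cuts(video)
spans = split_scenes(len(video), cuts)
clips = [c for c in clip_all(video, spans) if keep_clip(c)]
print(cuts, spans, len(clips))
```

In Python, threads mainly help when the per-clip work is I/O-bound or runs outside the interpreter (such as an external encoder process); for pure-Python work like this toy `clip`, they only illustrate the structure.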

Autoregressive Modeling

Tokenization is an important technique for compressing high-dimensional data into discrete tokens in a low-dimensional latent space. Vector quantization (VQ) is a popular tokenization method due to its computational efficiency. However, the quantization process in VQ and its variants faces significant challenges, notably the gradient gap issue arising from the non-differentiability of the quantization function during backpropagation, and the codebook collapse issue, where only a small subset of codebook vectors is utilized. These issues limit the model's representational power and thus hinder effective learning of complex structures such as temporal dynamics. To address these issues, we introduce an improved vector quantization strategy that matches the distributions of continuous latent vectors and codebook vectors using the Wasserstein distance. This approach effectively reduces quantization errors and prevents codebook collapse, thereby preserving the fidelity of the compressed video data.
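To make the link between codebook collapse and distribution mismatch concrete, here is a small one-dimensional sketch. It is an illustration under simplifying assumptions, not the training objective described above: latents and codebook entries are scalars, both sets have the same size, and the Wasserstein-1 distance is computed with the sorted-matching formula that is exact for equal-size 1-D empirical distributions. All function names are hypothetical.

```python
def quantize(latents, codebook):
    """Map each scalar latent to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda k: abs(z - codebook[k])) for z in latents]


def codebook_usage(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical distributions."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


latents = [-1.5, -0.5, 0.5, 1.5]
collapsed = [9.0, 10.0, 11.0, 12.0]   # codebook far from the latents
matched = [-1.4, -0.6, 0.4, 1.6]      # codebook distributed like the latents

for name, codebook in [("collapsed", collapsed), ("matched", matched)]:
    idx = quantize(latents, codebook)
    usage = codebook_usage(idx, len(codebook))
    print(name, usage, round(wasserstein_1d(latents, codebook), 3))
```

In the collapsed case every latent falls on the same entry, so usage is low and the distance is large; when the codebook distribution matches the latents, each entry is used and the distance is small. A distribution-matching loss would push the first situation toward the second.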

Building on tokenized latent representation, most existing video generation pipelines rely on diffusion transformers (DiTs), which are computationally expensive and limited to producing short video clips. It is also challenging, if not impossible, to build real-time interactive video AI agents, either in the physical or virtual world, using DiTs due to their inherent limitations. To address these inefficiencies, we developed a new autoregressive model for video generation, capable of generating video frames in real-time, similar to ChatGPT’s next-token prediction mechanism. As a result of this next-frame-generation approach, our autoregressive model can produce videos of arbitrary length. Technically, we introduce a novel token masking strategy for video generation that allows for non-sequential token prediction by masking and predicting tokens across both spatial and temporal dimensions. This method enables the parallel generation of multiple video tokens, significantly improving both the speed and computational efficiency of video synthesis. Furthermore, by incorporating multi-scale spatial-temporal representations within our architecture, we achieve high fidelity in both intra-frame spatial details and inter-frame temporal dynamics.
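The idea of predicting many masked tokens in parallel across space and time can be sketched as an iterative unmasking loop. The text does not specify the masking schedule or how tokens are chosen, so the cosine schedule, the confidence-based selection, and the random `toy_predict` stand-in for the model are all assumptions made for illustration.

```python
import math
import random

MASK = -1


def tokens_to_keep_masked(step, total_steps, num_tokens):
    """Cosine schedule: how many tokens remain masked after `step` (1-based)."""
    if step >= total_steps:
        return 0
    return math.floor(num_tokens * math.cos(math.pi / 2 * step / total_steps))


def toy_predict(tokens, rng, vocab_size=16):
    """Stand-in for the model: propose (token, confidence) for every masked position."""
    return {pos: (rng.randrange(vocab_size), rng.random())
            for pos, tok in tokens.items() if tok == MASK}


def generate(frames=2, height=2, width=4, total_steps=4, seed=0):
    rng = random.Random(seed)
    # One token per (time, row, column) position; all start masked.
    tokens = {(t, y, x): MASK
              for t in range(frames) for y in range(height) for x in range(width)}
    filled_per_step = []
    for step in range(1, total_steps + 1):
        proposals = toy_predict(tokens, rng)
        target_masked = tokens_to_keep_masked(step, total_steps, len(tokens))
        num_to_fill = len(proposals) - target_masked
        # Commit the most confident proposals, wherever they sit in space or time.
        ranked = sorted(proposals, key=lambda p: proposals[p][1], reverse=True)
        for pos in ranked[:num_to_fill]:
            tokens[pos] = proposals[pos][0]
        filled_per_step.append(num_to_fill)
    return tokens, filled_per_step


tokens, filled_per_step = generate()
print(filled_per_step)
```

With 16 tokens and 4 steps, the whole grid is filled in 4 model calls instead of 16 sequential ones, which is the source of the speed-up that parallel token prediction aims for.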

From Video to Movie

Both open- and closed-source text-to-video generative models are already in use. However, existing work lacks controllable video editing tools, for example for editing an object or the background scene of a video. To address these limitations, we introduce composite video generation (COVG), a framework that integrates both text and video conditions into the generation process. This approach allows for controllable video editing. Specifically, COVG allows users to modify specific aspects of a video, such as the object appearance or the background scene, through text-based instructions while preserving other aspects of the source video.

Generating long, coherent, and visually consistent videos is a complex problem, requiring sophisticated modeling of both content and temporal coherence. Current video generation models often struggle with maintaining consistency across different frames, particularly when generating long videos. To address this challenge, we present VideoGen-of-Thought, a four-module framework inspired by the "thought process" behind the creation of long videos, as if an AI-driven movie director constructs the scenes step-by-step. Each module operates sequentially, with Module-1 generating textual descriptions of each video shot, Module-2 generating keyframes, Module-3 converting those keyframes into videos, and Module-4 handling transitions and camera adjustments. Each module is modeled mathematically to ensure the consistency of avatars, environments, and transitions across the entire video.
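The sequential structure of the four modules can be shown with a skeleton in which each stage is a placeholder. The function names, the `Shot` record, and the string outputs are hypothetical; a real system would call generative models at each stage, and nothing here reflects how the modules are actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    description: str
    keyframe: str = ""
    clip: str = ""


def write_shot_descriptions(story, num_shots):
    """Module-1: one textual description per shot."""
    return [Shot(description=f"{story}, shot {i + 1}") for i in range(num_shots)]


def generate_keyframes(shots):
    """Module-2: a keyframe for each description (placeholder string here)."""
    for shot in shots:
        shot.keyframe = f"keyframe<{shot.description}>"
    return shots


def keyframes_to_clips(shots):
    """Module-3: turn each keyframe into a short video clip."""
    for shot in shots:
        shot.clip = f"clip<{shot.keyframe}>"
    return shots


def add_transitions(shots):
    """Module-4: join clips with transitions and camera adjustments."""
    return " -> ".join(shot.clip for shot in shots)


def make_long_video(story, num_shots=3):
    """Run the four modules in order, each consuming the previous output."""
    shots = write_shot_descriptions(story, num_shots)
    shots = generate_keyframes(shots)
    shots = keyframes_to_clips(shots)
    return add_transitions(shots)


movie = make_long_video("a day at the harbor")
print(movie)
```

Keeping a shared per-shot record means later modules can read everything earlier modules produced, which is one simple way to carry consistency information (characters, setting) through the pipeline.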

The performance of existing text-to-video generative models often falls short in terms of visual fidelity, coherence, and text-video alignment. To address these issues, we propose a post-processing approach that leverages rich human feedback to enhance the quality of generated videos. Human feedback, which captures nuanced visual, narrative, and contextual information, provides essential guidance for refining video outputs. Incorporating human evaluations helps correct technical distortions such as artifacts, blurring, and inconsistent frame transitions, improve motion coherence, and elevate overall aesthetic quality. This approach will be evaluated through both subjective human assessments and quantitative video quality metrics.

Reducing MLLM Hallucinations

Accurate and complete captions are essential for high-quality video generation, as they provide critical semantic grounding and help ensure that the generated content aligns with the text input. Despite recent advancements in multimodal large language models (MLLMs), challenges remain in generating accurate and thorough captions.

While recent MLLMs have achieved significant progress, they still exhibit vulnerability in providing incorrect answers even when correctly interpreting visual content, especially when confronted with misleading or leading questions. This reflects a broader issue of robustness in MLLMs when exposed to adversarial prompts, where models generate incorrect responses despite accurate visual comprehension. To address this, we introduce the MultiModal Robustness (MMR) benchmark, designed to rigorously evaluate MLLMs' understanding and resilience against such misleading prompts. The MMR benchmark includes paired positive and negative questions across various categories, revealing the susceptibility of MLLMs to adversarial inputs. We further strengthen model robustness by incorporating a new training set of paired visual question-answer samples, enabling fine-tuning that significantly improves resistance to misleading prompts and enhances answer accuracy.
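One way to score paired positive and negative questions is sketched below. The metric names and definitions (`both_correct`, `misled`) are assumptions for illustration and are not taken from the MMR benchmark itself; the input is simply one pair of booleans per question pair.

```python
def paired_scores(results):
    """Score paired questions.

    results: list of (positive_correct, negative_correct) booleans, one per pair,
    where the negative question is the misleading variant of the positive one.
    """
    n = len(results)
    return {
        "positive_accuracy": sum(p for p, _ in results) / n,
        "negative_accuracy": sum(q for _, q in results) / n,
        # Pair counts only if the model answers both variants correctly.
        "both_correct": sum(p and q for p, q in results) / n,
        # Understood the image (positive correct) but was misled by the prompt.
        "misled": sum(p and not q for p, q in results) / n,
    }


results = [(True, True), (True, False), (True, False), (False, False)]
scores = paired_scores(results)
print(scores)
```

The `misled` rate isolates the failure mode described above: cases where the model demonstrably understands the visual content yet gives a wrong answer once the question is phrased misleadingly.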

Moreover, we tackle the problem of MLLMs generating content misaligned with the visual input due to phenomena such as attention collapse and over-reliance on anchor tokens. To address these hallucinations, we propose a Text-relevant Visual Token Selection mechanism that filters out redundant or irrelevant visual tokens early in the decoding process, preventing the collapse of attention onto a narrow set of tokens. This mechanism incorporates anchor token theory to detect and mitigate attention collapse, ensuring that the model remains focused on the appropriate visual tokens during inference. By distributing attention more effectively across relevant tokens and preventing over-concentration, this approach enhances the model’s ability to generate visually grounded and accurate content. Empirical results demonstrate that this method reduces hallucinations and improves the precision of generated descriptions, outperforming existing methods such as LLaVA-1.5.
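The filtering idea can be sketched as a top-k selection of visual tokens by relevance to the text. The scoring rule used here (cosine similarity against a single text embedding) and the function names are assumptions; the text does not state how relevance is computed, and the anchor-token analysis is not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_visual_tokens(visual_tokens, text_embedding, keep):
    """Return indices of the `keep` visual tokens most similar to the text.

    Indices are returned in their original order so positional structure survives.
    """
    ranked = sorted(range(len(visual_tokens)),
                    key=lambda i: cosine(visual_tokens[i], text_embedding),
                    reverse=True)
    return sorted(ranked[:keep])


text_embedding = [1.0, 0.0]
visual_tokens = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
kept = select_visual_tokens(visual_tokens, text_embedding, keep=2)
print(kept)
```

Dropping low-relevance tokens before decoding shrinks the set that attention can concentrate on, which is the intuition behind filtering early rather than correcting outputs afterwards.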

Audio-Driven Video Generation

For the creation of photorealistic video agents, recent advancements in audio-driven video generation have introduced innovative methods for synthesizing highly realistic and temporally coherent visual content from audio inputs. We have developed key advancements in three areas: text-to-speech synthesis using diffusion bridge models, expressive talking head generation, and real-time conversational avatar animation. Each presents distinct contributions to overcoming the limitations of traditional approaches, particularly in generating high-fidelity, synchronized, and interactive audiovisual outputs.

The task of converting text into natural-sounding speech has historically relied on autoregressive models, which often produce machine-like output and are hindered by real-time application constraints. Diffusion-based models offer an alternative with superior speech quality but face challenges in efficiency and computational demands. To address these limitations, we use a Bridge-TTS model that leverages clean, deterministic priors derived from text representations to synthesize high-quality speech. By implementing the Diffusion Bridge, the model transitions between text latents and mel-spectrogram outputs using stochastic differential equations (SDEs), significantly improving real-time performance while maintaining high fidelity. The integration of ODE-based sampling further accelerates inference, enabling real-time TTS with fewer steps and greater accuracy compared to previous methods like Grad-TTS.
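The notion of a process pinned at both ends, a text-derived latent on one side and a mel-spectrogram on the other, can be illustrated with the simplest such process, a Brownian bridge with constant noise scale. This is a simplified stand-in, not the model described above: the actual noise schedule is not given in the text, and the orientation used here (t = 0 at the mel-spectrogram, t = 1 at the text latent) is a convention chosen for the sketch.

```python
import math
import random


def bridge_sample(mel, text_latent, t, sigma=1.0, rng=None):
    """Sample x_t from a Brownian bridge pinned at `mel` (t=0) and `text_latent` (t=1).

    Marginal: mean (1 - t) * mel + t * text_latent, variance sigma^2 * t * (1 - t).
    """
    rng = rng or random.Random(0)
    std = sigma * math.sqrt(t * (1.0 - t))
    return [(1.0 - t) * m + t * p + std * rng.gauss(0.0, 1.0)
            for m, p in zip(mel, text_latent)]


mel = [0.2, -0.4, 1.0]          # toy "mel-spectrogram" values
text_latent = [0.0, 0.0, 0.5]   # toy deterministic prior derived from text

start = bridge_sample(mel, text_latent, t=0.0)
end = bridge_sample(mel, text_latent, t=1.0)
middle = bridge_sample(mel, text_latent, t=0.5, rng=random.Random(1))
print(start, end, middle)
```

The noise vanishes at both endpoints, so the process starts exactly at one signal and ends exactly at the other. Starting generation from an informative text-derived prior rather than pure noise is what distinguishes a bridge from a standard diffusion prior.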

The generation of lifelike talking faces for video agents requires capturing the complex interplay between facial dynamics, lip synchronization, and head movements, which are critical for realism in applications ranging from human-computer interaction to virtual avatars. Our proposed framework introduces a diffusion-based model that integrates audio and video inputs to generate highly expressive talking head videos from a single portrait image. Key components of this pipeline include an audio processor that extracts latent speech features, a video processor that captures head motion and facial expressions, and a landmark processor for extracting and modulating facial landmarks. These elements work together to generate dynamic facial movements, which are synthesized through a diffusion-based appearance generator and a warp model. This approach produces temporally coherent and high-fidelity videos with precise lip synchronization and natural head movements, advancing the state-of-the-art in expressive talking head generation.

Real-time interactive avatars are essential if video agents are to perform well in conversational and multimedia applications, yet traditional methods face significant latency issues due to the complexity of generating video frames in sync with live audio input. To overcome this, we propose a novel two-phase framework that combines pre-generated base frames with dynamic frame matching and real-time interpolation. The system uses real-time video frame interpolation techniques, such as RIFE, to ensure smooth transitions and high frame rates, enabling seamless user interactions. Additionally, the integration of large language models (LLMs) enriches the avatar’s capacity to engage in contextually relevant dialogue, incorporating sentiment analysis to reflect user emotions in facial expressions. This framework achieves real-time responsiveness, significantly improving user engagement by generating high-quality animations without compromising interactivity or output fidelity.
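The two-phase idea (look up a pre-generated base frame for each incoming audio feature, then insert in-between frames for smoothness) can be sketched as follows. Everything here is a toy assumption: frames are short lists of numbers, matching uses squared Euclidean distance on a hypothetical stored `feature`, and a linear blend stands in for a learned interpolator such as RIFE.

```python
def match_frame(audio_feature, base_frames):
    """Pick the pre-generated frame whose stored audio feature is closest."""
    return min(base_frames,
               key=lambda f: sum((a - b) ** 2 for a, b in zip(audio_feature, f["feature"])))


def blend(frame_a, frame_b, alpha):
    """Linear blend between two frames; a stand-in for learned interpolation."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(frame_a, frame_b)]


def animate(audio_features, base_frames, inbetweens=1):
    """Match one base frame per audio feature and insert blended frames between them."""
    output = []
    previous = None
    for feature in audio_features:
        current = match_frame(feature, base_frames)["pixels"]
        if previous is not None:
            for k in range(1, inbetweens + 1):
                output.append(blend(previous, current, k / (inbetweens + 1)))
        output.append(current)
        previous = current
    return output


# Phase 1 (offline): pre-generated base frames, each tagged with an audio feature.
base_frames = [
    {"feature": [0.0], "pixels": [0.0, 0.0]},   # e.g. mouth closed
    {"feature": [1.0], "pixels": [1.0, 1.0]},   # e.g. mouth open
]
# Phase 2 (online): live audio features arrive and are matched in order.
frames = animate([[0.1], [0.9]], base_frames)
print(frames)
```

Because the expensive generation happens offline, the online loop only performs a lookup and a cheap interpolation per frame, which is what keeps latency low.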
