Towards Intelligent Video Captioning and Annotation
Improving video captioning is another crucial focus in advancing multimodal large language models (MLLMs). During the development of Openlyn, we built a high-quality dataset of hundreds of millions of video clips and introduced a robust evaluation framework that assesses captions by how well they capture dynamic sequences, background details, and camera techniques. Using leading mainstream models, we generated dense captions averaging 427 tokens that describe main objects, environments, and stylistic elements. These captions significantly enhanced video understanding and generation tasks, yielding substantial improvements in the accuracy and coherence of AI-generated content.
The ability of Multimodal Large Language Models (MLLMs) to effectively understand and generate video content is crucial for unlocking their full potential across a wide range of applications [63, 66, 73]. The implications are vast, from revolutionizing autonomous driving and embodied intelligence to transforming education, healthcare, and entertainment, and ultimately to building a competent foundational video AI model [68]. High-quality video captioning stands as a critical bridge in this domain, enabling models to accurately interpret visual information and generate contextually relevant content. However, existing approaches to video captioning struggle to capture fine-grained details and temporal relationships, and lack comprehensive evaluation frameworks [7, 27, 30, 67].
Our work directly addresses these challenges by presenting a comprehensive pipeline for generating high-quality video captions and demonstrating their significant impact on downstream MLLM tasks. Our contributions are as follows:
Large-Scale, High-Quality Dataset: We develop a dataset of hundreds of millions of video clips, providing a diverse and extensive resource for training and evaluation.
Robust Evaluation Framework: We introduce a novel evaluation framework that moves beyond simple object recognition to assess captions based on their ability to capture dynamic sequences, background context, and camera techniques.
Advanced Annotation Pipeline: Leveraging the power of Gemini-1.5-Pro and our evaluation framework, we develop a robust annotation pipeline that generates dense, descriptive, and contextually rich captions.
Downstream Task Validation: We rigorously validate the effectiveness of our approach by demonstrating significant performance improvements on both video understanding and generation tasks.
Our work establishes a new standard in video captioning, pushing the boundaries of MLLM capabilities and paving the way for the next generation of intelligent video applications.
High-Quality Documentary Video
To test our intelligent video captioning pipeline in this dataset experiment, we work with a very large collection of high-quality documentaries. These documentaries cover a wide range of subjects, including natural scenery, interviews, and urban and rural settings. Their lengths vary from a few minutes to several hours, giving us a broad selection to draw from, and their resolutions range from 360p to 4K, providing good diversity for video generation tasks. We meticulously slice and screen these documentaries through a detailed video preprocessing pipeline, ultimately obtaining over one hundred million video clips.
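To make the slicing step concrete, below is a minimal sketch using PySceneDetect's content-aware detector and its ffmpeg-based splitter. This is an illustration only: the detector threshold is a placeholder, and the screening stages of the full preprocessing pipeline (aesthetic and motion filtering, etc.) are not reproduced here.

```python
# Minimal clip-slicing sketch, assuming PySceneDetect (>=0.6) and ffmpeg are installed.
# The detector threshold is an illustrative placeholder, not a production setting.
from scenedetect import detect, ContentDetector, split_video_ffmpeg

def slice_documentary(video_path: str) -> int:
    """Detect scene cuts in a documentary and export each scene as its own clip."""
    # ContentDetector flags a cut when frame-to-frame content changes sharply.
    scene_list = detect(video_path, ContentDetector(threshold=27.0))
    # Write one clip per detected scene via ffmpeg, using the default naming template.
    split_video_ffmpeg(video_path, scene_list)
    return len(scene_list)
```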
Video Caption Evaluation
Our evaluation philosophy is to measure how completely and accurately a caption reflects the video content. To this end, we design a comprehensive evaluation framework that assesses video captions along four dimensions: main objects, dynamic sequences, background and environment, and camera usage. We further divide these four dimensions into a total of 11 sub-dimensions and formulate scoring guidelines for each. Finally, we use GPT to inspect caption quality against the given scoring guidelines. Comparison with human evaluations shows a high level of consistency, validating the effectiveness of our method.
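A minimal sketch of such a GPT-based judge is shown below, assuming the OpenAI Python SDK. The judge model name, the 1-5 scale, and the use of a textual reference description of the video are illustrative assumptions; only the four top-level dimensions are listed in the text, so the 11 sub-dimensions are omitted.

```python
# Sketch of GPT-as-judge caption scoring. The model name, scale, and rubric wording
# are assumptions for illustration, not the exact production scoring guidelines.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCORING_GUIDELINES = """Score the caption against the reference video description on four
dimensions, each from 1 to 5: (1) main objects, (2) dynamic sequences,
(3) background and environment, (4) camera usage. Return JSON like
{"main_objects": 4, "dynamic_sequences": 3, "background": 5, "camera": 2}."""

def score_caption(reference: str, caption: str, model: str = "gpt-4o") -> dict:
    """Ask the judge model to rate a caption and parse the JSON scores."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SCORING_GUIDELINES},
            {"role": "user", "content": f"Reference:\n{reference}\n\nCaption:\n{caption}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```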
Video Caption Annotation
After model evaluation and human evaluation, we decide to use Gemini-1.5-Pro to annotate the sample data. First, we sample the video at 1 fps, keeping the highest-quality frame from each second. Then, we feed the frames, their corresponding timestamps in the video, and an instruction to generate dense captions into the model, producing dense captions with an average length of 427 tokens. These captions include descriptions of the main objects, background, lighting, style, and camera language in the video. Ultimately, we annotate millions of video clips using Gemini-1.5-Pro.
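A simplified sketch of this annotation step follows, assuming OpenCV for frame sampling and the google-generativeai SDK for the model call. Using Laplacian variance as the "highest quality" criterion, the prompt wording, and the API-key handling are all assumptions made for illustration.

```python
# Sketch of 1 fps frame sampling plus Gemini-1.5-Pro captioning.
# Sharpness-based frame selection and the prompt text are illustrative assumptions.
import cv2
from PIL import Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-pro")

def sample_frames(video_path: str):
    """Keep the sharpest frame from each one-second window (approx. 1 fps)."""
    cap = cv2.VideoCapture(video_path)
    fps = int(round(cap.get(cv2.CAP_PROP_FPS))) or 1
    frames, best, best_score, idx = [], None, -1.0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Laplacian variance as a simple sharpness proxy for frame quality.
        score = cv2.Laplacian(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), cv2.CV_64F).var()
        if score > best_score:
            best, best_score = frame, score
        idx += 1
        if idx % fps == 0:  # close out this one-second window
            frames.append((idx // fps - 1, Image.fromarray(cv2.cvtColor(best, cv2.COLOR_BGR2RGB))))
            best, best_score = None, -1.0
    cap.release()
    return frames  # list of (timestamp_seconds, PIL.Image)

def annotate(video_path: str) -> str:
    """Send sampled frames with timestamps and an instruction, return the dense caption."""
    parts = ["Describe the main objects, background, lighting, style, and camera "
             "language of this video as a dense caption."]
    for ts, img in sample_frames(video_path):
        parts += [f"Frame at {ts} s:", img]
    return model.generate_content(parts).text
```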
Video Caption Model
After obtaining the annotated data, we use the evaluation module to filter for quality, selecting millions of high-quality entries from the initial dataset. We then train a caption model based on LongLLaVA. The trained caption model performs excellently on our evaluation metrics and, thanks to its architecture, runs inference with very high efficiency. Ultimately, we use the caption model to annotate video clips, yielding millions of high-quality dense video captions.
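As an illustration, the quality filter can be as simple as thresholding the judge's scores. The sketch below reuses the hypothetical score_caption helper from the evaluation sketch above; the threshold and the record fields are assumptions.

```python
# Sketch of score-based filtering before caption-model training.
# Assumes score_caption() from the evaluation sketch; threshold and fields are illustrative.
def filter_annotations(samples, min_avg_score: float = 4.0):
    """Keep only clips whose judged caption quality clears the threshold."""
    kept = []
    for sample in samples:  # each sample: {"clip": ..., "reference": ..., "caption": ...}
        scores = score_caption(sample["reference"], sample["caption"])
        if sum(scores.values()) / len(scores) >= min_avg_score:
            kept.append(sample)
    return kept
```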
Downstream Task Verification
We use our annotated data to validate quality on two downstream tasks: video understanding and video generation. For video understanding, we adopt the LLaVA-series training strategy and incorporate our data into the alignment phase. The results show that high-quality video caption data significantly improves the model's performance on video understanding tasks. For video generation, we fine-tune a model based on the Open-Sora Plan using our caption data. The resulting model shows improvements in both the quality of the generated videos and the consistency between images and videos. These validations on the two downstream tasks confirm the reliability of our data quality.
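For the understanding track, the caption data must be converted into the conversation format consumed by LLaVA-style alignment training. A minimal conversion sketch is shown below; the "<video>" placeholder token, prompt text, and exact JSON schema are assumptions rather than the project's actual configuration.

```python
# Sketch of converting (video clip, dense caption) pairs into LLaVA-style
# alignment-stage records. The schema and prompt are illustrative assumptions.
import json

def to_llava_alignment_format(pairs, output_json: str) -> None:
    """pairs: iterable of (video_path, dense_caption) tuples."""
    records = []
    for i, (video_path, caption) in enumerate(pairs):
        records.append({
            "id": f"clip_{i:08d}",
            "video": video_path,
            "conversations": [
                {"from": "human", "value": "<video>\nDescribe this video in detail."},
                {"from": "gpt", "value": caption},
            ],
        })
    with open(output_json, "w") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```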
Our innovative evaluation framework ensures the accuracy and completeness of video captions, aligning closely with human evaluations. The annotated data, filtered and refined through our quality evaluation module, has proven to be invaluable in downstream applications, leading to notable improvements in model performance. This work not only addresses the existing limitations in video captioning but also sets a new benchmark for future research and applications in this domain.
Conclusion
We have demonstrated the importance of a robust and efficient data preprocessing and video captioning pipeline. Our preprocessing framework tackles common challenges in video generation, such as inaccurate segmentation, mismatched metrics, and inefficient filtering. By leveraging advanced tools for scene detection, aesthetic evaluation, and optical flow analysis, we have optimized data quality and processing speed, ensuring more relevant input for video models. The integration of a sophisticated video captioning pipeline further enhances the model’s ability to generate accurate, detailed content, significantly boosting performance in video understanding and generation tasks. Together, these systems lay a solid foundation for advancing AI video generation, setting new benchmarks for quality and efficiency. Moving forward, we will focus on addressing challenges in autoregressive modeling, building directly on the improved data from these pipelines.