Openlyn's Data Pre-processing Pipeline
A proprietary data pre-processing pipeline developed in-house for best-in-class speed and performance.
Data preprocessing has become a critical step in enhancing AI video generation. It involves filtering and optimizing datasets so that machine metrics align more closely with human perception. Advanced models like the Delegate Transformer have been fine-tuned to improve the assessment of aesthetic features, while optical flow methods help capture accurate object motion. Additionally, tools like DBNet++ are employed to filter out video frames with excessive text, and Places365 is used for scene classification, ensuring the quality and relevance of the training data. These improvements in preprocessing have resulted in more efficient video segmentation and data handling, providing a strong foundation for training advanced video models.
Existing data preprocessing tools revealed a number of issues during our testing: long video-cutting times, frequent scene transitions within the resulting clips, scoring metrics that diverge from human perception, and an incomplete filtering system. To address these problems, we have developed a proprietary data pre-processing tool. Compared to existing tools, ours significantly accelerates video clipping (Table 1), aligns scoring metrics more closely with human perception, and features a more comprehensive filtering system. Additionally, our tool can automatically process all videos on the system, ensuring efficient and consistent data handling. The following sections introduce the tool's implementation and performance comparisons.
Through extensive testing, we identified PySceneDetect as the best-performing video scene detection toolbox. We evaluated various detection methods on large-scale video datasets and optimized the method and parameters for the most accurate results. Once the time segments for cutting are identified, we use FFmpeg to split the videos. Compared to existing tools [79], our approach demonstrates a significant advantage in video segmentation.
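A minimal sketch of this detect-then-split flow, using PySceneDetect (0.6+) with its FFmpeg-based splitter, is shown below. The detector choice and threshold are illustrative defaults, not the tuned parameters referred to above.

```python
# Detect scene boundaries, then split the source video at those boundaries.
# ContentDetector(threshold=27.0) is PySceneDetect's default content-aware
# detector setting; the tuned parameters from our evaluation are not shown here.
from scenedetect import detect, ContentDetector, split_video_ffmpeg

def cut_video_into_scenes(video_path: str):
    # Content-aware detection based on frame-to-frame HSV differences.
    scene_list = detect(video_path, ContentDetector(threshold=27.0))
    # Split the video with FFmpeg at the detected scene boundaries.
    split_video_ffmpeg(video_path, scene_list)
    return scene_list

if __name__ == "__main__":
    for start, end in cut_video_into_scenes("input.mp4"):
        print(f"Scene from {start.get_timecode()} to {end.get_timecode()}")
```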
Aesthetic Score
Existing image aesthetics assessment (IAA) datasets mainly focus on evaluating the overall quality of images but lack detailed color annotations and contain limited color types or combinations. These datasets also suffer from selection bias. For instance, in the AVA dataset [43], approximately 50% of the images are black and white, outnumbering images of other dominant colors by a factor of 10 to 100. Similarly, the PCCD [4] and SPAQ [12] datasets have very few images featuring colors like "pink" or "violet." To address this problem, we utilize the Delegate Transformer model, pre-trained on the AVA dataset and fine-tuned on the Image Color Aesthetics Assessment (ICAA) dataset [20], which provides more accurate color aesthetics evaluations.
For video aesthetics assessment, we use the Aesthetic Score Predictor [50], a model trained on 176,000 Simulacra Aesthetic Captions (SAC) pairs, 15,000 LAION-Logos pairs, and 250,000 AVA image-text pairs. This model offers a more robust and detailed evaluation of the aesthetic quality of videos compared to traditional methods.
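The sketch below illustrates how such a predictor is typically applied per frame, assuming the common design of a linear head on CLIP ViT-L/14 image embeddings (as used by the LAION-style aesthetic predictors). The head checkpoint path is a placeholder, and averaging scores over sampled frames is one simple aggregation choice, not necessarily the one used in our pipeline.

```python
# Frame-level aesthetic scoring sketch: CLIP ViT-L/14 embedding -> linear head.
# "aesthetic_head.pt" is a placeholder for a released predictor checkpoint.
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai", device=device)

# Linear head mapping a 768-d CLIP image embedding to a scalar aesthetic score.
head = torch.nn.Linear(768, 1).to(device)
head.load_state_dict(torch.load("aesthetic_head.pt", map_location=device))  # placeholder weights
head.eval()

@torch.no_grad()
def frame_aesthetic_score(frame: Image.Image) -> float:
    image = preprocess(frame).unsqueeze(0).to(device)
    emb = clip_model.encode_image(image)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # normalized embeddings, as such heads expect
    return head(emb.float()).item()

def video_aesthetic_score(frames: list[Image.Image]) -> float:
    # Average per-frame scores over a handful of sampled frames.
    return sum(frame_aesthetic_score(f) for f in frames) / len(frames)
```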
Optical Flow Score
Existing methods for calculating optical flow often deviate significantly from human subjective perception [64]. Through our research, we found that traditional algorithms [3] already provide solutions for calculating local object motion in videos. Using these algorithms, we can accurately determine whether the motion in a video meets our desired criteria. Below is a detailed explanation of the algorithm.
Optical flow works on several assumptions:
The pixel intensities of an object do not change between consecutive frames.
Neighboring pixels have similar motion.

Consider a pixel I(x, y, t) in the first frame (the time dimension t is introduced because we are now working with videos). If the pixel moves by a distance (dx, dy) in the next frame, taken after a time interval dt, the first assumption gives:

I(x, y, t) = I(x + dx, y + dy, t + dt)
Taking a Taylor series approximation of the right-hand side, removing common terms, and dividing by dt, we arrive at the following equation:

f_x u + f_y v + f_t = 0

where:

f_x = ∂I/∂x,  f_y = ∂I/∂y,  f_t = ∂I/∂t,  u = dx/dt,  v = dy/dt
This is known as the Optical Flow equation. While we can compute f_x and f_y as image gradients, u and v remain unknown. Since there is one equation but two unknowns, this system is underdetermined. To solve this, the Lucas-Kanade method is applied.
The Lucas-Kanade method assumes that all neighboring pixels have similar motion. It selects a 3 × 3 pixel patch around the target point, allowing us to compute f_x, f_y, and f_t for these 9 points. We then have a system of 9 equations with 2 unknowns, which is overdetermined. The least squares method is used to find the optimal solution for u and v.
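For reference, the textbook closed-form least-squares solution over the patch (summing over the 9 points i) can be written as follows; this is the standard formulation rather than anything specific to our implementation:

$$
\begin{bmatrix} u \\ v \end{bmatrix}
=
\begin{bmatrix} \sum_i f_{x_i}^2 & \sum_i f_{x_i} f_{y_i} \\ \sum_i f_{x_i} f_{y_i} & \sum_i f_{y_i}^2 \end{bmatrix}^{-1}
\begin{bmatrix} -\sum_i f_{x_i} f_{t_i} \\ -\sum_i f_{y_i} f_{t_i} \end{bmatrix}
$$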
Note the similarity of the matrix being inverted to the Harris corner detector: it indicates that corners are better points to track. One challenge arises when dealing with larger motions. To address this, we apply pyramids, a technique where large motions are broken down into smaller motions as we move up the pyramid, allowing the Lucas-Kanade method to effectively calculate optical flow at different scales.
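The sketch below shows how a per-clip motion score could be computed with OpenCV's pyramidal Lucas-Kanade implementation. Tracking corner-like points and averaging their displacement magnitudes is an illustrative aggregation, not the exact scoring rule used in our pipeline.

```python
# Per-clip motion score based on pyramidal Lucas-Kanade optical flow (OpenCV).
import cv2
import numpy as np

def clip_flow_score(video_path: str, max_corners: int = 200) -> float:
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return 0.0
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    magnitudes = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Track corner-like points (good features, per the Harris-style criterion above).
        pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=7)
        if pts is not None:
            # Pyramidal Lucas-Kanade handles larger motions across pyramid levels.
            nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None,
                                                      winSize=(15, 15), maxLevel=3)
            good = status.reshape(-1) == 1
            if good.any():
                disp = (nxt[good] - pts[good]).reshape(-1, 2)
                magnitudes.append(np.linalg.norm(disp, axis=1).mean())
        prev_gray = gray
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0
```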
OCR Score
Certain videos, such as news broadcasts and advertisements, contain dense text scenes that are not suitable for training purposes. To filter out these types of videos, we apply Optical Character Recognition (OCR) to detect text within video frames and remove samples with an excessive amount of text. For this task, we use DBNet++ [35], which is specifically designed for detecting dense text in images and videos. This ensures that text-heavy videos are effectively filtered out, improving the overall quality and relevance of the training data.
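A sketch of the filtering logic is shown below. The `detect_text_polygons` callable stands in for a DBNet++ text detector (e.g. the MMOCR implementation); its exact API, the frame sampling rate, and the area-ratio threshold are all assumptions made for illustration.

```python
# Text-density filter sketch: sample frames, measure the fraction of the frame
# covered by detected text polygons, and reject clips above a threshold.
from typing import Callable, Sequence
import cv2
import numpy as np

Polygon = Sequence[tuple[float, float]]  # list of (x, y) vertices from the text detector

def text_area_ratio(frame: np.ndarray,
                    detect_text_polygons: Callable[[np.ndarray], list[Polygon]]) -> float:
    h, w = frame.shape[:2]
    total = 0.0
    for poly in detect_text_polygons(frame):
        total += cv2.contourArea(np.array(poly, dtype=np.float32))
    return total / (h * w)

def is_text_heavy(video_path: str,
                  detect_text_polygons: Callable[[np.ndarray], list[Polygon]],
                  sample_every_n: int = 30,
                  max_ratio: float = 0.05) -> bool:
    """Return True if text covers, on average, more than max_ratio of sampled frames."""
    cap = cv2.VideoCapture(video_path)
    ratios, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every_n == 0:
            ratios.append(text_area_ratio(frame, detect_text_polygons))
        idx += 1
    cap.release()
    return bool(ratios) and float(np.mean(ratios)) > max_ratio
```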
We utilize the Places365 dataset [80] to classify video scenes. Places365 is a subset of the larger Places2 database, designed specifically for scene recognition tasks. There are two versions: Places365-Standard and Places365-Challenge. The Places365-Standard training set includes approximately 1.8 million images from 365 scene categories, with a maximum of 5,000 images per category. The Places365-Challenge training set adds a further 6.2 million images, for a total of around 8 million images, with a maximum of 40,000 images per category.
We have trained several baseline Convolutional Neural Networks (CNNs) on the Places365-Standard dataset and used these models to classify the scenes in our video data. This allows us to categorize video clips based on the environments they depict, ensuring that we can accurately filter and label video content according to specific scene categories.
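The sketch below shows frame-level scene classification with a ResNet-18 baseline trained on Places365-Standard. The checkpoint and category-list filenames refer to the files released by the Places365 project; the local paths are placeholders, and the "module." prefix handling reflects how those released checkpoints are typically stored.

```python
# Scene classification sketch with a Places365-trained ResNet-18 baseline.
import torch
from torchvision import models, transforms
from PIL import Image

def load_places365_resnet18(checkpoint_path: str = "resnet18_places365.pth.tar"):
    model = models.resnet18(num_classes=365)
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    # Released checkpoints store weights under "state_dict" with a "module." prefix.
    state = {k.replace("module.", ""): v for k, v in ckpt["state_dict"].items()}
    model.load_state_dict(state)
    model.eval()
    return model

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def classify_scene(model, frame: Image.Image, categories: list[str]) -> str:
    # `categories` is the 365-entry label list (e.g. from categories_places365.txt).
    logits = model(preprocess(frame).unsqueeze(0))
    return categories[int(logits.argmax(dim=1))]
```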
For existing video generation models, the quality of the training videos plays a critical role in determining the final output. To improve training efficiency and provide richer information, we apply a detailed filtering mechanism and assign simple classification labels to each video, ensuring that more informative and relevant data is utilized during training. The filtering process we follow is described below.
For training purposes, we categorize video data according to the following ranking system, which combines aesthetic and color metrics. This enables us to fine-tune or pre-train models more effectively based on data quality:
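The concrete ranking table is not reproduced here; the sketch below only illustrates how combined aesthetic and color-aesthetic thresholds could map clips to quality tiers. The tier names and threshold values are hypothetical placeholders.

```python
# Illustrative quality-tier assignment combining aesthetic and color metrics.
# Tier names and thresholds are hypothetical, not the pipeline's actual ranking.
from dataclasses import dataclass

@dataclass
class ClipScores:
    aesthetic: float        # video aesthetic score (e.g. LAION-style predictor)
    color_aesthetic: float  # color aesthetics score (e.g. Delegate Transformer)

def quality_tier(s: ClipScores) -> str:
    # Hypothetical thresholds on a roughly 0-10 scale.
    if s.aesthetic >= 6.0 and s.color_aesthetic >= 6.0:
        return "tier_1_finetune"   # highest quality, reserved for fine-tuning
    if s.aesthetic >= 4.5:
        return "tier_2_pretrain"   # acceptable quality, used for pre-training
    return "tier_3_filtered"       # excluded from training
```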
Through extensive testing and comparison of existing video data preprocessing tools and various state-of-the-art models, we have developed an optimal data preprocessing tool. Our solution outperforms existing methods in both speed and performance. Additionally, we have automated the entire pipeline: it runs at regular intervals (every m minutes), automatically detecting and processing pending video data. The system also manages interruptions, allowing data processing to resume seamlessly without losing progress, even in the face of server instability (e.g., on H100 servers). This ensures consistent and efficient handling of large-scale video data, significantly improving the overall training process.
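A minimal sketch of such an interval-driven, resumable job loop is shown below. The manifest format, the interval value standing in for m, and the per-video processing stub are assumptions used for illustration, not the pipeline's actual implementation.

```python
# Interval-driven, resumable processing loop: completed videos are recorded in a
# manifest so work resumes cleanly after an interruption.
import json
import time
from pathlib import Path

MANIFEST = Path("progress_manifest.json")   # records which videos are already done
PENDING_DIR = Path("incoming_videos")
INTERVAL_MINUTES = 10                       # stands in for "m"; placeholder value

def load_done() -> set[str]:
    return set(json.loads(MANIFEST.read_text())) if MANIFEST.exists() else set()

def mark_done(done: set[str], name: str) -> None:
    done.add(name)
    MANIFEST.write_text(json.dumps(sorted(done)))  # persist after every video -> safe resume

def process_video(path: Path) -> None:
    # Placeholder for the full pipeline: scene cutting, aesthetic / optical flow /
    # OCR scoring, scene classification, and filtering.
    ...

def run_forever() -> None:
    while True:
        done = load_done()
        for video in sorted(PENDING_DIR.glob("*.mp4")):
            if video.name in done:
                continue            # already processed before a previous interruption
            process_video(video)
            mark_done(done, video.name)
        time.sleep(INTERVAL_MINUTES * 60)

if __name__ == "__main__":
    run_forever()
```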