Data Curation Demystified for Stable Video Diffusion

Stable Video Diffusion showcases the critical role of data curation in developing state-of-the-art video generation models.

Stability AI recently unveiled its latest model, Stable Video Diffusion (SVD), on November 21, 2023. This breakthrough in video generation models hinges on the pivotal role of data curation. Along with the model checkpoints, they also published a technical report. Let’s dive into this new video data curation approach, guided by Stability AI’s technical report and some compelling example videos.

Since the findings from this report focus on the data curation part, they can be paired with other ongoing research focusing on model architectures or training and inference approaches, such as Make Pixels Dance (2023), which was released just a few days earlier.

Why is Data Curation Crucial?

In AI, data quality often trumps quantity. Stability AI’s research stresses the need for high-quality data: removing lower-quality samples boosts model performance. A striking example is seen in Figure 3b of their report, where a dataset four times smaller yet carefully curated was preferred for prompt alignment and overall quality over a larger, randomly sampled set.

Figure 3b from the Stable Video Diffusion report shows that, despite being 4x smaller, the curated video dataset results in a model preferred by human raters.

LVD-10M contains 10 million randomly subsampled videos, whereas LVD-10M-F contains around 2.5 million curated videos. Even though the curated set is 4x smaller, the user study shows that in terms of both prompt alignment and quality, the users prefer the model trained on the curated data.

Let’s dive into the key components of how they were able to curate the video data so well!

Video Data Curation: Key Components

The Stable Video Diffusion technical report describes the following five key components as part of their video data curation pipeline:

  • Detecting Scene Cuts: To avoid misleading the AI with edited videos containing multiple scenes, a mechanism to detect and separate scene cuts is employed. This ensures accurate scene depiction in training.
  • Synthetic Captioning: Leveraging Google Research’s CoCa model, captions are generated for video clips, crucial for text-conditioned video generation.
  • Movement Detection Using Optical Flow: This technique captures motion in video clips, a vital aspect in filtering out static videos.
  • Text Overlay Detection with OCR: Identifying and removing clips with excessive text overlays maintains the training focus on video content rather than textual distractions.
  • CLIP-based Scoring: This assesses the aesthetic appeal and text-image alignment, further refining the dataset.

Core components of the video data curation pipeline for Stable Video Diffusion. From left to right: Cut detector, Caption summary generation, Optical flow for motion estimation, OCR detection, alignment with CLIP features for aesthetics, and the summary.

We will now go through the individual components and explain them in detail.

Detecting Scene Cuts

When working with videos from the web, there is a high chance you end up with videos that have been edited and contain several clips merged together. Think of a movie scene where the camera jumps from one actor to another. Scene cuts are not bad per se, but we have to deal with them properly during the training of generative models. The situation we want to prevent is treating several clips as a single clip just because they are all part of the same video. This could result in a single caption describing completely different scenes (think of a fail compilation video: https://www.youtube.com/watch?v=IOwLVfO_xZM). The model would get confused during training, as it would have to generate several different scenes from one unrelated caption.

The advantage of the cascaded video cut detection. Figure 11 from the Stable Video Diffusion report.

To mitigate this problem, Stable Video Diffusion proposes a mechanism to detect scene cuts and treat them as individual clips further down the processing pipeline.

Importance of cut detection and movement detection using optical flow. From the Stable Video Diffusion technical report.

An important aspect of the cut detection described in the report is that it is “cascaded”: running the detector at different frame rates also helps capture “slow” changes, such as when two clips are blended during a transition.
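
The report does not ship the detector itself, but the idea can be illustrated with a minimal sketch: compute a frame-difference score with OpenCV and evaluate it at several frame strides to mimic the cascade. The strides and the threshold below are illustrative assumptions, not the values used in the paper.

```python
import cv2


def detect_cuts(video_path, strides=(1, 2, 4), threshold=0.5):
    """Flag frame indices where two sampled frames differ strongly.

    Running the comparison at several strides mimics the cascaded
    detection, which also catches slow transitions such as fades.
    Strides and threshold are illustrative, not the paper's values.
    """
    cuts = set()
    for stride in strides:
        cap = cv2.VideoCapture(video_path)
        prev_hist, idx = None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % stride == 0:
                hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
                hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
                cv2.normalize(hist, hist)
                if prev_hist is not None:
                    # Bhattacharyya distance close to 1 => very different frames
                    dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
                    if dist > threshold:
                        cuts.add(idx)
                prev_hist = hist
            idx += 1
        cap.release()
    return sorted(cuts)
```

In a real pipeline, the detected cut positions would be used to split the video into individual clips before captioning and filtering.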

Synthetic Captioning

To generate videos conditioned on text, we need captions or summaries describing the content of the video clips to train our model on. The authors used CoCa (2022) to create captions for the middle frame of each clip.

CoCa is a paper from Google Research that builds on methods such as CLIP. Whereas CLIP trains an image encoder and a separate text encoder to put image-text pairs in the same embedding space, CoCa additionally tries to reconstruct the original caption based solely on the image features. Think of this as CLIP + captioning loss. CLIP itself cannot be used to obtain the caption of an image; we would need to go backwards through the model (from the image embedding to the text input).

Illustration of the CoCa learning process. The image encoder + unimodal text decoder and contrastive loss are basically the CLIP paper. CoCa adds an additional multimodal text decoder + captioning loss on top of CLIP.

A trained CoCa model, on the other hand, can use the additional text decoder to create captions. Since CoCa works on single frames, the authors of Stable Video Diffusion also use VideoBLIP (modified BLIP-2 code) to create additional captions using the first, the middle, and the last frame.

Finally, the authors use an LLM (not further specified in the report) to take both captions (CoCa and VideoBLIP) and create a final summary caption for each video clip.
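
As a rough sketch of this captioning step, the snippet below extracts the first, middle, and last frame with OpenCV and captions them with an off-the-shelf image-captioning model from Hugging Face (a BLIP checkpoint used here purely as a stand-in for CoCa/VideoBLIP). The final LLM summarization is replaced by a simple join, since the paper does not specify the model used.

```python
import cv2
from PIL import Image
from transformers import pipeline

# Stand-in for CoCa / VideoBLIP: any image-captioning model works for the sketch.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")


def frame_at(cap, index):
    """Read a single frame and convert it to a PIL image."""
    cap.set(cv2.CAP_PROP_POS_FRAMES, index)
    ok, frame = cap.read()
    return Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)) if ok else None


def caption_clip(video_path):
    cap = cv2.VideoCapture(video_path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # First, middle, and last frame, as described in the report
    frames = [frame_at(cap, i) for i in (0, n // 2, n - 1)]
    cap.release()
    captions = [captioner(f)[0]["generated_text"] for f in frames if f is not None]
    # The paper merges the captions with an LLM; here we simply join them.
    return " / ".join(captions)
```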

Detecting Still Videos

The web is full of videos that are basically still images with an audio track. Lots of video clips on YouTube, such as this example (Fragments of Time — Ben Böhmer), show a static image without any movement.

An additional issue with having still videos in the training set is that the model might not be able to decide when to generate videos with motion and when to generate still videos.

Example output of an optical flow algorithm. For features in the image (here a grid pattern), the algorithm tries to find the movement between two frames. The arrows show the direction of the “optical flow” between two frames. Source: https://www.edge-ai-vision.com/2019/03/an-introduction-to-the-nvidia-optical-flow-sdk/

An easy way to detect motion in videos is to check how much the pixels change from frame to frame. Optical flow is a research field that addresses exactly this challenge: optical flow methods try to represent the motion of parts of the frame between two frames. In the Stable Video Diffusion report, the authors calculate the average optical flow between two frames, which yields a score for every video clip indicating how much movement happens on average.
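
A minimal version of such a motion score can be computed with OpenCV's dense (Farneback) optical flow; the frame stride below is an illustrative choice, since the paper computes flow at a reduced frame rate that is not reproduced here.

```python
import cv2
import numpy as np


def motion_score(video_path, sample_every=12):
    """Average optical-flow magnitude over a clip.

    Scores close to zero indicate an essentially still video.
    The frame stride is illustrative, not the paper's setting.
    """
    cap = cv2.VideoCapture(video_path)
    prev_gray, magnitudes, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                # Dense Farneback flow between consecutive sampled frames
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                magnitudes.append(np.linalg.norm(flow, axis=2).mean())
            prev_gray = gray
        idx += 1
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0
```

Clips whose score falls below a chosen threshold would then be removed as "still" videos.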

Caption similarities and Aesthetics

To further improve the dataset, the Stable Video Diffusion pipeline uses CLIP embeddings of the captions (the output of the LLM in the synthetic captioning step) as well as of the first, middle, and last video frames. The similarity between the caption and the frame embeddings is used to verify that they match. An additional aesthetics score classifies frames as visually appealing; it is obtained by fitting a linear layer on top of the CLIP features, as outlined in the LAION-5B paper.
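
A sketch of these CLIP-based checks, using the Hugging Face CLIP implementation, could look as follows. The text-frame similarity is the cosine similarity between caption and frame embeddings; the aesthetic predictor is a linear head on top of the image features, shown here as an untrained placeholder since the LAION aesthetic weights are not included.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder for the LAION-style aesthetic predictor: a linear layer on
# top of CLIP image features (pretrained weights not included here).
aesthetic_head = torch.nn.Linear(model.config.projection_dim, 1)


@torch.no_grad()
def clip_scores(caption, frames):
    """frames: list of PIL images (e.g. first, middle, and last frame)."""
    inputs = processor(text=[caption], images=frames,
                       return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    alignment = (img_emb @ txt_emb.T).mean().item()     # caption-frame similarity
    aesthetics = aesthetic_head(img_emb).mean().item()  # placeholder aesthetic score
    return alignment, aesthetics
```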

Text Detection

Videos from the web can contain lots of text overlays. This text can disturb training if it is not explicitly captured in the captions. The authors decided to remove video clips above a certain threshold of text content. For that, they used an off-the-shelf text detector called CRAFT and ran it on the first, middle, and last frame of each video clip.

The CRAFT detector detects individual character regions, which are then post-processed to obtain bounding boxes.
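
As a sketch, EasyOCR (which uses the CRAFT detector for its text-detection stage) can estimate how much of a frame is covered by text; clips whose frames exceed a chosen coverage threshold would then be dropped. The threshold below is an illustrative assumption.

```python
import easyocr

reader = easyocr.Reader(["en"])  # uses CRAFT internally for text detection


def text_coverage(frame):
    """Fraction of the frame area covered by detected text boxes (frame: numpy array)."""
    h, w = frame.shape[:2]
    covered = 0.0
    for box, _text, _conf in reader.readtext(frame):
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        covered += (max(xs) - min(xs)) * (max(ys) - min(ys))
    return covered / (h * w)


# Example filtering rule over the first, middle, and last frame
# (the 5% threshold is an illustrative choice, not the paper's value):
# keep_clip = all(text_coverage(f) < 0.05 for f in (first, middle, last))
```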

Conclusion

The Stable Video Diffusion paper highlights the importance of data curation for generative models in the video generation space. The results outlined in the paper match those from other papers, as well as our own experiments, which we have summarised here.

Igor Susmelj,
Co-Founder Lightly