Optimizing Generative AI: The Role of Data Curation
Exploring the pivotal role of data curation in Generative AI: A deep dive into experiments with diffusion models. Discover the balance between data quality and model efficacy.
In the field of artificial intelligence, the emphasis has often been on amassing more data. But as generative AI models, especially in computer vision, gain prominence, the focus is shifting towards the quality of data over its quantity. The move towards self-supervised learning methods seemed to increase the need for volume. However, our research into data curation for generative AI suggests otherwise.
This article delves into the role of curated data in generative AI and aims to address a central question: How does data curation impact the optimization of generative AI models in computer vision?
Data in AI Training: Evolution and Implications
Deep learning’s advancement is fundamentally tied to the data it consumes. Traditionally, vast data volumes were believed to optimize model performance, prompting a race to acquire and deploy as much data as possible. However, recent shifts towards self-supervised learning, as seen in foundation models such as CLIP (Radford et al., 2021) and LLaMA (Touvron et al., 2023), challenge this belief. These models are trained on vast corpora, comprising millions of images or billions of tokens, far beyond what human annotation can cover.
Yet, a growing body of research suggests that sheer volume isn’t the sole key to success. Papers like LLaMA (Touvron et al., 2023), DINOv2 (Oquab et al., 2023), LLaVA (Liu et al., 2023), and the more recent Emu (Dai et al., 2023) and MetaCLIP (Xu et al., 2023) all indicate a consistent pattern:
Models can achieve superior performance when fine-tuned or trained from scratch on smaller but high-quality datasets.
For instance, the PixArt-α (Chen et al., 2023) model highlights that improved captions in image-text pair datasets notably enhance vision-language models.
Given this backdrop, our investigation centers on the impact of data curation methods on the training of generative AI models, specifically diffusion models. By scrutinizing data curation’s role in this domain, we aim to provide a more nuanced understanding of optimizing AI training.
We begin with experiments on data curation for generative AI models in the computer vision domain. More specifically, we try to answer the question of which data curation methods have the biggest impact on training high-quality diffusion models.
When we set out with these experiments, we had the following hypotheses:
- Generative models such as GANs and diffusion models benefit from diverse training data
- Outliers harm the training process of generative models, as it is inherently difficult to learn concepts from only a few examples
Significance of Data Curation in AI Training
The efficacy of a generative model is heavily contingent on the data it’s fed. Data curation emerges as an essential process here, primarily for two reasons:
- Quality over Quantity: As suggested by papers like LLaMA, DINOv2, and LLaVA, superior model performance can often be achieved with smaller, high-quality datasets rather than with massive, uncurated ones. Data curation ensures that training datasets are devoid of noise, irrelevant instances, and duplications, thus maximizing the efficiency of every training iteration.
- Guided Data Distribution: In the absence of data curation, we remain at the mercy of raw datasets, with limited control over data distribution. Data curation allows for a nuanced selection of data points, ensuring the models aren’t skewed by biases or disproportionate representation. Techniques ranging from simple deduplication to sophisticated data selection algorithms can be employed to fine-tune this distribution, ensuring the trained models behave predictably and effectively.
Experiments: Probing Data Curation’s Impact on Generative AI Models
In this section we describe the experiments as well as the evaluation protocol we used.
Dataset and Preprocessing
We employed the Virtual Tryon Dataset for our experiments, consisting of 11,647 images. To ensure model compatibility and maintain data quality, we subjected the images to the following preprocessing steps (a minimal code sketch follows the list):
- Center cropping
- Resizing to a resolution of 128x128 pixels.
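A minimal preprocessing sketch, assuming PIL and torchvision (the exact pipeline we used may differ):

```python
from PIL import Image
from torchvision.transforms import functional as F

def preprocess(path: str) -> Image.Image:
    """Center-crop an image to a square and resize it to 128x128 pixels."""
    image = Image.open(path).convert("RGB")
    side = min(image.size)                      # length of the shorter side
    image = F.center_crop(image, [side, side])  # square center crop
    return F.resize(image, [128, 128])          # downscale to training resolution
```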
Model Architecture and Training Parameters
Our experiments centered on the Denoising Diffusion Probabilistic Model (DDPM) sourced from this GitHub repository. Key training parameters include (a short training sketch follows the list):
- Batch size: 32
- Training duration: Approximately 12 hours on a single RTX 4090 GPU
- Iterations: 70,000 steps for the full training set (~11k images) and 20,000 steps for all 1k subset experiments.
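As a rough sketch of the training setup, assuming the DDPM implementation is the widely used lucidrains/denoising-diffusion-pytorch package (an assumption on our part; the dataset path is hypothetical and argument names can differ between versions):

```python
# Assumes the lucidrains/denoising-diffusion-pytorch package.
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(dim=64, dim_mults=(1, 2, 4, 8))
diffusion = GaussianDiffusion(model, image_size=128, timesteps=1000)

trainer = Trainer(
    diffusion,
    "data/virtual_tryon_subset",  # hypothetical folder with preprocessed images
    train_batch_size=32,          # batch size used in our experiments
    train_num_steps=20_000,       # 70_000 for the full ~11k-image training set
)
trainer.train()
```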
Embeddings and Sampling
We leveraged CLIP ViT-B/32 embeddings for subsampling and DINOv2 ViT-L/14 embeddings for our metrics, providing a comprehensive evaluation framework for our generative outputs.
We deliberately use two different embedding models for sampling and for metrics to keep the evaluation metrics more independent of the data subsampling method.
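A minimal sketch of how such embeddings can be computed, assuming the open_clip package for CLIP and the official DINOv2 torch.hub entry point (both are assumptions; any equivalent implementation works the same way):

```python
import torch
from PIL import Image
import open_clip

# CLIP ViT-B/32: embeddings used for data subsampling.
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
clip_model.eval()

# DINOv2 ViT-L/14: embeddings used for the evaluation metrics.
dino_model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
dino_model.eval()

@torch.no_grad()
def clip_embedding(path: str) -> torch.Tensor:
    image = clip_preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    features = clip_model.encode_image(image)
    return features / features.norm(dim=-1, keepdim=True)  # L2-normalized
```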
We evaluate the following sampling methods in our experiments:
- Random: We randomly subsample 1,000 images from the full training set
- Coreset: We use the Coreset algorithm to find the 1,000 most diverse images based on their CLIP embeddings.
- Typicality: We use a mix of diversity and cluster density to subsample 1,000 images.
The big difference between Coreset and Typicality is that Coreset also includes all the outliers, as they lie far away from the cluster centers; it ignores the density of the data distribution. At the same time, Coreset does not select near-duplicates, as they would lie too close to each other.
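For illustration, here is a minimal sketch of a greedy k-center (Coreset) selection on precomputed embeddings; this is a simplified version, not necessarily the exact implementation we used:

```python
import numpy as np

def kcenter_greedy(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Greedy k-center selection: repeatedly pick the sample farthest away
    from everything selected so far (maximizes diversity, keeps outliers)."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # random starting point
    # distance of every sample to its nearest selected sample
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < k:
        idx = int(np.argmax(min_dist))  # farthest remaining sample
        selected.append(idx)
        new_dist = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected
```

In our experiments, k is 1,000 and the embeddings are the CLIP ViT-B/32 features described above.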
Typicality, on the other hand, tries to find samples that lie in dense regions (i.e., have many similar samples) while still keeping a minimum distance between the selected samples. This approach selects neither outliers nor near-duplicates.
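A toy sketch of a typicality-style selection, ranking samples by local density and enforcing a minimum distance between selected samples; the density estimate and threshold below are illustrative assumptions, not our exact algorithm:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def typicality_select(embeddings: np.ndarray, k: int,
                      n_neighbors: int = 20, min_dist: float = 0.1) -> list[int]:
    """Pick dense ('typical') samples while keeping selected samples apart."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(embeddings)
    dists, _ = nn.kneighbors(embeddings)
    density = 1.0 / (dists[:, 1:].mean(axis=1) + 1e-8)  # skip self-distance
    selected: list[int] = []
    for idx in np.argsort(-density):  # densest samples first
        if len(selected) == k:
            break
        if selected:
            gap = np.linalg.norm(embeddings[selected] - embeddings[idx], axis=1)
            if gap.min() < min_dist:
                continue  # too close to an already selected sample
        selected.append(int(idx))
    return selected
```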
Metrics and Evaluation
Consistency in evaluation is paramount. We adopted the following metrics:
- FID (Fréchet Inception Distance, Heusel et al., 2017): A widely accepted metric for evaluating generative models.
- Precision & Recall (Kynkäänniemi et al., 2019): To assess model accuracy and its ability to capture data distribution.
All models were evaluated on 10,000 sampled images. We report FID values (mean ± std) over two seeds to ascertain model reliability.
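Since the Fréchet distance underlying FID can be computed on any embedding space (we use DINOv2 features, as described above), here is a minimal sketch of the computation, assuming numpy and scipy:

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets,
    e.g. embeddings of 10,000 real and 10,000 generated images."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerics
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```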
Results
In total, we trained 8 different diffusion models (4 experiments with two seeds each). We first compare the metric results for the different subsampling methods.
Metrics-based Evaluation
We evaluate FID (lower is better), precision (higher is better) and recall (higher is better). The following plots show the mean and standard deviation of the results.
Looking at the plots, there is a clear correlation between lower FID and higher precision & recall values. This is expected, as all three metrics try to capture the quality of the generated data distribution. Interestingly, the Coreset method performs worst, with the highest FID and the lowest precision & recall values. Our assumption that many edge cases disturb the training process therefore seems valid. We suggest further research in this direction to validate this claim and these preliminary results.
What is surprising is that the Typicality data selection method outperforms the random subsampling approach. At first glance, random sampling should perform best, as it exactly matches the training distribution of the full set. However, these first results indicate that balancing diversity and typicality can benefit the training process of the models.
We also put these experiments in perspective by comparing against a model trained on the full dataset of 11,647 training images.
We can conclude the following ranking of the models:
- 1st place: The full dataset (11,647 images)
- 2nd place: The 1,000 images subsampled with Typicality
- 3rd place: The 1,000 images subsampled randomly
- 4th place: The 1,000 images subsampled with Coreset
Human Study Evaluation
To further assess the perceived quality of the different models, we also conducted a user study. The goal is to evaluate which subsampling method creates the best results based on human perception. Furthermore, we want to assess how well FID, precision, and recall metrics reflect the human rating.
We followed recent papers like the DALL-E 3 research paper (2023) and set up an evaluation pipeline with the following properties:
- We sampled 9600 images per model (random, coreset, typicality).
- We presented two random images from two different models to the user and asked them to vote either for one of the images if there was a clear preference, or for “not sure” if there was no clear preference.
- We evaluated the win rate of the different models to quantify human preference (a small sketch of this computation follows the list).
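A small sketch of how pairwise votes can be turned into per-model win rates; the vote format below is an illustrative assumption:

```python
from collections import Counter

def win_rates(votes: list[tuple[str, str, str]]) -> dict[str, float]:
    """Each vote is (model_a, model_b, choice), where choice is one of the
    two model names or "not sure". Win rate = wins / comparisons shown."""
    wins: Counter = Counter()
    shown: Counter = Counter()
    for model_a, model_b, choice in votes:
        shown[model_a] += 1
        shown[model_b] += 1
        if choice in (model_a, model_b):
            wins[choice] += 1
    return {model: wins[model] / shown[model] for model in shown}
```

For example, `win_rates([("typicality", "random", "typicality"), ("coreset", "random", "not sure")])` yields a win rate per model, with "not sure" votes counting as a shown comparison but no win.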
We used GenAIRater, a new web application developed specifically for this kind of human evaluation. We are still accepting additional votes under the following link :)
The results from the user study can be summarized in the following ranking for the different models:
- 1st place: The 1,000 images subsampled with Typicality
- 2nd place: The 1,000 images subsampled with Coreset
- 3rd place: The 1,000 images subsampled randomly
Visual Comparison
When working with generative AI models, we can’t rely purely on metrics. Metrics help us assess model convergence and determine the diversity of the generated samples, but a final visual quality check on randomly sampled images is good practice. A user-based rating, where several participants rate different model outputs, helps identify which model actually creates the best data.
In the following, we compare a random batch of 25 images sampled from the last checkpoint of each diffusion model training run. For the full training dataset (11,647 images) we sample the images after 70,000 training steps. For the 1k training subsets we sample after 20,000 steps, as we noticed that the FID scores, as well as the other metrics, had converged by then.
After carefully inspecting the various generated images, we can draw the following conclusions:
- 1st place: Generated images from the model trained on the full dataset (11,647 images) show the greatest diversity and quality
- 2nd place: The 1,000-image Typicality subset strikes a good balance between keeping diversity and generating realistic humans
- 3rd place: With 1,000 randomly subsampled training images, both sample diversity and quality are slightly worse
- 4th place: The 1,000-image Coreset subset tries to keep a higher diversity but fails to generate realistic humans
Conclusion
In this post, we looked at various ways to subsample datasets for training generative AI models. Based on our preliminary results, there seems to be a similar trend for training diffusion models as with other deep learning methods: the old myth that the more data you have, the better the model becomes does not hold true.
Igor Susmelj,
Co-Founder Lightly