A Guide for Active Learning in Computer Vision
Learn how active learning can be used to build a data flywheel where only the data that actually matters gets labeled and used for training.
Before jumping right into the steps to select data using active learning, let’s take a look at what active learning actually is.
What is Active Learning?
Active learning is a research field in machine learning (ML) that aims to reduce the cost and time of building new machine learning solutions by querying the next data for your pipeline in an intelligent manner. When developing new AI solutions and working with unstructured data such as images, audio, or text, we often need the data to be annotated by humans before we can use it to train our models. This annotation process is very time-consuming and expensive, and it is typically one of the biggest bottlenecks in modern ML teams.
In our journey at Lightly, we have talked to over 200 ML teams in the computer vision field. Most don’t use sophisticated active learning strategies yet and rely on random selection. Selecting data randomly has the advantage that it does not change the distribution of the data. However, this only helps if the input data already matches the distribution you actually care about.
Different Active Learning Approaches
When doing active learning, we typically use the predictions of a model. Whenever your model makes a prediction, you also get the associated probability of that prediction. Since models are inherently bad at knowing their own limits, research has come up with tricks to overcome this. We could, for example, consider not just a single model but a group of models (an ensemble). This gives us more information about the actual model uncertainty: if all models in the group agree on a prediction, the uncertainty is low; if they disagree, the uncertainty is high. But running multiple models is very expensive. Papers like “Training Data Subset Search with Ensemble Active Learning” (2020) use between 4 and 8 different models for their ensembles.
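As a rough sketch of this idea (assuming we already have each member’s softmax outputs as a NumPy array; the function name and the toy data are illustrative, not taken from the paper), disagreement can be measured as the variance of the predicted probabilities across ensemble members:

```python
import numpy as np

def ensemble_uncertainty(member_probs: np.ndarray) -> np.ndarray:
    """Disagreement score from an ensemble of classifiers.

    member_probs has shape (n_models, n_samples, n_classes) and holds
    each member's softmax output. We return one score per sample: the
    variance across members, averaged over the classes. High variance
    means the members disagree, i.e. the uncertainty is high.
    """
    return member_probs.var(axis=0).mean(axis=-1)

# Toy example: 4 models, 3 samples, 2 classes.
probs = np.random.dirichlet(alpha=[1.0, 1.0], size=(4, 3))
print(ensemble_uncertainty(probs))  # one score per sample
```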
We can be more efficient by using Monte Carlo dropout, where we add dropout between the last layers of our model. This allows us to use a single model to create multiple stochastic predictions, similar to using a model ensemble. However, it has the downside that we need to change the model architecture and add dropout layers.
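A minimal PyTorch sketch of this trick (assuming the model already contains nn.Dropout layers; mc_dropout_predict is our own helper name, not a library function):

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_passes: int = 10) -> torch.Tensor:
    """Run several stochastic forward passes with dropout kept active.

    Returns a tensor of shape (n_passes, batch_size, n_classes); the
    spread of these predictions approximates the model's uncertainty.
    """
    model.eval()
    # Switch only the dropout layers back to train mode, so that e.g.
    # batch norm statistics stay frozen.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
    with torch.no_grad():
        return torch.stack([model(x).softmax(dim=-1) for _ in range(n_passes)])
```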
Using Embeddings in Active Learning
Recently, papers have also started using embeddings. Embeddings give us a sense of how similar different samples are; in computer vision, we could for example use them to find similar images or even similar objects. We can then use a distance metric such as Euclidean distance or cosine similarity in the embedding space and combine it with the uncertainty of the prediction.
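One possible way to do this, as a sketch (we assume the embeddings are already computed as NumPy arrays; the function name and the 50/50 weighting are illustrative choices):

```python
import numpy as np

def distance_to_labeled(unlabeled_emb: np.ndarray, labeled_emb: np.ndarray) -> np.ndarray:
    """Cosine distance of each unlabeled embedding to its nearest labeled one.

    Both inputs have shape (n, dim). Samples that are far away from
    everything we already labeled add the most diversity.
    """
    # Normalize rows so that the dot product equals cosine similarity.
    u = unlabeled_emb / np.linalg.norm(unlabeled_emb, axis=1, keepdims=True)
    lab = labeled_emb / np.linalg.norm(labeled_emb, axis=1, keepdims=True)
    similarity = u @ lab.T               # shape (n_unlabeled, n_labeled)
    return 1.0 - similarity.max(axis=1)  # distance to the nearest labeled sample

# Combine with an uncertainty score, e.g. as an equal-weight sum:
# combined = 0.5 * uncertainty_scores + 0.5 * distance_to_labeled(u_emb, l_emb)
```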
Using embeddings and predictions from the very same model, however, has the drawback that both rely on the same features learned by that model. Typically, the embeddings are the output of the model one or a few layers before the predictions. To overcome this limitation, we started using embeddings from models other than what we call the “task” model. The task model here is the actual model you would like to improve using active learning.
Our own benchmarks and experience working with dozens of companies across autonomous driving, satellite imagery, robotics, and video analytics suggest that models trained with self-supervised learning have the most robust embeddings. Recent models such as CLIP or SEER both use self-supervised learning. We already summarized in another blog post that these self-supervised models are more robust and fair.
What can I expect when using Active Learning?
First, be aware that active learning is a tool, and as with most other tools you use, you will have to fine-tune some parameters to get the maximum value out of it. After extensive research and attempts to replicate many papers from recent active learning research, we observed that these basic rules seem to hold for what we consider to be “good” training data:
- Choose diverse data — having diverse data (diverse images, diverse objects) is the single most important factor
- Balance your dataset — Make sure the data is balanced across your modalities (weather, gender, time of day)
- Don’t worry too much about model architecture — Based on our own experiments, it looks like good data for a large ViT model is also helpful for a small ResNet
The first two points suggest that we should aim to get diverse data from all modalities, and in equal amounts. The third point is good to know: it means that we can select training data with a model today and still reuse the same data in a year when we train a completely new model. Please note that these are just observations. If implemented correctly, active learning can improve model accuracy significantly.
We evaluated the performance of combining active learning, diversity, and balanced selection on the task of detecting problems in salmon filets. The goal was to improve model accuracy for the “Hematoma” class, as this is the most crucial one.
The company started with 20'000 images and had the budget to select 1'000 new images, once using their existing method (random sampling) and once using a more sophisticated approach. Using a combination of diversity, prediction uncertainty, and class balancing as part of their active learning strategy, the company was able to improve the F1 score for that crucial class by almost 100% compared to random selection. The overall F1 score (“General”) increased by 10% compared to randomly selecting images.
Use Active Learning in your next Computer Vision Project
You have two options here. Either you start implementing your own framework based on papers and GitHub repositories, or you use an existing active learning solution like LightlyOne.
Implement Active Learning Algorithms from Scratch
Let’s start with implementing active learning yourself. In its simplest form, we can just focus on the predictions of our model: create predictions for a new unlabeled dataset, compute a score per sample, and sort all samples based on that score.
The advantage of this approach is that it requires only a little work. We can use a simple entropy scorer that looks at the prediction entropy. To compute the entropy of a discrete random variable, we can use the following formula:

H(X) = −Σᵢ p(xᵢ) log p(xᵢ)
In Python, a minimal version of this entropy scorer could look like the following snippet (a sketch using NumPy on a batch of softmax probabilities):
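```python
import numpy as np

def entropy(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy of each row of class probabilities.

    probs has shape (n_samples, n_classes), and each row sums to 1.
    Higher entropy means the model is less certain about that sample.
    """
    # eps avoids log(0) for classes with zero probability.
    return -(probs * np.log(probs + eps)).sum(axis=1)

probs = np.array([[0.9, 0.1],    # confident prediction -> low entropy
                  [0.5, 0.5]])   # maximally uncertain  -> high entropy
print(entropy(probs))  # approx. [0.33, 0.69] (natural log)
```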
Now we have a single score per sample. How about doing the same for object detection? We can compute the entropy for each prediction and then aggregate the scores per image, since we are interested in ranking the images for labeling.
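One possible aggregation, reusing the entropy function from the snippet above (taking the maximum entropy is our choice here; the mean over detections is a common alternative):

```python
import numpy as np

def image_score(detection_probs: list[np.ndarray]) -> float:
    """Aggregate per-detection entropies into a single score per image.

    detection_probs holds one (n_classes,) probability vector per
    detected object. We rank an image by its most uncertain object.
    """
    if len(detection_probs) == 0:
        return 0.0  # no detections -> nothing to rank this image on
    per_detection = entropy(np.stack(detection_probs))
    return float(per_detection.max())
```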
We can then also create a scorer that uses embeddings to account for image diversity, and another one to match our desired distribution of the metadata. As we go further, we discover a few things:
- we have to write a new scorer for every new input type or task we want to solve
- we need a way to easily switch between scorers, and to keep track of which ones work best
- we need a scalable solution, as we might have far more unlabeled data at hand
- we need strategies to combine different scorers (a simple sketch follows below)
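One simple way to combine scorers, as a sketch: min-max normalize each score to [0, 1] and take a weighted sum (the weights are hyperparameters you would have to tune):

```python
import numpy as np

def combine_scores(scores: dict[str, np.ndarray], weights: dict[str, float]) -> np.ndarray:
    """Weighted combination of several per-sample score arrays.

    Each score is min-max normalized to [0, 1] first, so that scorers
    living on different scales can be mixed with simple weights.
    """
    combined = None
    for name, raw in scores.items():
        norm = (raw - raw.min()) / (raw.max() - raw.min() + 1e-12)
        term = weights[name] * norm
        combined = term if combined is None else combined + term
    return combined

# Hypothetical usage: rank samples from highest to lowest combined score.
# ranking = np.argsort(-combine_scores(
#     {"entropy": entropy_scores, "diversity": diversity_scores},
#     {"entropy": 0.7, "diversity": 0.3},
# ))
```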
Use Active Learning Solutions
Doing active learning yourself from scratch becomes its own engineering project. We also want to make sure that the algorithms actually work, and when new papers with even more promising methods appear, we want to include and benchmark them.
Finally, we also want metrics to evaluate the selected data before spending a ton of money on the data labeling process.
Instead of building your own active learning solution, you could use a platform like LightlyOne. The platform can help you process large amounts of unlabeled data with sophisticated data selection algorithms and without sharing your data. LightlyOne is used by leading machine learning teams that want to build an automated data flywheel that allows them to scale operations without having to build their own tools or grow their operations.
And here’s the best thing. You can even try it out for free!
Igor Susmelj,
Co-Founder Lightly