Active Learning Strategies Compared for YOLOv8 on Lincolnbeet
Learn how different data selection strategies impact model accuracy. We use the lincolnbeet dataset and YOLOv8 model for our experiments.
Agriculture is one of the domains that could benefit a lot from recent breakthroughs in computer vision. Having machines that can analyze millions of crops throughout the year to optimise yield and minimise the amount of pesticides required would have a big impact!
We take a closer look at one application of computer vision in agriculture: Using robots equipped with cameras to optimise precision spraying of weeds on large fields of crops. In this example we use the lincolnbeet dataset and set out with the goal of building a reliable computer vision system.
This showcase aims to illustrate how using a smart data selection strategy like active learning yields significant benefits compared to random selection. We show how ML teams can save up to 77% of labeling costs or improve the model by up to 14.6x per additional labeled batch when using active learning compared to random selection!
For benchmarking different data selection strategies, we will use LightlyOne, a scalable active learning solution that can be easily plugged into any existing computer vision pipeline. We showcase different built-in strategies to select data for the object detection task of the lincolnbeet dataset, evaluated using the YOLOv8 model. You can get started using LightlyOne for free.
Dataset
The lincolnbeet dataset consists of 4 402 full-HD images with a total of 39 246 objects. It contains two classes, sugar beet and weed plants, annotated with bounding boxes. The two classes are almost equally represented with 16 399 (42%) sugar beets and 22 847 (58%) weed plants. Since we have on average almost 10 objects per image at a rather high image resolution, the cost of annotating this data can be very high.
We use an active learning feedback loop where we iteratively label a bit of data, train a model, and then pick the next batch for labeling based on the model output. Our goal is to reach high accuracy with fewer than 400 annotated images.
- training set size: 3 089 images
- validation set size: 441 images
- test set size: 883 images
We analyze in more detail how different selection strategies can impact the selected data.
Experiments
For our baseline model, we pick 200 images randomly from the training set. It would also be possible to pick the initial 200 images using LightlyOne. But since our focus is on showing how the various selection strategies can improve an existing dataset, we fix the initial dataset to a random subset of 200 images.
We train the YOLOv8 model with two different seeds for all further experiments. The plots therefore also show the standard deviation across the two runs.
The exact code we use to train all of the YOLOv8 models can be found below. We train for 50 epochs with a batch size of 8. We additionally use random vertical flip (flipud) augmentation and increase the input image size to 960 pixels to work better on small objects. We keep these parameters the same and only vary the data (which training set we use) and seed parameter during our experiments.
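Below is a minimal sketch of that training call using the ultralytics Python API. The model variant yolov8s.pt and the dataset .yaml path are assumptions; the epochs, batch size, image size, and flipud augmentation follow the settings described above.

```python
from ultralytics import YOLO

# Start from a pretrained YOLOv8 checkpoint (the exact variant is an assumption).
model = YOLO("yolov8s.pt")

# Train on the current training subset. "lincolnbeet.yaml" is a placeholder
# for the dataset config that points to the train/val/test splits.
model.train(
    data="lincolnbeet.yaml",
    epochs=50,    # 50 epochs
    batch=8,      # batch size of 8
    imgsz=960,    # larger input resolution to better detect small objects
    flipud=0.5,   # random vertical flips (the probability is an assumption)
    seed=0,       # every experiment is repeated with a second seed
)
```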
To evaluate the model from a checkpoint, we can use YOLOv8's built-in validation mode, which is available from both the CLI and the Python API.
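Here is a sketch via the Python API; the checkpoint path is a placeholder and we assume evaluation on the test split:

```python
from ultralytics import YOLO

# Load the trained checkpoint (placeholder path) and run validation.
model = YOLO("runs/detect/train/weights/best.pt")

# Evaluate on the test split defined in the dataset .yaml.
metrics = model.val(data="lincolnbeet.yaml", imgsz=960, split="test")
print(metrics.box.map50, metrics.box.map)  # mAP50 and mAP50..95
```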
Let’s see how the baseline model performs. We show the initial performance of our baseline YOLOv8 model trained on 200 random images.
We also show some example images of what the model sees during training and the validation step. As you can see, the training images are heavily augmented. The default augmentations range from random color changes and resizing to mosaic augmentation and more.
For each of the experiments we perform the following steps:
1. Run LightlyOne to select a subset of the data based on the selection criteria
2. Update the YOLOv8 training set to use the newly selected data
3. Train the YOLOv8 model
4. Evaluate the YOLOv8 model
5. Repeat steps 3 and 4 using another seed
For step 2, we need a way to sync the selected data from LightlyOne with the dataset in the YOLO format. We can do this using the following code:
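Here is a minimal sketch, assuming the selected samples end up in a LightlyOne tag; the token, dataset ID, tag name, and file paths are placeholders:

```python
from lightly.api import ApiWorkflowClient

# Connect to the LightlyOne platform (token and dataset_id are placeholders).
client = ApiWorkflowClient(token="MY_TOKEN", dataset_id="MY_DATASET_ID")

# Fetch the filenames of the selected samples from the tag created by the run.
selected = set(
    client.export_filenames_by_tag_name("selected-400").splitlines()
)

# Keep only the selected images in the YOLOv8 training list.
with open("lincolnbeet/train_all.txt") as f:
    all_train_images = [line.strip() for line in f if line.strip()]

with open("lincolnbeet/train_selected.txt", "w") as f:
    for path in all_train_images:
        # Compare by filename, since LightlyOne stores filenames relative
        # to the datasource root.
        if path.split("/")[-1] in selected:
            f.write(path + "\n")
```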
Note that we only update the training set .txt file. YOLOv8 uses a .yaml file that specifies which splits are used for training and validation. We can simply copy this .yaml file and change only the training set entry.
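A short sketch of that step; the file names are placeholders and we assume the dataset config uses the standard YOLO keys:

```python
import yaml  # requires PyYAML

# Load the original dataset config.
with open("lincolnbeet/data.yaml") as f:
    data_cfg = yaml.safe_load(f)

# Point the training split to the newly created file list; val/test stay unchanged.
data_cfg["train"] = "train_selected.txt"

# Write a copy of the config used only for this experiment.
with open("lincolnbeet/data_selected.yaml", "w") as f:
    yaml.safe_dump(data_cfg, f)
```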
Random
As the name suggests, we simply select 400 images randomly in this experiment. You can find a LightlyOne selection config to pick 400 images randomly below.
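This is a sketch following the LightlyOne Worker selection config format as we understand it; the random seed is a placeholder:

```python
selection_config = {
    "n_samples": 400,  # total number of images to select
    "strategies": [
        {
            # Assign a random weight to every image, i.e. plain random selection.
            "input": {"type": "RANDOM", "random_seed": 42},
            "strategy": {"type": "WEIGHTS"},
        }
    ],
}
```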
ALL
ALL refers to the selection strategy used in the LightlyOne YOLOv7 tutorial. Note that all the other strategies we compare against are ablations of the ALL strategy. This strategy contains the following elements:
- We use embeddings to find diverse images. The embeddings are computed based on the cropped images using the bounding boxes of our YOLOv8 model.
- We train our embedding model directly on the cropped images and not on the full frames.
- We use balancing to get the target ratio of 30% sugar beet and 70% weed. We pick a 30/70 ratio because the initial model seems to struggle more on the weed objects and we want to oversample these cases.
- We use the predictions to prioritize images with many objects (crowded scenes) using the frequency scorer built into LightlyOne.
- We use the prediction probability (objectness least confidence) to prioritize images that are likely close to the decision boundary.
The detailed selection config we used for selecting 400 images is shown below. You will often see the task name “yolov8-random-200-detection” in the code; this is the prediction task containing the predictions of our YOLOv8 model trained on the initial 200 images.
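The sketch below combines the four elements described above. It follows the LightlyOne Worker config format as we understand it; the token and dataset ID are again placeholders, and the class names in the balancing target are placeholders that need to match the prediction schema.

```python
from lightly.api import ApiWorkflowClient

client = ApiWorkflowClient(token="MY_TOKEN", dataset_id="MY_DATASET_ID")

selection_config = {
    "n_samples": 400,
    "strategies": [
        {
            # Diversity on embeddings (computed on the object crops).
            "input": {"type": "EMBEDDINGS"},
            "strategy": {"type": "DIVERSITY"},
        },
        {
            # Balance the class ratio towards 30% sugar beet / 70% weed.
            # The class names are placeholders and must match the prediction schema.
            "input": {
                "type": "PREDICTIONS",
                "task": "yolov8-random-200-detection",
                "name": "CLASS_DISTRIBUTION",
            },
            "strategy": {
                "type": "BALANCE",
                "target": {"sugar_beet": 0.3, "weed": 0.7},
            },
        },
        {
            # Frequency scorer: prefer crowded scenes with many predicted objects.
            "input": {
                "type": "SCORES",
                "task": "yolov8-random-200-detection",
                "score": "object_frequency",
            },
            "strategy": {"type": "WEIGHTS"},
        },
        {
            # Objectness least confidence: prefer predictions close to the
            # decision boundary.
            "input": {
                "type": "SCORES",
                "task": "yolov8-random-200-detection",
                "score": "objectness_least_confidence",
            },
            "strategy": {"type": "WEIGHTS"},
        },
    ],
}

# Schedule the LightlyOne Worker run. The object_level entry makes the worker
# train and embed the embedding model on the predicted object crops instead
# of the full frames (keys as we understand the worker config).
client.schedule_compute_worker_run(
    worker_config={
        "enable_training": True,  # train the embedding model on the crops
        "object_level": {"task_name": "yolov8-random-200-detection"},
    },
    selection_config=selection_config,
)
```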
ALL-w/o-freq
Using the frequency scorer is a simple way to oversample crowded scenes, but it also has drawbacks: if we pay per single annotation, crowded scenes are more expensive to annotate. We therefore also evaluate the ALL strategy without the frequency scorer.
ALL-w/o-balancing
Using object predictions and their predicted classes, we can estimate the ratio between the classes in the not-yet-annotated dataset. We can use that information directly in LightlyOne to set target ratios. This can be very helpful if we care more about an underrepresented class than about a frequent one. For this dataset, we wanted to oversample weed using a 70/30 ratio. But this constraint also influences the overall selected dataset. This experiment removes the balancing goal.
ALL-train-on-images
Our base strategy, ALL, trains the embedding model directly on the object crops. What if we train on the full frames instead? Because we have roughly 10 objects per frame, we train for 10 epochs instead of one to make sure the model sees more or less the same number of samples during training.
ALL-train-on-images-w/o-freq
As before, we train on images instead of crops, but this time we also remove the frequency scorer.
ALL-train-on-images-img-embedding
Remember that we can change both how we train the embedding model and what we embed. We can train on full images or on crops, and we can compute embeddings on full images or on crops. In total, we have four options. As we don't expect to gain much from training on crops but embedding full frames (there would be lots of context missing), we cover the case of training on frames and embedding frames for the diversity criterion.
The following code implements training and embedding on images (and not crops):
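This is a sketch under the assumptions stated earlier: without an object_level entry the worker operates on the full frames, we raise the number of self-supervised training epochs to 10, and selection_config is the same config as in the ALL section.

```python
from lightly.api import ApiWorkflowClient

client = ApiWorkflowClient(token="MY_TOKEN", dataset_id="MY_DATASET_ID")

client.schedule_compute_worker_run(
    worker_config={
        # No object_level entry: the worker trains and embeds on full frames.
        "enable_training": True,
    },
    # Re-use the selection strategies from the ALL experiment.
    selection_config=selection_config,
    lightly_config={
        # Roughly 10 objects per frame, so train for 10 epochs on frames to see
        # about the same number of samples as one epoch on crops.
        "trainer": {"max_epochs": 10},
    },
)
```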
Metrics for Evaluation
For all experiments, we compute the mAP50 and mAP50..95 using the built-in metrics from YOLOv8. The metrics are computed similarly to the COCO benchmark.
We use the LightlyOne API client to fetch the filenames of the 400 samples selected with the corresponding selection strategy and keep only those filenames in the YOLOv8 training list.
Furthermore, we also report metrics such as the number of annotated objects in the selected training data. This is useful for computing the cost of the selected data and putting it in perspective with the gain in mAP.
We show the results for mAP50.
We also show results for mAP50..95:
Number of objects in the newly labeled set
We also analyze how many objects have been labeled based on the selection with the various methods. This can give us insights into the costs that arise with the gain in accuracy.
The following table shows the number of objects of the two classes.
We can also compute the gain in mAP of the different methods compared to our baseline experiment (randomly selecting 200 images). For example, randomly selecting 400 instead of 200 images yields an average gain of 0.25% in mAP50 and 0.85% in mAP50..95.
Finally, we can also compute the number of additionally annotated objects. For example, the ALL method resulted in 400 images and 12 059 objects being selected vs. 400 images and 3 472 objects selected with the random method. The gain in accuracy comes at a price; let's quantify it.
In the table below, you find the number of newly annotated objects that result in a gain of 1% in mAP50 or mAP50..95. Interestingly, most methods outperform random selection, meaning that we can save $$$ by using any of them. Another interesting insight is that the methods that include the frequency scorer are the most expensive ones. We get a great boost in mAP but also many new objects to label.
Assuming a cost of $0.05 per bounding box, we end up with different costs to increase our model performance by 1 mAP point. For example, the strategies without the frequency scorer are more cost-efficient since fewer objects are annotated; they can save up to 77% of labeling costs measured in mAP50 and up to 35% measured in mAP50..95.
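As a quick sketch, the cost per mAP point used in this comparison can be computed as follows; the example numbers are hypothetical, not the measured results:

```python
def label_cost_per_map_point(n_new_objects: int, delta_map: float,
                             cost_per_box: float = 0.05) -> float:
    """Annotation cost in $ to gain one mAP point with a given strategy."""
    return n_new_objects * cost_per_box / delta_map

# Hypothetical example: 3 000 newly labeled boxes for a +2.0 mAP50 gain.
print(label_cost_per_map_point(n_new_objects=3000, delta_map=2.0))  # 75.0 $
```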
Conclusion
In this post, we looked at different strategies supported by LightlyOne for selecting training data and their impact on model accuracy. We evaluated the methods in terms of accuracy (mAP) and cost efficiency with respect to annotation costs.
Igor Susmelj,
Co-Founder Lightly