Introducing What To Label - data preparation

This post was published on: https://medium.com/whattolabel/introducing-what-to-label-e0ca03a1c65c

Are you curious about research areas such as active, self-supervised, and semi-supervised learning and how we can optimize datasets rather than optimizing deep learning models? You’re in good company, and this blog post will tell you all about it!

In this post, you will learn more about our journey as an emerging company in this new field and what we have learned about why and how deep learning models can be improved by focusing on dataset optimization. Since plenty of blogs and tutorials already cover architecture search, hyperparameter optimization, and similar topics, we won’t talk about those here.


Illustration showing a common problem with data annotation: Where should I start?

Should I spend time on optimizing my dataset?

A lot of recent research has focused on fixing a dataset and treating it as a benchmark for various architectures or training and regularization strategies. Recent papers such as Billion-scale semi-supervised learning for image classification and Self-training with Noisy Student use pre-training on larger datasets to boost test accuracy on the famous ImageNet dataset. Still, the focus is on the architecture or the training procedure.

We propose another, less explored direction of research: fixing the architecture and training method while varying the training data.
Let me illustrate the process: a common dataset such as CIFAR-10 or ImageNet is selected, and the test and validation sets are fixed. In contrast to the approaches mentioned above, the focus here lies on how subsampling or oversampling the existing training set affects performance on the validation and test sets. In other words, we add or remove samples from the training set and then compare different sampling methods.
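To make this protocol concrete, here is a minimal sketch, not our actual pipeline, that uses scikit-learn’s digits dataset and a logistic-regression classifier as a lightweight stand-in for CIFAR-10 or ImageNet and a deep model. The model, its hyperparameters, and the test set stay fixed; only the training subset returned by a sampling strategy changes.

```python
# Minimal sketch of the evaluation protocol: model and training stay fixed,
# only the training subset changes. The digits dataset and logistic regression
# are toy stand-ins for a real image dataset and a deep network.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
# The test set is fixed once and never touched by any sampling strategy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

def evaluate_subset(indices):
    """Train the *same* model with the *same* hyperparameters on a subset."""
    model = LogisticRegression(max_iter=2000)
    model.fit(X_train[indices], y_train[indices])
    return model.score(X_test, y_test)

rng = np.random.default_rng(0)
for fraction in (0.1, 0.25, 0.5, 1.0):
    n = int(fraction * len(X_train))
    # Random subsampling; any other strategy simply returns different indices.
    idx = rng.choice(len(X_train), size=n, replace=False)
    print(f"{fraction:>4.0%} of training data -> test accuracy {evaluate_subset(idx):.3f}")
```

Any sampling strategy can be plugged in by returning a different set of indices, which is exactly how the comparisons below are set up.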

Randomly removing samples as a baseline

We now have a defined task: comparing different sampling strategies and benchmarking them on various datasets. What we are missing is a baseline to compare our method against. The task itself is similar to active learning: starting with a small fraction of the training dataset and iteratively adding batches of samples the model is most uncertain about, we build up a dataset of highly relevant samples. Typically, active learning methods are compared against random subsampling.

One recent active learning paper from 2019 is Discriminative Active Learning, which compares various existing methods against its new approach. As is typically the case in active learning, these approaches rely on a repetitive train-predict-label loop; in the context of dataset filtering, “label” is equivalent to the sampling process. In other words, the dataset is built up step by step by adding more and more samples. In the plot below, each step adds 5,000 samples to the training set.

A plot from Discriminative Active Learning comparing various active learning methods
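Below is a generic sketch of such a train-predict-label loop, using simple least-confidence uncertainty sampling on the same toy setup as above. It illustrates the loop structure only; it is not the discriminative method from the paper, and the batch size of 100 is an arbitrary choice.

```python
# Generic sketch of the train-predict-label loop behind uncertainty-based
# active learning (least-confidence sampling shown here).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
batch_size = 100  # number of samples "labeled" per loop iteration
labeled = list(rng.choice(len(X_train), size=batch_size, replace=False))

while len(labeled) < len(X_train):
    # 1) Train on the currently labeled pool.
    model = LogisticRegression(max_iter=2000)
    model.fit(X_train[labeled], y_train[labeled])
    print(f"{len(labeled):>5} labeled -> test accuracy {model.score(X_test, y_test):.3f}")

    # 2) Predict on the remaining unlabeled pool.
    unlabeled = np.setdiff1d(np.arange(len(X_train)), labeled)
    probs = model.predict_proba(X_train[unlabeled])

    # 3) "Label" the batch the model is least confident about.
    uncertainty = 1.0 - probs.max(axis=1)
    pick = unlabeled[np.argsort(uncertainty)[-batch_size:]]
    labeled.extend(pick.tolist())
```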

Active learning is common practice today. However, current implementations come with quite a few downsides. Can we do better?

Using self-supervised learning for subsampling

At WhatToLabel, we want to go beyond current active learning and solve the following two fundamental problems:

  1. Active learning faces a cold-start problem since it needs an initially trained model that requires a labeled dataset.
  2. Iterative train-predict-label loops make it very slow for larger datasets since the model needs constant retraining.

We believe that by leveraging recent advancements in feature learning — using self-supervision and novel techniques we developed in-house — we can address both problems.
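The in-house method itself is not described in this post, so purely as an illustrative assumption, here is one generic, label-free way self-supervised features can drive subsampling: embed every image with a pretrained self-supervised model, then greedily pick a diverse subset with farthest-point (k-center) selection. No labels and no retraining loop are needed, which is the property the two problems above call for.

```python
# One generic, label-free way to exploit self-supervised features for
# subsampling: greedy farthest-point (k-center) selection on embeddings.
# NOTE: this is NOT WhatToLabel's in-house method; the `embeddings` array is
# assumed to come from any pretrained self-supervised model (one vector per
# unlabeled image).
import numpy as np

def farthest_point_subset(embeddings: np.ndarray, n_select: int, seed: int = 0):
    """Greedily pick n_select points that cover the embedding space."""
    rng = np.random.default_rng(seed)
    n = len(embeddings)
    selected = [int(rng.integers(n))]  # random first point
    # Distance from every point to its nearest selected point so far.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < n_select:
        nxt = int(np.argmax(dists))  # the most "uncovered" point
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# Toy usage with random vectors standing in for real embeddings.
embeddings = np.random.default_rng(1).normal(size=(1000, 128))
subset = farthest_point_subset(embeddings, n_select=100)
print(len(subset), "diverse samples selected without any labels")
```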

The plot below shows the first results stemming from more than a year of extensive experimentation and research. We were able to develop a method that subsamples datasets better than random, without labels and without a train-predict-label loop as in active learning. Each of the operating points (% of training dataset size) can be computed without any human intervention in less than an hour for 100,000 samples.

Plot comparing WhatToLabel filtering against random subsampling

What’s next?

Extensive evaluation of our method against existing active learning approaches. Many engineers have asked us how our results compare to their existing active learning pipelines. That’s why we are devoting part of our engineering resources to benchmarking our method against common active learning strategies on various datasets.

WTL (WhatToLabel) separates more informative samples from less informative samples.

We’re excited about building such essential infrastructure for this AI revolution and for what is to come! If you have any questions, feel free to comment below or reach out to us.

Igor, co-founder
whattolabel.com