Selecting the Most Typical Samples of Your Dataset
In recent years, Deep Neural Networks have achieved remarkable performance on a variety of machine learning tasks. However, these models require a large amount of training data. Labelling data can be extremely costly, especially if an expert is required for the annotation. As a result, a common approach is to pre-train the network on a large amount of unlabelled data and then fine-tune it on a small portion of annotated data. The key question in this case is: “Which data should I select for annotation from a large pool of unlabelled data, in order to optimally train my model?” The answer is not straightforward.
At Lightly, after helping hundreds of organisations with data selection, we realised that in the low-sample regime, plain diversity-based methods reach their limits because they focus heavily on outliers. In the search for alternative algorithms, we looked at how typicality-based selection can improve data selection.
Here is what you can expect to learn in this blog post:
- What typicality-based sample selection is.
- How typicality selection differs from selecting the most diverse data.
- What the limitations of typicality-based selection are.
- How typicality-based selection performs when evaluated on a classification task.
What Is Typicality Selection?
When selecting data from your data pool, you must define a criterion on which the selection is based. Typicality-based selection aims to select the most characteristic (or typical) samples of your data distribution. Let us see how this is done.
What do we mean when we say that a sample is typical? We mean that there are many samples that are similar to it. So, if a sample is typical, we expect that there are many points that are close to it or, equivalently, that have a small distance from it.
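Let us consider a sample x. We can find its K nearest neighbours in our sample set according to a distance d. The typicality¹ of the sample is then defined as the inverse of the mean distance between x and those K neighbours:

$$\mathrm{Typicality}(x) = \left( \frac{1}{K} \sum_{x_i \in \mathrm{KNN}(x)} d(x, x_i) \right)^{-1}$$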
You can observe that as the mean distance of a sample from its K nearest neighbours decreases, the typicality of that sample increases.
As a result, when you select data using the maximisation of their typicality as a criterion, you can expect to select samples from the high density regions of your sample distribution. Notice how, naturally, these will be the most typical samples of your distribution!
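To make the definition concrete, here is a minimal sketch in plain Python. It assumes that typicality is the inverse of the mean Euclidean distance to the K nearest neighbours, as in reference [1], and that selection simply keeps the highest-scoring samples. The function names `typicality` and `select_most_typical` and the toy points are illustrative assumptions, not part of any library.

```python
import math


def typicality(points, k):
    """Typicality of every point: the inverse of the mean Euclidean distance
    to its k nearest neighbours (the point itself is excluded).

    Assumes k < len(points) and no exact duplicate points.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn_dist = sum(dists[:k]) / k
        scores.append(1.0 / mean_knn_dist)
    return scores


def select_most_typical(points, n_select, k):
    """Indices of the n_select points with the highest typicality."""
    scores = typicality(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_select]


# Toy data: a dense cluster near the origin plus two isolated points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0), (-4.0, 2.0)]
selected = select_most_typical(points, n_select=2, k=2)
print(selected)
```

On this toy data both selected indices fall inside the dense cluster, which is the behaviour described above. For real datasets the distances would typically be computed on learned embeddings with an efficient nearest-neighbour search rather than with this quadratic loop.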
Typicality Selection vs Diversity Selection
It is also common to select a subset of data with the objective of finding a diverse cover of the dataset. Various algorithms have been proposed to that end. Core-Set² formulates the choice of which samples to label, and then use to train a Convolutional Neural Network, as a core-set selection problem. It shows that optimising the core-set selection is equivalent to the k-Center problem, which aims to choose points (centers) such that the largest distance between a data point and its nearest center is minimised.
If we apply a diversifying selection strategy directly to our raw data and want to select only a very small subset, we may end up selecting only edge cases. Consider, for instance, a dataset of images collected at fixed time intervals by a camera on a moving car. The majority of the images will be road scenes captured while the car was in motion. However, there will be some images taken when the car was entering a parking lot, or when it was approaching a gas station. If we select only a small subset of our images using a diversifying criterion, we will most likely select only the images at the parking lot, the gas station, and so on. In contrast, if typicality-based selection were applied in this case, we would most likely select only images of the car in motion on the streets. This is because typicality-based selection samples only from the high-density regions.
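A common way to approximate the k-Center objective is the greedy farthest-first heuristic, sketched below in plain Python. This is an illustration of the idea only: it is not necessarily the exact procedure used in Core-Set or in Lightly's products, and the function name `k_center_greedy`, the choice of the first center, and the toy points are assumptions.

```python
import math


def k_center_greedy(points, n_select, first=0):
    """Greedy farthest-first selection: repeatedly add the point that is
    farthest from the centers chosen so far.

    Assumes n_select <= len(points); `first` is an arbitrary starting index.
    """
    selected = [first]
    # Distance from every point to its nearest selected center.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < n_select:
        # The next center is the point that is currently worst covered.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return selected


# Toy data: a dense cluster near the origin plus two isolated points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0), (-4.0, 2.0)]
selected = k_center_greedy(points, n_select=3)
print(selected)
```

With a budget of three samples on this toy data, the heuristic picks the starting point and then both isolated points, leaving the dense cluster represented by a single sample.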
Limitations of Typicality Based Selection
Notice that the tendency of typicality selection to pick samples only from the high-density regions can be a limitation in some cases. For instance, let us say that we have a dataset composed of images of dogs and cats, with more images of dogs than cats. In that case, it is very likely that the cluster of dog images will be much more densely populated. As a result, when we use typicality selection to select a small number of samples, it is very likely that we will select only images of dogs!
Combining Typicality With Diversity Selection
Since each strategy has its own blind spot, the two criteria can be combined. To make this clearer, let us consider a simple example with synthetic data. We generate samples from a multivariate normal distribution in two dimensions with independent components. We ensure that there is one region of significantly higher density by setting the variances of the first and second components to 0.03 and 0.1 respectively. We then use Typicality, Diversity and Typicality-Diversity (a combination of the two) to select only 5 samples. The Figure above shows the initial samples in blue and the samples selected by each strategy in red. You can see that the Typicality strategy selects all five samples from the high-density region. In contrast, Diversity mostly selects edge cases: four of its five samples come from low-density regions. Combining Typicality and Diversity gives the most balanced selection, as illustrated in the right plot of the Figure above.
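The text does not specify how the two criteria are combined, so the sketch below shows one hypothetical way to do it, not Lightly's implementation: a greedy selection that scores each candidate by a weighted sum of its rescaled typicality and its rescaled distance to the samples already chosen. The function names, the weights `w_typ` and `w_div`, the value of `k`, the pool size of 200, and the random seed are all assumptions made for illustration.

```python
import math
import random


def knn_typicality(points, k):
    """Inverse of the mean Euclidean distance to the k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(k / sum(dists[:k]))
    return scores


def typicality_diversity_select(points, n_select, k=5, w_typ=0.5, w_div=0.5):
    """Hypothetical combined strategy: greedy selection on a weighted sum of
    typicality and distance to the nearest already-selected sample."""
    typ = knn_typicality(points, k)
    max_typ = max(typ)
    typ = [t / max_typ for t in typ]  # rescale typicality to (0, 1]

    # Start from the most typical point.
    selected = [max(range(len(points)), key=lambda i: typ[i])]
    chosen = set(selected)
    min_dist = [math.dist(p, points[selected[0]]) for p in points]

    while len(selected) < n_select:
        max_dist = max(min_dist)  # used to rescale the diversity term to [0, 1]
        scores = [
            float("-inf") if i in chosen
            else w_typ * typ[i] + w_div * min_dist[i] / max_dist
            for i in range(len(points))
        ]
        nxt = max(range(len(points)), key=scores.__getitem__)
        selected.append(nxt)
        chosen.add(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return selected


random.seed(0)
# Synthetic 2D Gaussian with independent components and variances 0.03 and 0.1.
points = [
    (random.gauss(0.0, math.sqrt(0.03)), random.gauss(0.0, math.sqrt(0.1)))
    for _ in range(200)
]
selected = typicality_diversity_select(points, n_select=5)
print(selected)
```

Setting `w_div=0` reduces this sketch to picking the most typical samples, while `w_typ=0` reduces it to a farthest-first diversity selection that starts from the most typical point, so the weights control the trade-off between the two behaviours.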
We can conclude that by combining diversity and typicality as our selection criteria, we can obtain a subset of our data that contains the most typical samples as well as edge cases, which makes the selection more balanced. In what follows, we will showcase this using the CIFAR-10 dataset.
Experiments
We will now evaluate the different selection strategies. The experimental set-up is the following:
- We use the CIFAR-10 dataset³. The dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The figure below shows the 10 classes of the dataset, as well as 10 random images from each class.
- We train a ResNet-18⁴ from scratch using only a small number of the labeled training data. Specifically, we consider the cases where the number of selected samples is 500, 1000, 1500 or 2000.
- The selected samples are chosen with two different selection strategies: (1) Diversity and (2) a combination of Typicality and Diversity strategies with equal weights.
- We train for 200 epochs using Stochastic Gradient Descent with momentum⁵ as the optimiser.
- We use three different training seeds and report the mean and standard deviation of the Macro F1 score on the test set.
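For readers unfamiliar with the metric, the sketch below shows how the Macro F1 score can be computed and then aggregated across seeds. The labels and predictions are toy values invented for illustration and are not results from the experiments. The helper name `macro_f1` is an assumption, and so is the use of the sample standard deviation, since the text does not say which variant was reported.

```python
import statistics


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy ground truth and one toy prediction list per training seed.
classes = [0, 1, 2]
y_true = [0, 0, 1, 1, 2, 2]
preds_per_seed = [
    [0, 1, 1, 1, 2, 0],
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 2, 2, 2],
]

scores = [macro_f1(y_true, y_pred, classes) for y_pred in preds_per_seed]
mean_f1 = statistics.mean(scores)
std_f1 = statistics.stdev(scores)  # sample standard deviation (assumption)
print(f"Macro F1: {mean_f1:.3f} +/- {std_f1:.3f}")
```

Because every class contributes equally to the average, Macro F1 penalises a model that performs poorly on a class that is under-represented in the selected training set.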
In the Figure below we show the Macro F1 score achieved on the CIFAR-10 dataset as a function of the number of labeled samples used for training. Notice that we only use a very small fraction of the 50000 available training images of CIFAR-10. The purpose of this experiment is solely to compare the selection strategies and not to train the best possible ResNet on CIFAR-10.
We can see that, for every sample size, selecting with a combination of Typicality and Diversity yields the better-performing sample set. This is because the combination of the two methods leads to a more balanced selection that takes into account typical samples as well as edge cases.
To better illustrate this, the Figures below show 2D embeddings obtained with UMAP⁶ for the 500 samples selected with each of the two strategies. The embeddings of the samples selected with the diversity strategy do not give a good representation of the underlying distribution. The samples are selected to be as far away as possible from each other, but they give little indication of the structure of the complete data distribution. In contrast, the samples selected with the combination of typicality and diversity form the more balanced subset: the selection remains diverse while not focusing only on the highest-density regions.
It is also interesting to look at the distribution of the labels in the selected set. In the Figure below we show the histograms of the labels in the sets selected with the Diversity and the Typicality-Diversity strategies. The label distribution for the Typicality-Diversity strategy is much closer to the uniform distribution than the label distribution obtained when the Diversity strategy is used alone. This is further quantified in the Table below, where we show the Kullback–Leibler (KL) divergence between the class distribution histogram of the selected set and the uniform histogram. The KL divergence for Typicality-Diversity is significantly lower than that for Diversity, which means that the labels in the set selected by the combined strategy are more balanced.
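As a rough illustration of this measurement, the sketch below computes the KL divergence between an empirical label histogram and the uniform distribution. It assumes the direction KL(P || U), with P the label distribution of the selected set, and natural logarithms, neither of which is stated in the text. The function name `kl_to_uniform` and the toy label lists are invented for illustration and are not the experimental data.

```python
import math
from collections import Counter


def kl_to_uniform(labels, num_classes):
    """KL(P || U) in nats, where P is the empirical label distribution and
    U is the uniform distribution over num_classes classes.

    Classes that never appear contribute nothing (0 * log 0 is taken as 0).
    """
    counts = Counter(labels)
    n = len(labels)
    kl = 0.0
    for count in counts.values():
        p = count / n
        kl += p * math.log(p * num_classes)  # p * log(p / (1 / num_classes))
    return kl


# Toy label sets over 4 classes: perfectly balanced vs. heavily skewed.
balanced = [0, 1, 2, 3] * 5
skewed = [0] * 17 + [1, 2, 3]

print(kl_to_uniform(balanced, num_classes=4))
print(kl_to_uniform(skewed, num_classes=4))
```

A value of zero means the selected labels are perfectly balanced, and the value grows as the selection concentrates on fewer classes.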
Conclusion
In this blog post we discussed what typical samples in a dataset are. We showed how adding a typicality objective when selecting data to label from a large pool of unlabeled data can lead to increased performance of the trained machine learning model.
Through benchmarks, we explored how adding a typicality objective to the selection increased the classification performance of a ResNet on CIFAR-10, compared to using only a diversity objective. Finally, we explained why this is the case, both intuitively and by using the embeddings of the selected samples.
Do you want to try out typicality-based selection yourself? Check out the LightlyOne docs and make the most out of your data!
Effrosyni Simou
Machine Learning Engineer
lightly.ai
[1] Hacohen, Guy, Avihu Dekel, and Daphna Weinshall. “Active Learning on a Budget: Opposite Strategies Suit High and Low Budgets.” International Conference on Machine Learning. PMLR, 2022.
[2] Sener, Ozan, and Silvio Savarese. “Active Learning for Convolutional Neural Networks: A Core-Set Approach.” International Conference on Learning Representations. 2018.
[3] Krizhevsky, Alex, and Geoffrey Hinton. “Learning Multiple Layers of Features from Tiny Images.” Technical report, University of Toronto, 2009.
[4] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[5] Sutskever, Ilya, et al. “On the importance of initialization and momentum in deep learning.” International Conference on Machine Learning. PMLR, 2013.
[6] McInnes, Leland, John Healy, and James Melville. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv preprint arXiv:1802.03426 (2018).