Brief Introduction to Contrastive Learning
Various families of Self-Supervised Learning methods have emerged in the past few years. One of the most dominant families is Contrastive Learning. Bromley et al. first introduced the idea of Contrastive Learning, along with the “Siamese” network architecture, in “Signature Verification using a Siamese Time Delay Neural Network” (NeurIPS 1993).
Verification consists of comparing an extracted feature vector with a stored feature vector for the signer. Signatures closer to this stored representation than a chosen threshold are accepted, all other signatures are rejected as forgeries. — Bromley et al. (1993)
Contrastive Learning has significantly impacted almost every form of modern deep learning, including Unsupervised, Semi-Supervised, Self-Supervised, and fully Supervised Learning.
The core objective is simple: similar things should stay close while different things should be far apart. Contrastive Learning is often employed as a loss function during training. Let’s look at how different methods define contrastive losses.
Various Formulations of Contrastive Loss
Chopra et al. (2005) first introduced a contrastive loss in the paper “Learning a Similarity Metric Discriminatively, with Application to Face Verification.” They used an Energy-Based Model (recall that the goal of an Energy-Based Model is to minimize an energy, which here plays a role analogous to the overall training loss) to build a face verification system within a discriminative learning framework. However, they needed a contrastive term to ensure that the energy for a pair of inputs from the same category is low, while the energy for a pair from different categories is large.
Thus, for a given pair of inputs, one tries to minimize the distance between them (in some embedding space) if they belong to the same class and to maximize it if they do not. Suppose a function f maps a given input x_i to an embedding vector and y_i denotes the label of the i-th sample.
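In this notation, the margin-based pairwise loss used by these early works can be sketched as follows (the exact weighting and squaring conventions vary slightly between papers, so treat this as the general shape rather than the precise published form):

$$
\mathcal{L}(x_i, x_j) \;=\; \mathbb{1}\bigl[y_i = y_j\bigr]\,\bigl\lVert f(x_i) - f(x_j)\bigr\rVert^{2} \;+\; \mathbb{1}\bigl[y_i \neq y_j\bigr]\,\Bigl[\max\bigl(0,\ \epsilon - \lVert f(x_i) - f(x_j)\rVert\bigr)\Bigr]^{2}
$$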
- The first term in the loss function deals with items from the same class (same labels): it minimizes the distance between any two such samples.
- The second term, on the other hand, deals with items from different classes (samples with different y values): it pushes such samples apart up to a margin epsilon, the minimum separation enforced between distinct classes.
This formulation remained the standard for several years, appearing in early works such as Hadsell et al. (2006).
Deep Metric Learning and Triplet Loss
Weinberger et al. (2009) introduced the Deep Metric Learning paradigm, in which the loss function is broken down into two terms: one that pulls target neighbours closer together and another that pushes differently labelled examples further apart. If alpha is a weighting term that balances the two components, the overall objective takes the following form.
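A sketch following the structure of Weinberger et al.'s large-margin formulation (here N(i) denotes the set of target neighbours of x_i, f is the learned embedding, and [·]₊ = max(·, 0)):

$$
\mathcal{L} \;=\; (1 - \alpha)\,\mathcal{L}_{\text{pull}} \;+\; \alpha\,\mathcal{L}_{\text{push}}
$$

$$
\mathcal{L}_{\text{pull}} = \sum_{i}\sum_{j \in N(i)} \bigl\lVert f(x_i) - f(x_j)\bigr\rVert^{2},
\qquad
\mathcal{L}_{\text{push}} = \sum_{i}\sum_{j \in N(i)}\sum_{l\,:\,y_l \neq y_i} \Bigl[\,1 + \lVert f(x_i) - f(x_j)\rVert^{2} - \lVert f(x_i) - f(x_l)\rVert^{2}\,\Bigr]_{+}
$$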
The “pull” function penalizes large distances between each input and its target neighbours, while the “push” function penalizes small distances between differently labelled examples.
Later works, such as Schroff et al. (2015), introduced the triplet loss, which adds a third sample, the anchor. The anchor belongs to the same class as the positive input and to a different class than the negative input. The model is then trained to minimize the distance between the anchor and the positive sample while increasing the distance between the anchor and the negative sample.
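Concretely, for anchors x^a, positives x^p (same class as the anchor) and negatives x^n (a different class), the triplet loss can be written as follows, with alpha the enforced margin between positive and negative distances:

$$
\mathcal{L} \;=\; \sum_{i}\Bigl[\, \bigl\lVert f(x_i^{a}) - f(x_i^{p})\bigr\rVert_2^{2} \;-\; \bigl\lVert f(x_i^{a}) - f(x_i^{n})\bigr\rVert_2^{2} \;+\; \alpha \,\Bigr]_{+}
$$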
NOTE: Over the years, choosing the negative sample proved vital and led to other follow-up work that showcased various ways to select good negative samples (hard negative mining).
Multi-Class Contrastive Losses
Although vanilla contrastive and triplet losses led to promising results, they often suffered from slow convergence. While hard negative mining can help, it increases training time because of the search for challenging samples. Sohn et al. (2016) addressed this by proposing a multi-class (N+1)-tuplet loss, wherein a model learns to identify the positive sample among (N-1) negative samples. This can be seen as a generalization of the triplet loss (the triplet loss is the special case where N=2). To avoid the resulting scaling problems, the authors also proposed an efficient batch construction method that only requires 2N examples.
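Written in its softmax form, the (N+1)-tuplet loss for an embedding f = f(x), its positive f⁺ and the N-1 negatives f_i is:

$$
\mathcal{L} \;=\; -\log \frac{\exp\bigl(f^{\top} f^{+}\bigr)}{\exp\bigl(f^{\top} f^{+}\bigr) + \sum_{i=1}^{N-1} \exp\bigl(f^{\top} f_i\bigr)}
$$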
Noise Contrastive Estimation and the InfoNCE Loss
Noise Contrastive Estimation (NCE) is a statistical paradigm wherein one learns to estimate a distribution without computing its full partition function (the partition function is the normalizing constant of the distribution, which is often intractable). With a simple reformulation, the same idea can be used to estimate a lower bound on the mutual information between two variables. In its most naive formulation, NCE was used to differentiate between target samples and noise. Oord et al. (2018) extended NCE into a categorical cross-entropy loss that identifies the positive sample amongst a set of unrelated noise samples.
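Concretely, writing s(·, ·) for a positive scoring function (in practice often an exponentiated similarity between the two embeddings), the loss over a set containing one positive sample x⁺ and N-1 noise samples x_j can be sketched as:

$$
\mathcal{L}_{\text{InfoNCE}} \;=\; -\,\mathbb{E}\left[\, \log \frac{s\bigl(x, x^{+}\bigr)}{\sum_{j=1}^{N} s\bigl(x, x_j\bigr)} \,\right]
$$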
Thus, the InfoNCE loss minimizes the negative log probability of correctly classifying the positive sample. The InfoNCE loss was later extended to images by Henaff et al. (2020).
Contrastive Learning in Practice
In practice, Contrastive Learning often performs best when used in a Self-Supervised manner. But how does one do this without labels? The trick is to create multiple variants of a single image using a set of known semantic-preserving transformations (data augmentations). The variants of a given sample are treated as positives for one another, and all other samples are treated as negatives.
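A minimal sketch of this idea using a torchvision-style augmentation pipeline (the particular transforms, their magnitudes, and the file name are illustrative assumptions, not a prescribed recipe):

```python
from PIL import Image
from torchvision import transforms

# Semantic-preserving augmentations; the specific choices and magnitudes
# here are illustrative, not a fixed recipe.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

image = Image.open("example.jpg")   # hypothetical input image
view_1 = augment(image)             # first randomly augmented view
view_2 = augment(image)             # second view; (view_1, view_2) form a positive pair
# Augmented views of *other* images in the batch serve as negatives.
```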
Feature Learning at the Instance Level
Wu et al., in their CVPR 2018 paper titled “Unsupervised Feature Learning via Non-Parametric Instance Discrimination,” provided the first framework that formulated feature learning as a non-parametric classification problem at the instance level and used noise contrastive estimation to tackle the computational challenges imposed by a large number of instance classes.
The question behind the paper was: “Can we learn good feature representations that capture apparent similarity among instances, instead of classes, by merely asking the feature to be discriminative of individual instances?” They used a simple CNN backbone to encode each image into a vector representation, which, after some minor pre-processing (projecting it into a lower-dimensional space and L2-normalizing it), is used to train a classifier to distinguish between individual instance classes.
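Concretely, the probability of a feature vector v being recognized as the i-th instance is given by a non-parametric softmax over the stored (memory-bank) feature vectors v_j of all n instances, with a temperature parameter tau:

$$
P(i \mid v) \;=\; \frac{\exp\!\bigl(v_i^{\top} v / \tau\bigr)}{\sum_{j=1}^{n} \exp\!\bigl(v_j^{\top} v / \tau\bigr)}
$$

Since the denominator runs over every instance in the dataset, noise contrastive estimation is used to approximate it with a small number of noise samples.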
SimCLR
Chen et al. (2020) proposed one of the most fundamental frameworks for applying Contrastive Learning to images in their paper “A Simple Framework for Contrastive Learning of Visual Representations” (SimCLR). SimCLR learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space.
This framework had some simple components:
- Given an input image, two “views” are generated using augmentations sampled from known semantic-preserving transformations, such as random colour distortions or random Gaussian blur.
- These views are then encoded into vectors using an encoder model. A typical choice for the encoder is a standard image-classification backbone such as a ResNet.
- A small projection head then maps these representations into a latent space, where the contrastive loss is finally applied.
A random minibatch is sampled, and then augmented examples are generated. After generating augmented views for N samples, we’ll get 2N data points (2 views of each sample). Note that we don’t explicitly search for hard negative samples. Instead, given a positive pair, the other 2(N-1) augmented examples are treated as negative examples.
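For a positive pair (i, j) among the 2N augmented examples, with z_i denoting the projected representation of example i and tau a temperature parameter, the per-pair loss is:

$$
\ell_{i,j} \;=\; -\log \frac{\exp\bigl(\mathrm{sim}(z_i, z_j)/\tau\bigr)}{\sum_{k=1}^{2N} \mathbb{1}\bigl[k \neq i\bigr]\,\exp\bigl(\mathrm{sim}(z_i, z_k)/\tau\bigr)}
$$

where sim(·, ·) denotes cosine similarity; the total loss averages this quantity over all positive pairs in the batch.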
This loss is similar in form to the aforementioned (N+1)-tuplet loss. Instead of using the dot product of the latent representations directly, it employs a cosine similarity scaled by a temperature parameter; the only term excluded from the denominator is the anchor’s similarity with itself. The literature often refers to this formulation as NT-Xent (Normalized temperature-scaled cross-entropy loss).
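A minimal PyTorch sketch of this computation, assuming the two augmented views arrive as separate batches z1 and z2 of projected representations (the function name and the default temperature value are illustrative choices):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss for two batches of projected views.

    z1, z2: (N, d) projections of the two augmented views of the same N images.
    """
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit-norm rows
    sim = (z @ z.t()) / temperature                      # (2N, 2N) scaled cosine similarities
    # Exclude each anchor's similarity with itself from the candidates.
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # For anchor i < N the positive sits at index i + N, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    # Cross-entropy over the remaining 2N - 1 candidates picks out the positive.
    return F.cross_entropy(sim, targets)
```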
Barlow Twins
Along the same lines, Zbontar et al. (2021) proposed a similar framework, Barlow Twins. However, instead of a similarity-based contrastive loss, they enforce the cross-correlation matrix between the features of the two augmented views to be as close to the identity matrix as possible.
For a given pair of views, the loss function employed by Barlow Twins is as follows:
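$$
\mathcal{L}_{\mathcal{BT}} \;=\; \underbrace{\sum_{i} \bigl(1 - \mathcal{C}_{ii}\bigr)^{2}}_{\text{invariance term}} \;+\; \lambda \underbrace{\sum_{i}\sum_{j \neq i} \mathcal{C}_{ij}^{\,2}}_{\text{redundancy reduction term}}
$$

Here, C is the cross-correlation matrix computed between the embeddings of the two views along the batch dimension, and lambda is a positive constant trading off the two terms.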
The first term here enforces invariance by pushing the diagonal elements of the cross-correlation matrix towards 1, thereby making the embedding invariant to the distortions applied. The second term, in contrast, reduces redundancy by pushing the off-diagonal elements towards 0, thereby de-correlating the different components of the embedding.
VICReg
The VICReg framework by Bardes et al. (2022) goes one step further than SimCLR and Barlow Twins: it trains joint-embedding architectures with an objective explicitly based on preserving the information content of the embeddings.
The loss function used in VICReg combines three terms: an invariance term that minimizes the mean-squared distance between the embedding vectors, a variance term that forces the embedding vectors of samples within a batch to be different, and a covariance term that de-correlates the variables of each embedding and prevents an informational collapse in which the variables would vary together or be highly correlated.
The final loss is a weighted average of the invariance, variance and covariance terms.
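Writing Z and Z′ for the two batches of embeddings produced from the two views, and s, v and c for the invariance, variance and covariance terms respectively, the overall objective has the following shape (lambda, mu and nu are the weighting hyper-parameters):

$$
\mathcal{L}(Z, Z') \;=\; \lambda\, s(Z, Z') \;+\; \mu\,\bigl[\, v(Z) + v(Z') \,\bigr] \;+\; \nu\,\bigl[\, c(Z) + c(Z') \,\bigr]
$$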
Common Problems in Contrastive Learning
A common difficulty in Contrastive Learning is finding (often referred to as “mining”) hard negative samples to use in the loss function. Initial work involved intentionally selecting negative samples that are close to, but distinct from, the positives to form a more challenging learning objective. However, more recent research (Kalantidis et al. 2020, Tian 2022) has shown that large batch sizes, together with a careful choice of loss function, provide an implicit batch-level mechanism that emphasizes hard negative pairs without explicit “hard-negative sampling”.
Summary
Contrastive Learning is a powerful approach that has gained significant traction in recent years. It distinguishes between similar and dissimilar data points without relying on explicit labels. The fundamental idea is to pull similar items closer together in a representational space while pushing dissimilar items further apart. Contrastive Learning often involves creating multiple versions of the same data point through various transformations or augmentations. These versions are treated as positive pairs, while unrelated data points are treated as negative examples. By learning to recognize which pairs are related and which are not, the model develops some form of understanding of the underlying structure of the data. The success of Contrastive Learning has led to significant improvements in the quality of learned representations, often rivalling or surpassing those obtained through traditional supervised learning methods.