SimCLR Explained: The ELI5 Guide for Engineers
One of the most influential papers on Self Supervised Learning is SimCLR by Chen et al. (2020). Self-supervised learning has proved to be a valuable strategy compared to traditional methods like supervised learning, which are limited by their dependency on labeled data. This severely limits their use in high-impact domains like medical imaging. For most medical imaging tasks, such as Whole Slide Identification or detecting artefacts in X-rays, datasets are not available at the scale of current models (most recent VLMs and transformers have > a billion parameters). Moreover, labelling is even more expensive in these domains, thereby making Supervised Learning almost impossible (fine-tuning might still be possible but not full-scale pre-training).
Self-supervised learning shines in such label-deficient industries.
Without the dependency on labels, one can exploit the web-scale collection of raw samples. At its core, Self-Supervised Learning defines a pretext task that enables learning rich representations from unlabeled data. These pretext tasks can take various forms, but perhaps the simplest and most influential is the task of restoring missing or distorted parts of an input. The most common form of this is the Masked Language Modelling underpinning the success of transformer-based models in Natural Language Processing.
Services like ChatGPT and Claude rely on billion-parameter transformer models that are pre-trained on such tasks. Words are randomly masked in a sentence, and the model is trained to predict the surrounding words. This forces the model to learn relationships between words without any explicit labels, such as entity tagging. Moreover, representations learnt in this manner can be used for several tasks ranging from summarization and translation to text generation (how we use LLMs these days).

In computer vision, models can be trained to predict image patches or contrast augmented views of the same image against others—forcing them to learn spatial relationships and invariant features.
This contrastive approach helps group similar representations while distinguishing different ones, resulting in robust, transferable vision models—much like Masked Language Modeling in NLP.
In this article we’ll systematically study and cover the following:
- What is SimCLR and Contrastive Learning Basics
- Data Augmentations in SimCLR: Generating Augmented Views
- Contrastive Loss (NT-Xent): The Loss Function in Latent Space
- SimCLR Training Procedure and Tips
- Evaluating Self-Supervised Representations
- SimCLR vs. Supervised Learning: Key Differences and Insights
- SimCLR Conclusion and Further Development
And if you’d like to have a visual explanation of SimCLR, we encourage you to check out this video:
Now, let’s dive in.
What is SimCLR and Contrastive Learning Basics
SimCLR is a landmark framework for contrastive learning of visual representations that helped push self-supervised performance to new heights. It was introduced by Chen et al. in 2020.

Developed by the Google Brain team (including Geoffrey Hinton), SimCLR demonstrated that with the right training strategy and some major components – simple augmentations, a contrastive loss function, and big batches – a model can learn image features almost as well as if it had full supervision. This was a compelling breakthrough at a time when models depended heavily on millions of labeled images for training. The SimCLR framework is part of the broad family of models known as the Deep Metric Learning family, which includes other prominent methods like NNCLR.
At its heart, SimCLR uses a contrastive learning objective: the model is trained to maximize agreement between different augmented views of the same image while minimizing agreement between views of the same example
In other words, if you take an image and create two altered versions of it, SimCLR will learn to output similar embeddings (feature vectors) for those two images. Meanwhile, embeddings of two different images should be far apart (dissimilar) in the latent space unless those images are visually similar by coincidence.
This approach is often described as an instance discrimination task: each individual image instance is treated as its own class. The only “label” the model gets is that two augmented images came from the same original (so they belong together) while other images in the batch are different. This forms the basis of a contrastive prediction task– given an image representation, the model must identify which other representation in the batch is the matching augmented view (the positive) among many negatives.
Data Augmentations in SimCLR: Generating Augmented Views
Data augmentation is at the heart of SimCLR.
In SimCLR, two views of the same data example are generated by randomly sampling augmentations from a known family of “semantic preserving” transformations. In the case of SimCLR, two views are generated using either random resizing, cropping, colour jittering or random blurring.
An Encoder network is then used to create encodings of these augmented data examples. A typical choice for an encoder network is a traditional image backbone such as ResNet-50. Having generated such encodings of the two views of the same image, a projector network then maps representations into another representation space. A simple MLP network is chosen to act as the projector network. These resulting representations are then used to train the model using a contrastive objective.
Key data augmentation techniques
The SimCLR paper found that a specific combination of simple image augmentations was crucial for learning good representations. The three primary augmentations used (applied sequentially) are:
- Random cropping and resize: Randomly crop a portion of the image and resize it back to the original resolution.
- Random color distortion: Apply random changes to color properties (brightness, contrast, saturation, hue). Each augmented view gets its own color jitter, so that two views of the same image can have very different color distributions.
- Random Gaussian blur: Slightly blur one of the images (especially effective for images from high-resolution datasets).
Why data augmentation matters
Since we are attempting to learn without labels, where does the signal during training come from?
Because we are using a Contrastive Learning objective, the simple task of enforcing representations of different views of the same image to be similar forms the signal since there is some notion of two objects being similar (pseudo label).
Contrastive Loss (NT-Xent): The Loss Function in Latent Space
The evolution of SimCLR is closely tied to the emergence of Contrastive Loss. Here are the basics:
For a given pair of inputs, one tries to minimize the distance from the same class and maximize the distance if they are not (in some embedding space). If a function f maps a given input x to a vector and y denotes the label of the i-th sample.

- The first term in the loss function deals with items from the same class (same labels). We do this by minimizing the distance between any two samples.
- The second term, on the other hand, deals with items from different classes (samples with different y values). We try to maximize the distance between such samples about some epsilon, which is the minimum distance between distinct classes.
Variants of the same input image are called positive examples, while views of different images are called negative examples.
Given a dataset of unlabelled images, a random mini-batch is sampled, and augmented views are generated. After generating augmented views for N samples, we’ll get 2N data points (2 views of each sample). Note that we don’t explicitly search for hard negative samples. Instead, given a positive pair, all other 2(N-1) samples are treated as negative pairs.

This loss formulation is similar to the (N+1) loss typically used in Contrastive Learning. Instead of using the dot product of the latent representations directly, this loss function employs a cosine similarity function. The literature often refers to this formulation as NT-Xent (Normalized temperature-scaled cross-entropy loss).
With the availability of “positive” and “negative” samples, one might also consider employing a Triplet Loss. This formulation involves a margin parameter, say m, which enforces a minimum distance between the views of different images.

SimCLR Training Procedure and Tips
Let’s review some example code showcasing how to train a model using the SimCLR framework. We’ll be using LightlySSL - a computer vision framework for self-supervised learning.
LightlySSL is a modular framework that exposes low-level building blocks, such as loss functions and model heads, in an easy-to-use API. It is written in a PyTorch-like style. It also supports custom backbone models for self-supervised pre-training and distributed training using PyTorch Lightning.
You can install the package from PyPI using pip as follows:
pip install lightly
LightlySSL offers simple abstractions and model implementations for 10+ Self-Supervised Learning frameworks. Using some helper modules from the library, we can create a simple PyTorch Module that implements the SimCLR framework.
from lightly.models.modules import SimCLRProjectionHead
class SimCLR(nn.Module):
def __init__(self, backbone):
super().__init__()
self.backbone = backbone
self.projection_head = SimCLRProjectionHead(512, 512, 128)
def forward(self, x):
x = self.backbone(x).flatten(start_dim=1)
z = self.projection_head(x)
return z
- We import the SimCLRProjectionHead module from the models’ submodule, which allows us to implement the “projector” network.
- This implementation takes a backbone model as input but is flexible and allows using any model class as the backbone.
- In the forward method, we pass the input to the backbone model and flatten the output to generate intermediate representations
- These intermediate representations are then passed to the projection head to generate the final embeddings for a given input.
Let’s use the resnet18 model from torchvision as our backbone and instantiate a SimCLR model.
import torchvision
from torch import nn
resnet = torchvision.models.resnet18()
backbone = nn.Sequential(*list(resnet.children())[:-1])
model = SimCLR(backbone)
LightlySSL also implements various augmentations that are usually used in self-supervised learning methods. For SimCLR, we can use the predefined SimCLRTransform class.
from lightly.transforms.simclr_transform import SimCLRTransform
transform = SimCLRTransform(input_size=32, gaussian_blur=0.0)
dataset = torchvision.datasets.CIFAR10(
"datasets/cifar10", download=True, transform=transform
In this example, we see that the transforms available in LightlySSL are compatible with torchvision datasets
What about training? That can also be greatly simplified.
from lightly.loss import NTXentLoss
criterion = NTXentLoss()
for epoch in range(10):
total_loss = 0
for batch in dataloader:
x0, x1 = batch[0] x0 = x0.to(device) x1 = x1.to(device)
z0 = model(x0)
z1 = model(x1)
loss = criterion(z0, z1)
total_loss += loss.detach()
loss.backward()
optimizer.step()
optimizer.zero_grad()
avg_loss = total_loss / len(dataloader)
- We use the NTXentLoss implementation available in the loss submodule
The entire training code can be found in lightly/examples/.
If you are using PyTorch Lightning, LightlySSL integrates seamlessly with Lightning Modules.
class SimCLR(pl.LightningModule):
def __init__(self):
super().__init__()
resnet = torchvision.models.resnet18()
self.backbone = nn.Sequential(*list(resnet.children())[:-1])
self.projection_head = SimCLRProjectionHead(512, 2048, 2048)
self.criterion = NTXentLoss()
def forward(self, x):
x = self.backbone(x).flatten(start_dim=1)
z = self.projection_head(x)
return z
def training_step(self, batch, batch_index):
(x0, x1) = batch[0]
z0 = self.forward(x0)
z1 = self.forward(x1)
loss = self.criterion(z0, z1)
return loss
def configure_optimizers(self):
optim = torch.optim.SGD(self.parameters(), lr=0.06)
return optim
trainer = pl.Trainer(max_epochs=10, devices=1, accelerator=accelerator)
trainer.fit(model=model, train_dataloaders=dataloader)
For more detailed reference code and pretrained models, refer to the LightlySSL docs.
Pro Tip: For many use cases, you don’t actually need to train SimCLR from scratch. Refer to our tutorial on fine tuning lightly checkpoints.
You can download these and fine-tune or evaluate them on your own task.