Self-supervised learning trends and what to expect in 2023

A quick overview of the SSL landscape, focusing on state-of-the-art research in masked image modeling, multi-modal models, CLIP, and local-level SSL.

This blog post explores recent advancements in self-supervised learning for computer vision and the trends likely to continue into 2023. As an engineer working in this field at Lightly, I will discuss promising research directions in self-supervised learning while briefly presenting some of the most popular papers of 2022.

A little background about SSL

If you look at the top-1 accuracy of self-supervised learning approaches on ImageNet, you can easily see that model performance has grown massively over the last three years. Many crucial developments have propelled the field forward, enabling the creation of effective models without the need for labeled data.

Self-supervised learning in computer vision arguably started with Colorful Image Colorization (2016), a pioneering work from UC Berkeley, which proposed a CNN trained to predict a plausible color version of a black-and-white photograph. The goal was to create an algorithm capable of passing the colorization Turing test.

The first revolution in the field came with Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination (2018), which introduced contrastive learning at the instance level and significantly improved model performance.

The accuracy of models using self-supervised learning has plateaued. Image courtesy of https://paperswithcode.com/sota/self-supervised-image-classification-on.

After this publication, new models like BYOL (Bootstrap your own latent: A new approach to self-supervised learning, 2020) and SimCLR (A Simple Framework for Contrastive Learning of Visual Representations, 2020) achieved, for the first time, classification results comparable to what had previously only been possible with supervised learning.
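To make the contrastive objective behind SimCLR-style methods concrete, here is a minimal PyTorch sketch of an NT-Xent-style loss. It is a simplified illustration rather than any paper's exact implementation; the batch size, embedding dimension, and temperature below are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z0: torch.Tensor, z1: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Simplified NT-Xent (normalized temperature-scaled cross entropy) loss.

    z0, z1: embeddings of two augmented views of the same images, shape (N, D).
    Each sample's positive is its counterpart in the other view; all other
    2N - 2 embeddings in the batch act as negatives.
    """
    n = z0.shape[0]
    z = F.normalize(torch.cat([z0, z1], dim=0), dim=1)       # (2N, D), unit norm
    sim = z @ z.t() / temperature                            # pairwise cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))               # ignore self-similarity
    # The positive of sample i is sample i + N (and vice versa).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Toy usage with random "embeddings" standing in for two augmented views.
z0, z1 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent_loss(z0, z1).item())
```

Each image's second augmented view is its only positive, while every other embedding in the batch acts as a negative, which is why these methods benefit from large batch sizes.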

Model accuracy skyrocketed between late 2019 and early 2021, with research mainly focused on contrastive methods and siamese architectures. Here you can find a great lecture from Facebook AI Research (FAIR) researchers about the most popular papers they published during those years.

Recent trends in self-supervised learning

The accuracy of these models plateaued in 2021: the problem of SSL classification is, to some extent, solved on standard datasets like ImageNet, and the question now is whether the models perform well on downstream tasks. However, new approaches are emerging that promise to push the boundaries of self-supervised performance.

For example, the Vision Transformer (ViT), introduced by An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020), has become a widely used backbone in recent papers. The second trend I see today is that more and more models combine different data types, such as data2vec (data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, 2022). Another architecture that is being incorporated a lot into SSL is CLIP (Contrastive Language–Image Pre-training, 2021), a very large model trained with natural language supervision. The last trend I see in research is the rise of papers that pre-train models on local features to perform better on tasks other than image classification. I will now break down these four research directions.

Masked image modeling

One of the most exciting papers published at the end of 2021 was Masked Autoencoders Are Scalable Vision Learners, 2021 (MAE). This work, taking inspiration from NLP, led to a small revolution in the field. The authors discovered they could train a powerful SSL model by masking a high proportion of the input image (75%), passing only the visible patches to the encoder, and reconstructing the missing ones with a lightweight decoder, both built from vanilla ViTs.


MAE Architecture from Masked Autoencoders Are Scalable Vision Learners. Both the encoder and decoder are vanilla ViTs.

In contrast to classic contrastive approaches built on ResNet backbones, there is no need for negative samples. The method also inherently builds meaningful patch-wise embeddings that can be accessed for downstream use.

This approach set a new state-of-the-art accuracy on ImageNet-1K and has shown good transfer performance on downstream tasks, outperforming previous self-supervised pre-training methods.
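To illustrate the masking step described above, the sketch below keeps a random 25% of the patch tokens, runs only those through a small Transformer encoder, and lets a decoder reconstruct the full token sequence from the encoded tokens plus learned mask tokens. It is a heavily simplified, hypothetical sketch with made-up dimensions, not the authors' implementation, and it omits the pixel-reconstruction loss on the masked patches.

```python
import torch
import torch.nn as nn

class TinyMaskedAutoencoder(nn.Module):
    """Minimal MAE-style sketch: encode only visible patches, decode the full set."""

    def __init__(self, num_patches=196, dim=64, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Stand-ins for the ViT encoder/decoder blocks of the real model.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2
        )
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=1
        )
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, patches):                       # patches: (B, N, dim) patch embeddings
        B, N, D = patches.shape
        num_keep = int(N * (1 - self.mask_ratio))
        # Random shuffle per image; keep the first `num_keep` indices as visible.
        idx = torch.rand(B, N).argsort(dim=1)
        keep = idx[:, :num_keep]
        visible = torch.gather(patches + self.pos_embed, 1,
                               keep.unsqueeze(-1).expand(-1, -1, D))
        encoded = self.encoder(visible)               # the encoder only sees ~25% of the tokens
        # Rebuild the full sequence: encoded tokens at kept positions, mask tokens elsewhere.
        full = self.mask_token.repeat(B, N, 1)
        full = full.scatter(1, keep.unsqueeze(-1).expand(-1, -1, D), encoded)
        return self.decoder(full + self.pos_embed)    # (B, N, dim) reconstructed token sequence

model = TinyMaskedAutoencoder()
out = model(torch.randn(2, 196, 64))
print(out.shape)  # torch.Size([2, 196, 64])
```

Because the heavy encoder only processes the visible quarter of the tokens, pre-training is much cheaper than running a full ViT over every patch.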

The simplicity of the architecture has undoubtedly inspired follow-up works such as SimMIM: A Simple Framework for Masked Image Modeling (2021), which builds on the ideas of MAE and proposes an even simpler model consisting of a ViT encoder and a linear decoder.

Masked Siamese Networks for Label-Efficient Learning architecture.

Another outstanding paper in this direction is Masked Siamese Networks for Label-Efficient Learning, 2022 (MSN). The authors found that combining a siamese approach with the MAE concept sets a new state of the art for self-supervised learning on ImageNet-1K. They train the student network with gradient descent while updating the teacher as an exponential moving average (EMA) of the student, and they introduce prototypes to avoid representation collapse.
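The teacher-student mechanics mentioned above are easy to sketch: only the student receives gradients, and the teacher's weights are an exponential moving average of the student's. Below is a minimal, generic version of that update; the momentum value and the toy network are placeholders, and MSN's masking and prototype assignment are omitted.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.996):
    """Update teacher parameters as an exponential moving average of the student's."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# Toy setup: the teacher starts as a frozen copy of the student.
student = torch.nn.Linear(32, 16)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

# After each optimizer step on the student, drag the teacher toward it.
ema_update(teacher, student)
```

The slowly moving teacher provides stable targets, which is one of the tricks that keeps siamese methods from collapsing to trivial solutions.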

I see a lot of potential in this field of research. Masked approaches have been game-changing in natural language processing, and the vision transformer is still an emerging backbone in computer vision. In 2023, I expect new training techniques for masked models and enhanced versions of the ViT itself.

Multi-modal models

Self-supervised learning is similar across modalities, but the specific algorithms and objectives vary depending on the task. data2vec presented, for the first time, a framework that applies the same learning method to speech, natural language processing, and computer vision.

This paper is more of a proof of concept, showing that a single network design can be trained for each modality. The model uses the same architecture with hyperparameters adjusted depending on the modality. This strategy has the potential to significantly simplify the development and deployment of AI systems, making them more versatile and adaptable to different use cases.

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. It is a teacher-student architecture handling images, speech, and language.
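Conceptually, data2vec predicts latent targets rather than raw pixels, words, or waveforms: an EMA teacher encodes the unmasked input, targets are built by averaging its top layers, and the student regresses those targets at the masked positions. The sketch below shows only that target construction and regression loss; all shapes are made up, the per-layer target normalization is omitted, and the real model is a full Transformer per modality.

```python
import torch
import torch.nn.functional as F

def data2vec_style_loss(student_out, teacher_layers, masked):
    """Regress averaged teacher representations at masked positions.

    student_out:    (B, T, D) student outputs for the masked input.
    teacher_layers: list of (B, T, D) hidden states from the top-K teacher layers,
                    computed on the *unmasked* input (the teacher is an EMA of the student).
    masked:         (B, T) boolean mask marking which timesteps/patches were masked.
    """
    with torch.no_grad():
        targets = torch.stack(teacher_layers).mean(dim=0)   # average the top-K layers
    return F.smooth_l1_loss(student_out[masked], targets[masked])

# Toy example: 2 sequences, 10 tokens, 64-dim features, top-3 teacher layers.
student_out = torch.randn(2, 10, 64, requires_grad=True)
teacher_layers = [torch.randn(2, 10, 64) for _ in range(3)]
masked = torch.rand(2, 10) > 0.4
print(data2vec_style_loss(student_out, teacher_layers, masked).item())
```

Because the targets live in latent space rather than in the raw input space, the same objective applies whether the tokens come from images, speech, or text.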

Combining image and video inputs with masked autoencoders led to OmniMAE: Single Model Masked Pretraining on Images and Videos (2022). This robust yet straightforward architecture learns visual representations that are comparable to or even better than single-modality representations on both images and videos. The authors trained a single model by masking up to 95% of the video patches, using an MAE with positional encodings over both space and time. This aggressive masking enabled high-speed training and fast convergence.

Multi-modal models are revolutionizing the field of AI by allowing the use of a single network for multiple modalities. This approach shows how simple architectures can be versatile and adaptable to different use cases. In my opinion, multi-modal models will play a leading role in shaping the future of AI.

The advent of contrastive loss in CLIP

OpenAI’s CLIP model has revolutionized the field of computer vision. Research in this direction capitalizes on the model’s huge capacity, which is the key to using CLIP as a supervising model for existing architectures.

One example is Detecting Twenty-thousand Classes using Image-level Supervision (2022). This work showed that CLIP supervision helps training and makes zero-shot object detection possible. The architecture inherits a classical object detector with region proposal networks (RPNs) and uses a weighted loss combining the RPN, bounding-box regression, and CLIP-based classification terms. The resulting model generalizes well, finally making the detection of thousands of classes feasible.

Detecting Twenty-thousand Classes using Image-level Supervision. The model is trained using data from a standard object detector and supervision from CLIP.
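The classification piece of such open-vocabulary detectors can be sketched as replacing a learned classifier with frozen CLIP text embeddings of the class names: region features are scored by similarity to those embeddings, so adding classes only requires adding text embeddings. The sketch below is a hypothetical simplification; the shapes, temperature, and loss weights are placeholders, and the RPN and box-regression losses are represented by dummy values.

```python
import torch
import torch.nn.functional as F

def open_vocab_classification_loss(region_feats, text_embeds, labels, temperature=0.07):
    """Classify region features by similarity to frozen CLIP text embeddings.

    region_feats: (R, D) features of R region proposals, projected into CLIP's space.
    text_embeds:  (K, D) CLIP text embeddings of the K class names.
    labels:       (R,) ground-truth class index for each region.
    """
    region_feats = F.normalize(region_feats, dim=1)
    text_embeds = F.normalize(text_embeds, dim=1)
    logits = region_feats @ text_embeds.t() / temperature    # (R, K) class scores
    return F.cross_entropy(logits, labels)

# Toy example: 5 proposals, 512-dim features, 20 classes; dummy detector losses.
cls_loss = open_vocab_classification_loss(
    torch.randn(5, 512), torch.randn(20, 512), torch.randint(0, 20, (5,))
)
rpn_loss, box_loss = torch.tensor(0.3), torch.tensor(0.2)    # placeholders for detector losses
total = 1.0 * rpn_loss + 1.0 * box_loss + 1.0 * cls_loss     # weighted combination of terms
print(total.item())
```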

CLIP was also used to train self-supervised models, as in SLIP: Self-supervision meets Language-Image Pre-training (2021). This work introduces a multi-task learning framework combining self-supervised learning and CLIP pre-training. In this case, CLIP is combined with SimCLR, resulting in a model that performs better than one trained with self-supervision or language supervision alone.
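Schematically, SLIP shares one image encoder between two objectives, a CLIP-style image-text contrastive loss and a SimCLR-style loss over two augmented views, and adds them with a weighting factor. In the sketch below, a single symmetric InfoNCE stands in for both terms (a close match for the CLIP branch, a simplification for the SimCLR branch), and all embeddings and the weighting factor are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE between two aligned batches of embeddings, shape (N, D)."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Placeholder embeddings standing in for the shared image encoder's outputs
# (two augmented views), the image-text projection head, and the text encoder.
view0, view1 = torch.randn(8, 128), torch.randn(8, 128)      # SimCLR branch
img_emb, txt_emb = torch.randn(8, 128), torch.randn(8, 128)  # CLIP branch
ssl_scale = 1.0                                              # weight between the two objectives

total_loss = info_nce(img_emb, txt_emb) + ssl_scale * info_nce(view0, view1)
print(total_loss.item())
```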

Relying on architectures such as CLIP for supervision significantly improves model performance. Transferring what CLIP has learned to a smaller, task-focused model can improve both SSL and supervised models in generalization and accuracy. I am excited to see forthcoming integrations of these high-capacity models.

Self-supervised pre-training on local features

When it comes to SSL, models usually perform well on image classification but struggle to generalize to other downstream tasks such as object detection or image segmentation.

Unlike image classification, which focuses on identifying the overall content of an image, these two tasks require a more fine-grained understanding of the image. Local-level self-supervised learning methods, which focus on learning representations of small regions within an image, have proven to be particularly effective for these tasks.

One of the pioneers in this direction was Dense Contrastive Learning for Self-Supervised Visual Pre-Training (2020). The authors explored how applying a contrastive loss at the pixel level can improve model performance on object detection and semantic segmentation.

Results from Dense Contrastive Learning for Self-Supervised Visual Pre-Training. Working on dense local features improves AP and mIoU on object detection and semantic segmentation.
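The core idea can be sketched as an InfoNCE loss applied per spatial location instead of per image: each location of one view is paired with its most similar location in the other view, and all remaining locations act as negatives. This is a simplified stand-in for the paper's correspondence scheme, with made-up shapes and temperature.

```python
import torch
import torch.nn.functional as F

def dense_contrastive_loss(f0: torch.Tensor, f1: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """Pixel/region-level contrastive loss sketch.

    f0, f1: dense feature maps of two augmented views, shape (B, C, H, W).
    Each location in f0 is matched to its most similar location in f1;
    all other locations of f1 serve as negatives.
    """
    B, C, H, W = f0.shape
    a = F.normalize(f0.flatten(2).transpose(1, 2), dim=2)   # (B, HW, C)
    b = F.normalize(f1.flatten(2).transpose(1, 2), dim=2)   # (B, HW, C)
    sim = torch.bmm(a, b.transpose(1, 2)) / temperature     # (B, HW, HW) location similarities
    # In the paper the correspondence comes from backbone features; here we
    # simply reuse the same similarities to pick each location's positive.
    targets = sim.argmax(dim=2)
    return F.cross_entropy(sim.flatten(0, 1), targets.flatten())

# Toy feature maps standing in for the backbone's dense outputs on two views.
f0, f1 = torch.randn(2, 64, 7, 7), torch.randn(2, 64, 7, 7)
print(dense_contrastive_loss(f0, f1).item())
```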

Efficient Visual Pretraining with Contrastive Detection (2021) is another stimulating work focusing on local features, this time to address the computational bottleneck of large self-supervised models. It computes the loss on local features extracted with simple heuristics or efficient segmentation algorithms like Felzenszwalb-Huttenlocher (2004). Doing so makes pre-training up to five times faster while reaching state-of-the-art transfer accuracy.

The architecture proposed by VICRegL: Self-Supervised Learning of Local Visual Features.

A recent publication exploiting local features is VICRegL: Self-Supervised Learning of Local Visual Features (2022). The proposed architecture combines a global criterion on image embeddings with a local criterion on feature maps from the convolutional encoder, using the loss from VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning (2021). Combining local and global features yields strong performance on segmentation transfer tasks.
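For reference, the VICReg criterion that VICRegL reuses has three terms: an invariance term (mean squared error between the two views' embeddings), a variance term (a hinge keeping each embedding dimension's standard deviation above a threshold), and a covariance term (penalizing off-diagonal covariance to decorrelate dimensions). Below is a compact sketch; the weights follow the spirit of the paper's defaults but should be treated as placeholders.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z0, z1, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Sketch of the VICReg criterion on two batches of embeddings, shape (N, D)."""
    n, d = z0.shape

    invariance = F.mse_loss(z0, z1)                          # pull the two views together

    def variance_term(z):
        std = torch.sqrt(z.var(dim=0) + eps)
        return torch.relu(1.0 - std).mean()                  # hinge: keep per-dim std >= 1

    def covariance_term(z):
        z = z - z.mean(dim=0)
        cov = (z.t() @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d                     # decorrelate embedding dimensions

    return (sim_w * invariance
            + var_w * (variance_term(z0) + variance_term(z1))
            + cov_w * (covariance_term(z0) + covariance_term(z1)))

z0, z1 = torch.randn(16, 32), torch.randn(16, 32)
print(vicreg_loss(z0, z1).item())
```

The variance and covariance terms prevent collapse without negatives; VICRegL additionally applies this criterion to matched local feature-map locations.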

I am looking forward to seeing how this sub-branch will evolve. I hope the focus of self-supervised pre-training research will shift from image classification alone to object detection and semantic segmentation as well.

Going further

If you want to play around with these models, many of the papers above are implemented in our open-source self-supervised learning repository, LightlySSL. Please take a look at the code and feel free to contribute!
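As a starting point, here is a minimal SimCLR-style sketch built from LightlySSL components. It assumes the library exposes NTXentLoss and SimCLRProjectionHead as in recent versions; module paths and signatures can change, so the repository's own examples are the authoritative reference.

```python
# A minimal SimCLR-style training sketch with LightlySSL building blocks.
import torch
import torchvision
from lightly.loss import NTXentLoss
from lightly.models.modules import SimCLRProjectionHead

backbone = torchvision.models.resnet18()
backbone.fc = torch.nn.Identity()                # keep the 512-dim backbone features
projection_head = SimCLRProjectionHead(512, 512, 128)
criterion = NTXentLoss(temperature=0.5)

# x0 and x1 stand in for two augmented views of the same batch of images.
x0, x1 = torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)
z0 = projection_head(backbone(x0))
z1 = projection_head(backbone(x1))
loss = criterion(z0, z1)
loss.backward()
```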

Niccolò Avogaro,
Solution Engineer at Lightly