Using Self-Supervised Learning for Dense Prediction Tasks

Self-Supervised Learning (SSL) has made significant strides in computer vision, particularly for image classification tasks. However, most SSL methods are sub-optimal for dense prediction tasks such as object detection, instance segmentation, and semantic segmentation. This performance gap stems from fundamental differences in how these tasks process and interpret visual information.

Traditional SSL methods focus on learning global representations of images, which work well for classification tasks where the goal is to assign a single label to an entire image. In contrast, dense prediction tasks require fine-grained, localized information about every pixel or region in a picture. This discrepancy leads to several challenges:

  1. Image classification tasks typically use image-level representations, while dense prediction tasks need pixel-level or region-level information. Standard SSL methods often fail to capture the necessary local details.
  2. Many SSL techniques are trained on datasets that prioritize centered, prominent objects. However, dense prediction tasks often deal with complex scenes containing multiple objects of varying sizes and positions.
  3. Dense prediction tasks require a deep understanding of spatial relationships and context within an image, which may not be fully captured by global representation learning.

Researchers have recently developed SSL methods specifically designed for dense prediction tasks to address these issues. These approaches aim to bridge the gap between Self-Supervised pre-training and the requirements of object detection, instance segmentation, and semantic segmentation.

In this article, we will explore the challenges of applying SSL to dense prediction tasks and discuss how these new methods adapt SSL techniques to capture local features, understand complex scenes, and provide the detailed spatial information necessary for effective dense predictions.

Brief Overview of Contrastive Methods

While there are many families of Self-Supervised Learning models, recent breakthroughs have come from models that formulate learning as an image-level prediction task using global features. One popular method in this family is SimCLR, introduced in A Simple Framework for Contrastive Learning of Visual Representations, 2020.

Figure: Overview of the SimCLR model architecture. Source: The Illustrated SimCLR Framework by Amit Chaudhary

Given an unlabelled dataset of images, random “views” are generated using data augmentation strategies for each image. These random views are fed into an encoder (backbone + projection head).

Since these “views” originate from the same image, their representations should be similar in the latent space, while views from different images should be dissimilar. Thus, a contrastive objective can be formulated that pulls representations of the same image together and pushes them away from representations of different images, as sketched below.
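
The snippet below is a minimal PyTorch sketch of such a contrastive objective in the NT-Xent style popularized by SimCLR. The function name and temperature are illustrative, and z1 and z2 are assumed to be the projected embeddings of two augmented views of the same batch of images.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Illustrative NT-Xent-style contrastive loss for two batches of views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, D)
    sim = z @ z.T / temperature                    # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))              # a view is never its own positive
    n = z1.shape[0]
    # The positive for view i in z1 is view i in z2, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```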

Contrastive Learning As A Look-Up Task

Contrastive Learning can also be viewed as a dictionary look-up task. For each encoded query, there is a set of encoded keys, among which a single positive key matches the query. These encoded queries and keys are generated from different views.

For an encoded query, the positive key encodes a different view of the same image, while the negative keys encode views of different images. The InfoNCE contrastive loss is then employed to pull the query close to its positive key while pushing it away from the negative keys:

Equation: Contrastive Loss. q represents the query, k+ represents the positive key, k- represents the negative key, and tau is the temperature parameter  
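
Written out with the symbols from the caption (a single positive key and a set of negative keys), the InfoNCE loss takes the familiar form:

$$
\mathcal{L}_q = -\log \frac{\exp(q \cdot k_{+} / \tau)}{\exp(q \cdot k_{+} / \tau) + \sum_{k_{-}} \exp(q \cdot k_{-} / \tau)}
$$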

This approach has shown impressive results, often performing nearly as well as models trained on labeled data for certain tasks. However, as we’ll discuss later, there are still challenges when applying these methods to dense prediction tasks like object detection or image segmentation.

Why don’t standard methods work well?

While recent Self-Supervised Learning (SSL) methods have shown impressive performance on image classification tasks, often matching or surpassing supervised learning approaches, they have struggled when applied to dense prediction tasks such as object detection and image segmentation.

Representation Granularity

Most SSL methods are designed to learn global image representations, which work well for classification but are insufficient for dense prediction tasks. Classification tasks require features that represent the overall content of an image. However, dense prediction tasks need features that contain precise localization information. For object detection or segmentation, the model must determine for each pixel whether an object is present and what type of object it is. This requires learning features that preserve spatial information and object boundaries.

Figure: Overview of recent SSL methods. Notice how most of them work with image-level representations. Source: https://arxiv.org/abs/2105.04906

Object-Centric Datasets

Most SSL methods are evaluated on object-centric datasets like ImageNet and JFT, where each image typically contains one prominent object. When taking random crops or augmentations of these images, it’s often assumed that different views of the same image will contain the same object. This assumption helps the model learn consistent features for a single object, which works well for classification tasks but not for dense prediction tasks. Moreover, these datasets don’t reflect the complexity of real-world scenes that dense prediction tasks often encounter, where multiple objects of varying sizes and positions are present.

Recognizing these challenges has led to the development of new SSL approaches specifically designed for dense prediction tasks, which we’ll explore in the following sections.

DenseCL

Wang et al., in Dense Contrastive Learning for Self-Supervised Visual Pre-training, 2021, introduced a new pretext task, Dense Contrastive Learning (DenseCL), an extension of classical contrastive methods that computes the loss on local features. It can be seen as an extension of SimCLR, with the loss applied to both global and local representations. The method adds negligible computational overhead compared to prior methods and helps close the gap between Self-Supervised pre-training and dense prediction tasks.

Figure: Illustrated Dense Contrastive Learning. Source: https://arxiv.org/abs/2011.09157

DenseCL views Self-Supervised Learning as a dense pairwise contrastive learning task rather than a global image-level one. The method has three components:

  1. A dense projection head that takes the features from a backbone network as input and outputs dense feature vectors.
  2. A definition of the positive sample for each local feature vector, obtained by extracting correspondences across views.
  3. A dense contrastive loss that extends the conventional InfoNCE loss to a dense paradigm.

Extending the contrastive framework to this dense paradigm, the DenseCL projection head consists of two parallel sub-heads that produce global and dense projection vectors, respectively. The global sub-head is the same as the standard heads in the literature, while in the dense sub-head the global pooling layer is removed and the MLP is replaced by 1x1 convolution layers with an otherwise identical configuration. The backbone and the two parallel heads are trained end-to-end by optimizing a joint pairwise contrastive (dis)similarity loss at the level of both global and local features, as sketched below.
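
Below is a minimal sketch of the two parallel sub-heads described above, assuming a ResNet-style backbone that outputs a (B, C, H, W) feature map. The class and parameter names are illustrative rather than the authors' implementation.

```python
import torch.nn as nn

class DenseCLHeads(nn.Module):
    """Global head (pool + MLP) and dense head (1x1 convolutions, no pooling)."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.global_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_dim, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )
        # Same structure as the MLP, but with 1x1 convolutions and no global
        # pooling, so the output keeps a spatial grid of local feature vectors.
        self.dense_head = nn.Sequential(
            nn.Conv2d(in_dim, hidden_dim, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, out_dim, kernel_size=1),
        )

    def forward(self, feats):              # feats: (B, C, H, W) backbone features
        return self.global_head(feats), self.dense_head(feats)
```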

In DenseCL, each query no longer represents a whole view but a local part of a view. Each negative key is the pooled feature vector of a view from a different image, while the positive key is the corresponding local feature in the other view, found using a cosine similarity matrix: the key most similar to the query is chosen as its positive. The dense contrastive loss is defined as:

Equation: Dense Contrastive Loss from DenseCL. r represents the encoded query, and t represents the encoded keys.
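
Schematically, following the InfoNCE form above and the notation in the caption, the dense loss averages a per-location contrastive term over the S local feature vectors of a view; treat this as a sketch rather than the paper's exact equation:

$$
\mathcal{L}_r = \frac{1}{S} \sum_{s=1}^{S} -\log \frac{\exp(r^{s} \cdot t^{s}_{+} / \tau)}{\exp(r^{s} \cdot t^{s}_{+} / \tau) + \sum_{t^{s}_{-}} \exp(r^{s} \cdot t^{s}_{-} / \tau)}
$$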

The overall objective is then:

Equation: Overall objective function for DenseCL
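
With L_q denoting the global contrastive loss and λ a coefficient balancing the two terms, the combined objective has the form:

$$
\mathcal{L} = (1 - \lambda)\,\mathcal{L}_q + \lambda\,\mathcal{L}_r
$$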

The authors report improved performance on dense prediction tasks such as Object Detection, Instance Segmentation, and Semantic Segmentation.


Figure: Comparison of DenseCL and related methods on Semantic Segmentation. Source: https://arxiv.org/abs/2011.09157

DetCon

Hénaff et al., in Efficient Visual Pre-training with Contrastive Detection, 2021, introduced a new objective, DetCon, which maximizes the similarity of object-level features across augmentations. It can be viewed as an extension of the DenseCL idea, but instead of matching features based on similarity, it relies on precomputed segmentation masks, which makes it computationally efficient.

Figure: Overview of the DetCon objective function and method. Source: https://arxiv.org/abs/2103.10957

This objective maximizes the similarity across views of local features that represent the same object, and it offers three benefits:

  1. It extracts a separate learning signal from every object in an image (object-level features are obtained directly from intermediate feature arrays).
  2. It provides a larger and more diverse set of negative samples to contrast against.
  3. The objective is much better suited to learning from complex scenes with many objects, a pretraining domain that has proven challenging for self-supervised methods.

Much like SimCLR and BYOL, the authors produce two augmented views of each image. In addition, they compute segmentation masks for each image using off-the-shelf unsupervised segmentation algorithms. These masks are passed through the same set of geometric augmentations, resulting in a pair of augmented masks that remain aligned with the two views. Features are then pooled within each mask and matched across views with a contrastive objective.
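
The sketch below isolates the mask-pooling step: averaging a feature map over each (downsampled) segmentation mask to obtain one feature vector per putative object. Function and variable names are illustrative, not the authors' code.

```python
import torch

def mask_pooled_features(feature_map, masks):
    """feature_map: (C, H, W); masks: (M, H, W) binary masks resized to the feature grid."""
    C, H, W = feature_map.shape
    feats = feature_map.reshape(C, H * W)               # (C, HW)
    m = masks.reshape(masks.shape[0], H * W).float()
    m = m / m.sum(dim=1, keepdim=True).clamp(min=1.0)   # average rather than sum
    return m @ feats.T                                  # (M, C): one vector per mask
```

A contrastive loss can then pull together the pooled vectors of the same mask across the two views, with pooled vectors from other masks and other images serving as negatives.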

Equation: Contrastive loss between the latent representations of augmented masks. v denotes the latent representations of the masks denoted by m

The authors also include negative samples from different masks in the same image and from different images in the batch. A natural extension of this loss would be to jointly sample paired masks that correspond to the same region in the original image and maximize the similarity of the features representing them. The authors make some practical changes and reformulate this as:

Equation: Overall loss objective for DetCon. The binary variable indicates whether the masks correspond to the same underlying region.

Notably, the authors report that the DetCon framework leads to impressive performance in fewer iterations when compared to existing methods.

Figure: Depiction of how DetCon is an efficient method for Self-Supervised pre-training for dense prediction tasks. Source: https://arxiv.org/abs/2103.10957

SelfPatch

Yun et al., in Patch-level Representation Learning for Self-Supervised Vision Transformers, 2022, introduced a new visual pretext task, coined SelfPatch, for learning better patch-level representations tailored to Vision Transformers (ViTs) and their unique architectural advantages. They enforce invariance between each patch and its neighbors, i.e., each patch treats semantically similar neighboring patches as positive samples. This is motivated by the assumption that adjacent patches often share a common semantic context.

Figure: Illustrated SelfPatch architecture. Source: https://arxiv.org/abs/2206.07990

ViTs naturally produce patch-level representations, but the pretext tasks used in existing SSL schemes rely only on whole-image-level self-supervision and do not explicitly learn patch-level representations. As a result, existing SSL-trained ViTs may fail to capture semantically meaningful relations among patches.

The most common way to apply SSL to ViTs is to construct a positive pair by applying different augmentations to an image and enforcing invariance between them, i.e., different views of the same image should be mapped close together in the latent space. A generic formulation of this idea is as follows:

Equation: Typical Self-Supervised Loss function for ViTs

where

  • D is the distance function being used to compare the outputs
  • g represents a projection head
  • f represents a ViT backbone
  • sg represents the stop gradient operation
  • alpha and beta are the parameters of the two models
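
Putting these symbols together, the objective for two augmented views x1 and x2 of an image can be written schematically as:

$$
\mathcal{L}(x) = D\big(g_{\alpha}(f_{\alpha}(x_1)),\ \mathrm{sg}\big[g_{\beta}(f_{\beta}(x_2))\big]\big)
$$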

The SelfPatch method can be summarised as follows: for each patch, “positive matching” finds a set of candidate positive patches from its neighborhood, and the “aggregation module” then aggregates their representations. Let’s dive into each of these in detail.

  1. Neighboring Patches: Given a query patch, it is assumed that at least one semantically similar patch exists in its neighborhood. The authors simply use the adjacent patches as the neighborhood.
  2. Matching Patches: To sample positive patches for a given query patch, the authors measure the semantic closeness of all of the query patch’s neighboring patches. They use cosine similarity in the representation space to gauge the similarity between patches and take the top-k most similar patches as positives (see the sketch after this list).
  3. Aggregation Module: The authors use an aggregation module to construct patch-level representations for each patch. A key point here is that this aggregation module is used not only for the patch representations but also for the image-level representation.
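
The following is a minimal sketch of the positive-matching step from item 2, assuming the patch representations have already been computed; the function name and the choice of k are illustrative.

```python
import torch
import torch.nn.functional as F

def top_k_neighbors(patch_reps, query_idx, neighbor_idx, k=4):
    """patch_reps: (N, D) patch representations; neighbor_idx: list of adjacent patch indices."""
    query = patch_reps[query_idx]                          # (D,)
    neighbors = patch_reps[neighbor_idx]                   # (len(neighbor_idx), D)
    sims = F.cosine_similarity(neighbors, query.unsqueeze(0), dim=-1)
    top = sims.topk(min(k, len(neighbor_idx))).indices
    return [neighbor_idx[i] for i in top.tolist()]         # indices of positive patches
```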

The Overall Loss Objective for SelfPatch is as follows:

Equation: Overall objective function for SelfPatch

NOTE: The SelfPatch loss has an asymmetric form, since the query representation does not pass through the aggregation module while its target does. The authors empirically show that this asymmetry, together with the stop-gradient operation, avoids mode collapse when training the patch-level representations.

The authors report strong performance when SelfPatch is used in conjunction with methods like DINO.

Figure: Comparison of SelfPatch with related methods on Object Detection and Segmentation. Source: https://arxiv.org/abs/2206.07990

VICReg(L)

Initially introduced in VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning, 2022, the VICReg architecture has emerged as a simple yet effective method in the Canonical Correlation Analysis (CCA) family of models. These models aim to learn relationships between any two data points by analyzing their cross-covariance matrices.

Figure: Overview of the VICReg architecture. Source: https://arxiv.org/abs/2105.04906

Designed to prevent the collapse in which the encoders produce constant or non-informative vectors, VICReg balances variance, invariance, and covariance terms between the latent representations of two views of an image. Unlike the other methods we have discussed so far, VICReg:

  • does not require that the weights of the two branches be shared, nor that the architectures be identical, nor that the inputs be of the same nature
  • does not require a memory bank, contrastive samples, or a large batch size
  • does not require batch-wise or feature-wise normalization
  • does not require vector quantization or a predictor module

The composite loss function is then a weighted combination of three individual terms: variance (forces the embedding vectors of samples within a batch to be different), invariance (forces embeddings from the same image to be similar), and covariance (decorrelates the dimensions of each embedding and prevents an informational collapse).

Equation: Loss term between two embedding vectors. s, v, and c are the invariance, variance, and covariance terms. λ, μ, and ν are the scalar coefficients weighing the individual terms.

Equation: Loss function for VICReg.
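
A compact PyTorch sketch of the three terms and their weighted combination is shown below. The function name, default coefficients, and thresholds are illustrative, and z_a and z_b are assumed to be the (N, D) embeddings of two views of a batch.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    n, d = z_a.shape
    # Invariance: embeddings of the two views of the same sample should match.
    inv = F.mse_loss(z_a, z_b)
    # Variance: keep the std of every embedding dimension above gamma (hinge loss).
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = F.relu(gamma - std_a).mean() + F.relu(gamma - std_b).mean()
    # Covariance: penalize off-diagonal covariance to decorrelate dimensions.
    za_c, zb_c = z_a - z_a.mean(dim=0), z_b - z_b.mean(dim=0)
    cov_a = (za_c.T @ za_c) / (n - 1)
    cov_b = (zb_c.T @ zb_c) / (n - 1)
    cov = (cov_a.pow(2).sum() - cov_a.diagonal().pow(2).sum()) / d \
        + (cov_b.pow(2).sum() - cov_b.diagonal().pow(2).sum()) / d
    return lam * inv + mu * var + nu * cov
```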

With impressive performance on downstream tasks such as object detection and retrieval, VICReg is an explicit, effective, yet simple method for preventing collapse in Self-Supervised joint-embedding learning.

In the follow-up work VICRegL: Self-Supervised Learning of Local Visual Features, 2022, the authors explored the fundamental trade-off between learning local and global features. They introduced a new method, VICRegL, that learns good global and local features simultaneously, yielding excellent performance on dense prediction tasks while maintaining good performance on classification tasks.

Figure: Overview of the VICRegL architecture. Source: https://arxiv.org/abs/2210.01571

VICRegL focuses on convolutional networks as the encoder of choice and operates on the unpooled representations generated from two views of the same image. The main idea is to apply the VICReg criterion between pairs of feature vectors by matching elements of the vectors, using spatial and L2-distance-based information.

After the encoder generates embeddings, the unpooled representations are fed into a separate local projector head, while the average-pooled representations are fed into the main VICReg projector head. This leads to two separate losses:

  1. Location-Based: This loss matches each spatial coordinate of one embedding with its closest coordinate in the other embedding, based on the spatial correspondence given by the transformation between the two views.
  2. Feature-Based: This loss matches features that are close in the embedding space: each coordinate’s embedding is matched, in terms of L2 distance, with its closest embedding produced from the other view.

The final loss function combines the location-based and feature-based losses, which form the local criterion, with a standard VICReg loss applied to the pooled representations, which forms the global criterion. Both the location-based and feature-based losses are symmetrized, because for both of them the search for the best match is not a symmetric operation.
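
Schematically, writing L_loc and L_feat for the location-based and feature-based losses (the local criterion) and L_VICReg for the standard loss on the pooled representations (the global criterion), the final objective has the form below, where α is an assumed weighting coefficient balancing the two criteria:

$$
\mathcal{L} = \alpha\,\mathcal{L}_{\text{VICReg}} + (1 - \alpha)\,\big(\mathcal{L}_{\text{loc}} + \mathcal{L}_{\text{feat}}\big)
$$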

Figure: Comparison of VICRegL with related methods on Linear Classification and Segmentation. Source: https://arxiv.org/abs/2210.01571

The authors report significant performance gains on dense prediction tasks using the modified VICRegL architecture without significant drops in image classification.

DropPos

Wang et al. introduced DropPos in Pre-Training Vision Transformers by Reconstructing Dropped Positions, 2023. This method aims to enhance the spatial awareness of Vision Transformers (ViTs) through a clever Self-Supervised task. The core idea is simple yet effective: the model learns to predict the original location of image patches within the full image.

By challenging the model to determine where a patch belongs in the larger image context, DropPos forces the ViT to develop a strong understanding of spatial relationships and visual cues.

Figure: Comparison between Contrastive Learning, Masked Image Modeling, and DropPos as pretext tasks. Source: https://arxiv.org/abs/2309.03576

Since a Vision Transformer operates on patches together with positional embeddings, simply dropping a random subset of the positional embeddings forces the model to learn to reconstruct the position of each patch, which naturally yields a location-aware pretext task. To avoid trivial solutions, the authors increase the difficulty of the task by keeping only a subset of patches visible, and they use position smoothing and attentive reconstruction strategies to relax the resulting classification problem. Moreover, unlike other methods, DropPos does not rely heavily on strong image augmentations, since it is trained with a simple position-reconstruction loss.

Figure: Illustration of the DropPos paradigm. Source: https://arxiv.org/abs/2309.03576

First, a large random subset of the input patches is masked out. Then, the positional embeddings of the remaining visible patches are randomly dropped and replaced with [MASK] position tokens. A lightweight projector is adopted to reconstruct the dropped positions, and a simple classification objective is applied to the visible patches whose positions were dropped, as sketched below.
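
A minimal sketch of the position-prediction step under the description above: a linear head over patch features predicts each patch's position index, and the cross-entropy loss (before smoothing and attentive weighting) is computed only on patches whose positions were dropped. All names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionClassifier(nn.Module):
    """Lightweight head that predicts, for every patch token, its position index."""
    def __init__(self, dim, num_patches):
        super().__init__()
        self.head = nn.Linear(dim, num_patches)   # one class per possible position

    def forward(self, patch_features):            # (B, N, dim) encoder outputs
        return self.head(patch_features)          # (B, N, num_patches) position logits

def droppos_loss(logits, true_positions, dropped):
    """logits: (B, N, P); true_positions: (B, N) long; dropped: (B, N) bool mask."""
    return F.cross_entropy(logits[dropped], true_positions[dropped])
```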

  • Position Smoothing: Since the different positions are not completely independent in this paradigm, the classification problem is relaxed so that predictions close to the actual position are penalized less. In particular, the position targets are smoothed using a weight matrix that measures the similarity between different positions.

Equation: Position smoothing based on the relative distance between two positions. The authors use Euclidean distance as the distance measure.

The entire weight matrix is then normalized by neighboring positions.

Equation: Smoothed ground truth values.

  • Attentive Reconstruction: Since different patches may share a similar visual appearance, it is not necessary to reconstruct their exact positions: swapping such patches still maintains reasonable visual coherence. An extra attentive term, which computes attentive similarities between the encoder features of patches, is added to the overall loss function.

Equation: Attentive weights for a given patch i.

Figure: Comparison with other methods on downstream tasks. Source: https://arxiv.org/abs/2309.03576

NaViT: Patch n’ Pack

Dehghani et al., in Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution, 2023, challenged the assumption of working with fixed-resolution images. They leverage the Vision Transformer’s (ViT’s) ability to do flexible sequence-based modeling with varying input sequence lengths, and they introduce NaViT (Native Resolution ViT), in which patches from multiple images are packed into a single sequence, enabling variable resolutions while preserving aspect ratios.

Figure: Illustrated example of variable resolution packing. Source: https://arxiv.org/abs/2307.06304

This simple change during data preprocessing enables NaViT to consistently outperform ViT at a fixed computational budget. The position-wise operations in the network, such as MLPs, residual connections, and layer normalizations, do not need to be altered.
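
Below is a simplified sketch of the packing idea, assuming patch tokens have already been extracted per image: the sequences are concatenated and a block-diagonal attention mask keeps tokens from attending across images. Shapes and names are illustrative, not the NaViT implementation.

```python
import torch

def pack_sequences(patch_seqs):
    """patch_seqs: list of (n_i, dim) patch-token tensors, one per image."""
    tokens = torch.cat(patch_seqs, dim=0)                               # (sum n_i, dim)
    ids = torch.cat([torch.full((p.shape[0],), i) for i, p in enumerate(patch_seqs)])
    attn_mask = ids[:, None] == ids[None, :]                            # block-diagonal mask
    return tokens, attn_mask

# Example: packing three images with different resolutions (9, 16, and 25 patches).
seqs = [torch.randn(n, 768) for n in (9, 16, 25)]
tokens, mask = pack_sequences(seqs)
```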

This empowers ViTs to move past the limitations imposed by current data and modeling pipelines, enabling ideas that were previously restricted by the constraint of fixed batch shapes, and opening new possibilities for innovation.

Figure: NaViT offers improvements in Object Detection as well. Source: https://arxiv.org/abs/2307.06304

Conclusion

In this article, we’ve explored various Self-Supervised Learning (SSL) methods designed specifically for dense prediction tasks such as object detection, instance segmentation, and semantic segmentation. These methods address the limitations of traditional SSL approaches, which often struggle with tasks requiring fine-grained, localized information.

The key insight behind these dense SSL methods is their focus on generating representations that contain more precise localization information. In particular,

  1. DenseCL extends contrastive learning to local features, applying the loss to both global and local representations. This approach helps the model learn spatially-aware features without significant computational overhead.
  2. DetCon uses pre-computed segmentation masks to match object-level features across augmentations, making it particularly effective for complex scenes with multiple objects.
  3. SelfPatch is tailored for Vision Transformers and enforces invariance between neighboring patches, helping to capture semantically meaningful relationships among image regions.
  4. VICRegL balances the learning of both global and local features by applying variance, invariance, and covariance regularization at different scales.
  5. DropPos enhances spatial awareness in Vision Transformers by challenging the model to predict the original locations of image patches, fostering a deep understanding of spatial relationships.
  6. NaViT (Patch n’ Pack) allows for flexible input resolutions and aspect ratios, enabling more effective processing of varied image sizes and shapes.

These methods demonstrate that by explicitly incorporating localization objectives into the Self-Supervised learning process, we can significantly improve performance on dense prediction tasks. They achieve this by applying contrastive learning at the patch or object level, leveraging spatial relationships between image regions, incorporating position prediction tasks, and balancing global and local feature learning.

For practitioners, the choice of method may depend on the specific architecture and task at hand. ResNet-based models can easily incorporate techniques like DenseCL, VICRegL, and DetCon with minor adjustments to the training process. Vision Transformers, on the other hand, may benefit more from approaches like SelfPatch, DropPos, and NaViT.

Importantly, many of these improvements can be implemented with minimal computational overhead on top of existing pipelines, offering significant gains in downstream performance on dense prediction tasks.