Self-Supervised Learning at ECCV 2024

This year’s “Self-Supervised Learning—What is Next?” workshop at ECCV 2024 is approaching. This article summarises the motivations and goals of all the papers featured at the workshop.

Disentangling the Effects of Data Augmentation and Format Transform in Self-Supervised Learning of Image Representations — Presented by Neha Kalibhat

Self-supervised learning can broadly be classified into two types: generative and invariance-based. Invariance-based approaches involve joint-embedding pre-training with two or more views of the same input data sample.

Augmentations play a significant role in preventing collapse (where the representations of multiple views collapse to identical ones) while pre-training on images. They also introduce useful inductive biases for downstream tasks. The authors demonstrate that increasing the diversity of pre-training augmentations leads to improved performance, while removing them hurts performance. Removing random cropping shows the largest drop in performance (followed by grayscale) compared to the baselines, which retain all augmentations. Thus, the authors hypothesise that increasing augmentation diversity leads to better invariance and, therefore, better downstream performance.

Figure: Increased diversity in pretraining augmentations leads to better performance, and removing individual augmentations has detrimental effects on performance. Source: Kalibhat et al. (2023)

Meanwhile, the audio and speech domains have benefitted from SSL that maximises the mutual information between time- and frequency-domain formats in the latent space, using classic Fourier transforms and format-specific augmentations. These transformations represent the same data under different coordinates. Contrasting multiple formats (raw and frequency) of the same input is especially interesting in the image space as well, as it potentially generates rich embeddings that encode both formats. The authors aim to incorporate Fourier Domain Augmentations (FDA), inspired by audio and other temporal signals, to improve overall augmentation diversity.

Proposed Augmentations

Figure: Proposed Fourier Domain Augmentations integrated with standard image augmentations like random cropping, colour jitter, grayscale etc. Source: Kalibhat et al. (2023)

The Fourier spectrum of an image is generated with the Fast Fourier Transform, computed via the RFFT2D operation.

The authors propose the following augmentations that perturb different properties in the Fourier spectrum.

  1. Amplitude Re-scale: Using the amplitude of the spectrum, a uniform noise vector is generated and applied to each channel of the image’s FFT. When inverted, this results in non-uniform perturbations of the image colour space.
  2. Phase Shift: A randomly sampled constant shifting factor is applied to the spectrum phase, creating a movement effect in the image in which specific high-frequency attributes are brightened.
  3. Random Frequency Mask: A binary mask is applied across all the channels, randomly setting some frequencies to 0. This transform turns off both high- and low-frequency modes across all channels, preserving the colour scheme but producing a cloudy texture applied non-uniformly throughout the image (a simplified sketch follows this list).
  4. Gaussian Mixture Mask: A randomly sampled mask is generated using a random set of origins and standard deviations, with a 2D Gaussian kernel drawn around each origin. This flexibly masks low and high frequencies, and the resulting images show unique textures containing both blurred and sharpened artefacts.
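
As a concrete illustration, here is a minimal sketch of the Random Frequency Mask idea using PyTorch's FFT utilities. It is not the authors' implementation; the drop probability, shared-mask shape, and clamping to [0, 1] are illustrative assumptions.

```python
import torch

def random_frequency_mask(img: torch.Tensor, drop_prob: float = 0.3) -> torch.Tensor:
    """Zero out random frequency modes of an image tensor of shape (C, H, W).

    A simplified sketch of the Random Frequency Mask augmentation: the same
    binary mask is shared across all channels, so the colour scheme is
    preserved while textures become 'cloudy'. Assumes pixel values in [0, 1].
    """
    spectrum = torch.fft.rfft2(img)                       # (C, H, W//2 + 1), complex
    mask = (torch.rand(spectrum.shape[-2:]) > drop_prob).float()
    masked = spectrum * mask                              # broadcast over channels
    out = torch.fft.irfft2(masked, s=img.shape[-2:])      # back to pixel space
    return out.clamp(0.0, 1.0)

# Usage: img = torch.rand(3, 224, 224); aug = random_frequency_mask(img)
```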

Results

Across all ablations, the authors observe that combining standard image augmentations with FDA during pre-training in the image domain results in the best downstream performance.

Results: Comparison of linear probing top-1 accuracy using standard SSL techniques and FDA augmentations. Source: Kalibhat et al. (2023)

The authors examine SimCLR, MoCo (v2), BYOL, and SimSiam with a ResNet-50 backbone and find that combining FDA with the existing image augmentations provides the best results.


Results: Few Shot and Transfer Learning performance using frozen encoders. Source: Kalibhat et al. (2023)

They also conduct experiments to study the individual effects of each part of the FDA pipeline. This paper opens several questions for future research.

  • Are there better methods for encoding the Fourier spectrum of an image?
  • How can this Fourier input be structured better for use in encoders?
  • How does this perform in specialised domains such as Medical images or satellite imagery?

SASSL: Enhancing Self-Supervised Learning via Neural Style Transfer — Presented by Renan A. Rojas-Gomez

While current SOTA augmentation techniques incorporate a wide range of colour, spectral, and spatial transformations, they often disregard an image's natural structure. This degrades the semantic information of the augmented samples, ultimately impacting downstream performance. To this end, the authors introduce Style Augmentations for Self-Supervised Learning (SASSL), based on Neural Style Transfer, to generate semantically consistent augmented samples.

These style augmentations disentangle an image into perceptual (style) and semantic (content) representations.

Proposed Method


Figure: Example Usage of Style Augmentations. Source: Rojas-Gomez et al. (2023)

Style Transfer combines a content image's semantics with a style image's appearance (texture).

  1. Intermediate representations of the content and style image are generated using a feature extractor.
  2. An intermediate stylised image is generated using a convex combination of the intermediate representations using a blending factor.
  3. The final stylised output is obtained as a convex combination of the intermediate stylised image and the content image based on an interpolation factor.

Figure: Demonstration of the effects of the Interpolation Factor and Blending Factor. Source: Rojas-Gomez et al. (2023)
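
A minimal sketch of the two convex combinations described above, assuming hypothetical `encode` and `decode` functions for moving between image and feature space (the paper uses a pre-trained style-transfer network); the default factor values are placeholders.

```python
import torch

def sassl_augment(content: torch.Tensor,
                  style: torch.Tensor,
                  encode,            # feature extractor: image -> features (assumed)
                  decode,            # hypothetical inverse mapping: features -> image
                  blending: float = 0.5,
                  interpolation: float = 0.5) -> torch.Tensor:
    """Sketch of a SASSL-style augmentation via two convex combinations.

    1. Blend content and style features with the blending factor.
    2. Interpolate the decoded stylised image with the original content
       image using the interpolation factor.
    """
    f_content, f_style = encode(content), encode(style)
    f_blended = (1.0 - blending) * f_content + blending * f_style   # feature-space blend
    stylised = decode(f_blended)                                    # intermediate stylised image
    return (1.0 - interpolation) * content + interpolation * stylised
```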

Generating Style References


Figure: Stylized images generated using style references from the same domain (in batch) and other domains (external). Source: Rojas-Gomez et al. (2023)

There are two ways one can leverage style references.

  1. External Stylisation: This involves using an external dataset as the style dataset.
  2. In-Batch Stylisation: This involves using the styles already present in the content dataset, stylising samples with other images from the same mini-batch. This enables the use of a single dataset for both training and stylisation.

Results

Results: Comparison of downstream Linear probing accuracy (%). Source: Rojas-Gomez et al. (2023)

This technique boosts top-1 classification accuracy on ImageNet by up to 2% compared to established self-supervised methods like MoCo, SimCLR, and BYOL while achieving superior transfer learning performance across various datasets.

VTCD: Understanding Video Transformers via Universal Concept Discovery — Presented by Matthew Kowal

For an overview of recent methods applying self-supervised techniques to video, refer to our article.

While video transformers share their architecture with image-based vision transformers, the added temporal dimension of video means that per-frame texture or semantic cues alone cannot explain abilities such as object tracking.

To this end, the authors present a Video Transformer Concept Discovery algorithm (VTCD) to interpret the representations of deep video transformers. They decompose the representation at any given layer into human-interpretable “concepts” without any labelled data (i.e. concept discovery) and then rank them by their importance to the model output. The aim is to explain the decision-making process of video transformers through high-level spatiotemporal concepts that are discovered automatically.

Proposed Method

Code Available Here: YorkUCVIL/VTCD

Figure: Visual Representation of the proposed concept masking for a single concept. Source: Kowal et al. (2024)

Primer on Concept-Based Interpretability: Concept-based interpretability is a family of interpretability methods for understanding the representations a model utilizes for a given task. While a labelled dataset of concepts is sometimes available, collecting one is often infeasible for videos. Unsupervised concept discovery, on the other hand, uses clustering to group data into concepts within the feature space.

  1. Model features from a given layer are clustered into spatiotemporal tubelets, called proposals, via Simple Linear Iterative Clustering (SLIC).
  2. These proposals are then compared across videos to discover high-level concepts.
  3. Each of these concepts is then ranked by its importance to the model output.

Unlike image crops in the image domain, videos yield significantly more tubelets and therefore require a more efficient segmentation method.

Figure: Overview of the Video Transformer Concept Discovery pipeline. Source: Kowal et al. (2024)
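
To make the first step concrete, here is a simplified sketch of grouping per-token features into spatiotemporal tubelets. It uses ordinary K-means as a stand-in for the SLIC-style clustering used in the paper, and the shapes and cluster count are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_tokens_into_tubelets(features: np.ndarray, num_tubelets: int = 12) -> np.ndarray:
    """Group video-transformer token features into spatiotemporal tubelets.

    features: (T, H, W, C) array of per-token features from one layer.
    Returns an integer label volume of shape (T, H, W); tokens sharing a
    label form one tubelet proposal.
    """
    t, h, w, c = features.shape
    flat = features.reshape(-1, c)
    labels = KMeans(n_clusters=num_tubelets, n_init=10).fit_predict(flat)
    return labels.reshape(t, h, w)

# Usage: feats = np.random.randn(8, 14, 14, 768); tubelets = cluster_tokens_into_tubelets(feats)
```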

Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders — Presented by Renaud Vandeghen

Figure: Overview of SiamMAE (Siamese Masked Autoencoders). Source: Gupta et al. (2023)

The authors build on top of previous work done in SiamMAE, wherein a Siamese Encoder is used to process pairs of frames that are asymmetrically masked. However, SiamMAE had two fundamental limitations:

  • SiamMAE is designed only to process video frames, not images.
  • SiamMAE, while reducing the need for data augmentations, requires extensive training on large video datasets.

Proposed Method

Code Available Here: alexandre-eymael/CropMAE

Figure: Overview of the proposed CropMAE. Source: Eymaël et al. (2024)

  • Two augmented views are generated from an input image by randomly resizing, cropping and horizontally flipping the original image.
  • Both views are patchified, and one is masked using an extremely high mask ratio (98.5%).
  • Both views are then encoded using a Siamese ViT encoder.
  • A Transformer decoder is used to reconstruct the target image.

The encoder and decoder are trained to minimize the L2 norm between the target view and the reconstructed image. After training, the decoder is discarded, and the encoder is used for downstream tasks.
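
A rough sketch of the view generation and masking steps, using torchvision transforms; the 98.5% ratio comes from the description above, while the image and patch sizes are assumptions.

```python
import torch
from torchvision import transforms

# Two independently augmented views of the same image (assumed 224x224, 16x16 patches).
view_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def mask_patches(view: torch.Tensor, patch_size: int = 16, mask_ratio: float = 0.985):
    """Patchify one view (C, H, W) and keep only a tiny fraction of its patches."""
    c, h, w = view.shape
    patches = view.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)
    num_patches = patches.shape[0]
    num_keep = max(1, int(num_patches * (1.0 - mask_ratio)))
    keep = torch.randperm(num_patches)[:num_keep]
    return patches[keep], keep  # visible patches and their indices
```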

Results: Comparison of CropMAE with prior work on video object segmentation, human pose propagation and semantic part propagation. Source: Eymaël et al. (2024)

Deep Spectral Methods for Unsupervised Ultrasound Image Interpretation — Presented by Yordanka Velikova

Code Available Here: alexaatm/UnsupervisedSegmentor4Ultrasound

Interpreting ultrasound images presents several challenges, including inconsistent intensity levels, poor contrast, and intrinsic artefacts. To overcome these obstacles, enhanced visualisation methods can significantly improve image interpretation, particularly for discerning anatomical structures and distinguishing between various tissue types.

Figure: Overview of the proposed segmentation pipeline. Source: Tmenova et al. (2024)

1. Spectral Clustering

  • Unlike standard images, ultrasound images lack distinct borders and diverse colours. Therefore, the authors use a pre-processing block consisting of standard techniques like Gaussian blurring, histogram equalisation, and pre-trained denoising models like MPRNet.
  • An essential step in spectral clustering is to treat image segmentation as a graph-cutting problem: an image is a graph whose nodes are pixels or patches and whose edges encode the similarity between nodes. The self-correlation of DINO features provides an effective affinity matrix, enabling successful graph partitioning and meaningful image segments.
  • Since colour affinities can’t be leveraged for greyscale ultrasound images, the authors build another patch-wise affinity matrix using the traditional sum of squared differences and mutual information.
  • Another position-based affinity matrix is created using linear interpolation of k-nearest neighbours of the SSD distance between feature vectors of any given patch.

A weighted combination of these three affinities creates a final affinity matrix. The Laplacian of this matrix is then used to compute eigensegments, which, when clustered, yield segmentation maps.
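
A condensed sketch of this spectral step, assuming the three patch-wise affinity matrices have already been computed as symmetric NumPy arrays; the combination weights, number of eigenvectors, and segment count are placeholders.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_segments(aff_feat, aff_ssd_mi, aff_pos,
                      weights=(1.0, 0.5, 0.5), n_segments=6, n_eigvecs=6):
    """Combine patch-wise affinities, build the graph Laplacian, and cluster
    the leading eigenvectors into per-patch segment labels."""
    affinity = (weights[0] * aff_feat
                + weights[1] * aff_ssd_mi
                + weights[2] * aff_pos)
    affinity = np.maximum(affinity, 0.0)          # keep affinities non-negative
    degree = np.diag(affinity.sum(axis=1))
    laplacian = degree - affinity                 # unnormalised graph Laplacian
    _, eigvecs = eigh(laplacian)                  # eigenvectors, ascending eigenvalues
    embedding = eigvecs[:, 1:n_eigvecs + 1]       # skip the constant eigenvector
    return KMeans(n_clusters=n_segments, n_init=10).fit_predict(embedding)
```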

2. Semantic Clustering

The authors refine the segments further by enhancing the differences between features of segments, such as vessels and features of other areas with similar textures, while minimizing the overall segment count. This is done by employing a dual embedding strategy, wherein mask embeddings are constructed to capture shape features via binary masks, and positional embedding encodes the spatial locations of segments.

MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning — Presented by Vishal Nedungadi

Project Page | Dataset | Code

Self-supervised learning still needs to improve when evaluated on image domains other than ImageNet, such as real-world satellite imagery. Multi-modal data offers great potential for learning good semantic representations in such cases.

The authors aim to learn general-purpose representations for optical satellite images from the Sentinel-2 mission that transfer to various downstream tasks, including crop type, land cover, and climate zone classification. They build on fully convolutional masked autoencoders, extending the ConvNeXt-V2 MAE approach with multi-modal reconstruction tasks during pre-training.

Figure: Overview of the proposed MP-MAE architecture. Source: Nedungadi et al. (2024)

Apart from introducing a new dataset, MMEarth, the authors propose modifications to the standard MAE architecture, focusing on ConvNeXt-v2 and Earth Observation Data. In particular, they focus on two fundamental changes:

  • Patch Size: The proposed MP-MAE architecture uses a reduced patch size of 16x16 but still preserves the original 7x7 patch layout. This adjustment is crucial as the number of patches is coupled with the optimal masking ratio.
  • Avoid Early Downsampling: In the traditional setting, the first layer is a learned downsampling layer. However, in MP-MAE, the authors replace it with a convolutional layer (kernel size 3 and stride 1) to learn feature maps at the input resolution.
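
A minimal illustration of this stem change in PyTorch, contrasting a conventional learned-downsampling stem with the stride-1 alternative; the channel counts (12 Sentinel-2 bands in, 96 features out) are assumptions for illustration.

```python
import torch.nn as nn

# Conventional ConvNeXt-style stem: a learned 4x downsampling of the input.
downsampling_stem = nn.Conv2d(in_channels=12, out_channels=96, kernel_size=4, stride=4)

# MP-MAE-style stem as described above: kernel size 3, stride 1 (with padding),
# so the first feature maps are learned at the full input resolution.
full_resolution_stem = nn.Conv2d(in_channels=12, out_channels=96, kernel_size=3, stride=1, padding=1)
```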

They consider several pretext tasks, treating most modalities as individual tasks but grouping the climate variables and splitting latitude and longitude. When combining the embedding tokens into a dense 2D input for the decoders, a learnable mask token is used as a placeholder for the masked patches.

All pretext-task targets are reconstructed using the same random mask to prevent the model from learning shortcuts between the input and the targets. They also apply a learned task-uncertainty weighting to each task-specific decoder loss, which helps handle noisy targets, as the weights for such pretext tasks naturally decrease.
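
As a sketch of what learned task-uncertainty weighting typically looks like (in the spirit of Kendall et al.'s multi-task loss), the snippet below down-weights tasks with high learned uncertainty; whether MMEarth uses exactly this parameterisation is not confirmed here.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combine per-task losses with learned log-variance weights.

    Tasks with high learned uncertainty (e.g. noisy pretext targets) are
    automatically down-weighted; the additive log-variance term keeps the
    weights from collapsing to zero.
    """
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses: list) -> torch.Tensor:
        total = 0.0
        for i, loss in enumerate(task_losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total
```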

ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders — Presented by Carlos Hinojosa

Code Available Here: carlosh93/ColorMAE

The authors introduce a novel and effective noise-filtering masking strategy that maintains the simplicity of random masking while facilitating the learning of more robust visual representations.

Figure: Overview of the proposed masking strategy. Source: Hinojosa et al. (2024)

Drawing from the colour-noise literature in signal/image processing, the authors generate a random noise array and select patches from it according to the desired masking ratio. Filtering the random noise with low-pass, high-pass, band-pass, and band-stop filters produces different noise patterns, labelled red, blue, green, and purple, respectively.

Figure: Demonstration of the various proposed filters. Source: Hinojosa et al. (2024)
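
As a rough sketch, the snippet below turns low-pass-filtered ("red") noise into a patch mask, using a Gaussian blur as an assumed low-pass filter; the exact filters and patch-selection rule in the paper may differ.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def red_noise_mask(grid: int = 14, mask_ratio: float = 0.75, sigma: float = 2.0) -> np.ndarray:
    """Generate a data-independent patch mask from low-pass-filtered noise.

    grid: number of patches per side (e.g. 14 for a 224px image with 16px patches).
    Returns a boolean (grid, grid) array; True marks a masked patch.
    """
    noise = np.random.randn(grid, grid)
    red = gaussian_filter(noise, sigma=sigma)            # low-pass filter -> 'red' noise
    num_masked = int(round(mask_ratio * grid * grid))
    threshold = np.sort(red.ravel())[::-1][num_masked - 1]
    return red >= threshold                              # mask the highest-valued patches

# Usage: mask = red_noise_mask(); mask.sum() is ~147 of 196 patches
```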

This does not rely on external guidance or additional learnable parameters and maintains computational efficiency during training, similar to random masking. They empirically demonstrate that certain colour noises can significantly enhance visual representation quality during pre-training.

PART: Self-Supervised Training with Pairwise Relative Translations — Presented by Melika Ayouhi

Project Page | Code

Most masked image modelling methods place patches on a regular grid. However, most real-world objects do not naturally align with a grid-like structure. Thus, the author develops a method to learn from randomly sampled (off-grid) patches, where each patch can lie at any position in the image. This leads to a regression-based objective that models the relative relationships between randomly sampled patches based solely on their contents.

Figure: Overview of the PART sampling and objective. Source: Ayouhi et al. (2024)

The author introduces PART, a Pairwise Relative Translation pre-training method that predicts relative translations between randomly sampled patches. Concretely, it is a regression task to predict the translation (Δx, Δy) between each pair of patches.

During random sampling, parts of the image are masked out. Also, some information about each patch’s spatial frequency is masked by resizing all samples to the patch size. The pretext task is set up such that the ViT model consumes images with incomplete information.

A pair of patches (reference, target) is sampled from the image at random positions. The goal is to learn the underlying translation between any pair of patches, i.e., the translation that maps the reference patch onto the target patch. Predicting the relative translation is crucial because information about the original pixel space is lost after resizing to a uniform patch size.
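
A small sketch of how such (reference, target) pairs and their relative-translation targets could be constructed; the patch-extent range, the resizing, and the normalisation by image size are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_patch_pair(img: torch.Tensor, patch_size: int = 16):
    """Sample a (reference, target) pair of off-grid patches and the
    relative translation that maps the reference position to the target.

    img: (C, H, W). Both patches are resized to patch_size, discarding scale
    cues, so a model must infer the translation from content alone.
    """
    c, h, w = img.shape

    def random_patch():
        ph, pw = torch.randint(8, 64, (2,))                # random patch extent
        y = torch.randint(0, h - int(ph), (1,)).item()
        x = torch.randint(0, w - int(pw), (1,)).item()
        crop = img[:, y:y + int(ph), x:x + int(pw)].unsqueeze(0)
        crop = F.interpolate(crop, size=(patch_size, patch_size),
                             mode='bilinear', align_corners=False)
        return crop.squeeze(0), x, y

    ref, x1, y1 = random_patch()
    tgt, x2, y2 = random_patch()
    target = torch.tensor([(x2 - x1) / w, (y2 - y1) / h])  # normalised (Δx, Δy)
    return ref, tgt, target
```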

Figure: Illustration of PART on a sample image. Source: Ayouhi et al. (2024)

The author also introduces a cross-attention projection head module that distributes information between all patch representations and enables the model to focus on predicting the relative translation only for a subset of patch pairs.

SIGMA: Sinkhorn-Guided Masked Video Modeling — Presented by Mohammadreza Salehi

For an overview of recent methods applying self-supervised techniques to video, refer to our article.

Current paradigms for self-supervised video pre-training focus on predicting small spatiotemporal units, analogous to tokens in text. However, this is a key limiting factor: unlike words or subwords in the language domain, these small units or patches do not represent individual semantic units, so the model is forced to reconstruct them by learning low-level features.

To alleviate this, the authors propose a new framework wherein the typically predefined reconstruction target space can be learned alongside the video model. A projection network is introduced, which embeds both the visible and masked portions of the video, yielding deep feature reconstruction targets. The deep features of spatiotemporal units are regularised by uniform optimal transport across clusters. This acts as a high-entropy regularisation constraint and enforces that similar features are assigned to the same centroid, infusing semantic meaning into the feature space.

Figure: Overview of the proposed SIGMA pipeline. Source: Salehi et al. (2024)

Project Page | Code

These cluster assignments and centroids are learned online using the fast Sinkhorn-Knopp algorithm, yielding feature pseudo-labels as targets. The loss objective is a symmetric prediction task, where the features from each branch (the video model and the projection network) cross-predict the cluster assignment of the other.
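
For reference, here is a compact sketch of the Sinkhorn-Knopp normalisation commonly used for online clustering in self-supervised learning (SwAV-style); SIGMA's exact variant may differ in details such as the number of iterations and the temperature.

```python
import torch

@torch.no_grad()
def sinkhorn_knopp(scores: torch.Tensor, n_iters: int = 3, eps: float = 0.05) -> torch.Tensor:
    """Turn feature-to-prototype scores (B, K) into a soft assignment matrix
    whose prototypes are used roughly uniformly across the batch."""
    q = torch.exp(scores / eps).t()        # (K, B)
    q /= q.sum()
    k, b = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)    # rows: enforce uniform prototype usage
        q /= k
        q /= q.sum(dim=0, keepdim=True)    # columns: each sample's assignment sums to 1/B
        q /= b
    return (q * b).t()                     # (B, K), each row sums to 1
```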

Figure: Overview of the proposed SIGMA architecture. Source: Salehi et al. (2024)

Instead of training an asymmetric encoder-decoder model to predict masked pixel values from video frames, the authors change the target space to a feature space. A new projection network embeds the spatiotemporal units from the video frames into intermediate features, which then serve as the targets for masked prediction. However, since the encoder-decoder model and the projection network are jointly optimised, a trivial solution emerges wherein all spatiotemporal units are mapped to the same feature.

To avoid this, the authors constrain the feature space to a limited set of clusters. To achieve this in an online fashion, the features are mapped to a set of learnable prototype vectors. Due to the limited number of prototypes, similar and nearby space-time tubes are assigned to similar prototypes, infusing semantic spatial and temporal meaning into the feature space. The Sinkhorn algorithm generates pseudo-labels, which are then used as targets for the video model. The goal of the two models is then to predict each other's cluster assignments based on the projection of their features onto the shared prototypes.

Results: Comparison with classic video modeling methods. Source: Salehi et al. (2024)

UNIC: Universal Classification Models via Multi-teacher Distillation — Presented by Mert Bulent Sariyildiz

For an overview of the recent Knowledge Distillation techniques, refer to our article.

The authors aim to learn a universal encoder capable of generalization across a broad range of classification and dense prediction tasks. That way, a single pre-trained encoder can be used for several tasks with a small linear classifier per task.

They propose a new multi-teacher distillation strategy that relies on multiple specialised teachers to train an encoder that surpasses each teacher on its respective task. They also modify the projectors so that they propagate the signal directly from intermediate layers to the distillation loss, and they propose a new strategy for balancing the influence of the teachers in this multi-teacher scenario.

The Multi-Teacher Setup

Figure: Overview of the proposed Multi-Teacher Knowledge Distillation Setup. Source: Sarıyıldız et al. (2024)

The authors extend the single-teacher distillation framework to multiple teachers. Each teacher encodes an image into an intermediate representation, and the goal is to train the student model to perform well on all the tasks the teachers are good at. For each teacher, a projection head transforms the student's representation into a teacher-specific one. These heads are dropped after distillation and are thus expendable. Training uses a combination of cosine and smooth-L1 losses.
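
A small sketch of such a per-teacher distillation loss, combining a cosine term with a smooth-L1 term; the relative weighting of the two terms is an assumption.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_proj: torch.Tensor, teacher_feat: torch.Tensor,
                      cos_weight: float = 1.0, l1_weight: float = 1.0) -> torch.Tensor:
    """Match the student's projected features to one teacher's features using
    a cosine term (direction) plus a smooth-L1 term (magnitude)."""
    cos_term = 1.0 - F.cosine_similarity(student_proj, teacher_feat, dim=-1).mean()
    l1_term = F.smooth_l1_loss(student_proj, teacher_feat)
    return cos_weight * cos_term + l1_weight * l1_term
```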

Projector Ladder

This basic setup allows teacher-specific parameters to be injected during distillation. Recent literature has also shown that using intermediate representations for knowledge distillation improves overall performance; however, adding separate losses on intermediate representations would make the overall objective cumbersome. Instead, the authors propose augmenting the existing projector heads to receive input from intermediate layers.

Teacher Dropping

Instead of devising a loss-weighing algorithm, the authors propose to “drop out” the losses for a subset of the teachers. This is based on the absolute magnitudes of the losses at the image level. The teacher with the maximum magnitude is never dropped, while all other teachers can be dropped with some probability.
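
A sketch of this dropping rule for a single image: the teacher with the largest loss is always kept, and every other teacher's loss is zeroed out with some probability (the probability value here is an assumption).

```python
import torch

def teacher_dropping(losses: torch.Tensor, drop_prob: float = 0.5) -> torch.Tensor:
    """Zero out the losses of a random subset of teachers for one image.

    losses: (num_teachers,) per-teacher loss magnitudes for a single image.
    The teacher with the maximum loss is never dropped.
    """
    keep = torch.rand_like(losses) > drop_prob   # random keep/drop decisions
    keep[losses.argmax()] = True                 # always keep the hardest teacher
    return losses * keep

# Usage: total = teacher_dropping(torch.tensor([0.9, 0.2, 0.4])).sum()
```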

Results: Relative gain on using UNIC. Source: Sarıyıldız et al. (2024)

SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery — Presented by Sarah Rastegar

Code Available Here: SarahRastegar/SelEx

Figure: Comparison of traditional contrastive learning with self-expertise. Source: Rastegar et al. (2024)

Building on prior work in Generalized Category Discovery, the authors propose a new approach that combines contrastive learning with pseudo-labelling to uncover novel categories through “self-expertise.” They introduce a general Bayesian network framework for discovering novel categories: for a given dataset, some data points are labelled while others are not, and the possible labels for the unlabelled points span both known and novel categories.

Figure: The Bayesian Network framework used for category discovery. x represents distinct samples or alternative perspectives of the same sample, c represents the associated ground truth category variables, z represents the latent representation of the model for category variables, and y is the ground truth label. Source: Rastegar et al. (2024)

Contrastive training aims to estimate the distribution of the ground-truth category random variables by minimizing the KL divergence between the actual and estimated distributions.

In supervised self-expertise, one of the category variables is assumed to be accessible, while in unsupervised self-expertise, the distribution has to be approximated from the inputs alone.

Figure: The proposed self-expertise for generalized category discovery process. Source: Rastegar et al. (2024)

They start with pseudo-labels generated by a hierarchical semi-supervised K-means algorithm and then refine them using unsupervised self-expertise to reformulate the target matrix.

Figure: The distinction between target matrices of unsupervised contrastive learning and self-expertise. Source: Rastegar et al. (2024)

Conclusion

The ECCV 2024 workshop on “Self-Supervised Learning: What is Next?” showcased various approaches addressing key challenges in data efficiency, model interpretability, and generalization across diverse domains.

Saurav,

Machine Learning Advocate Engineer

lightly.ai