Self-Supervised Learning for Videos

Self-supervised learning has emerged as a compelling alternative to supervised learning in recent years. It has been shown to beat supervised pre-training on image classification benchmarks and is a great option when annotating or labeling data is too expensive. However, its impact and performance on videos still need to be investigated, since videos are inherently multidimensional and complex: they have both spatial and temporal dimensions.

In this article, we will briefly overview Masked Autoencoders as applied to images in a self-supervised setting, discuss why videos need special attention, and review the VideoMAE architecture and its follow-up work.

Brief Overview of Image Masked Autoencoders (ImageMAE)


Figure: Overview of the ImageMAE architecture. Source: https://arxiv.org/abs/2111.06377

The ImageMAE architecture, introduced by He et al. in Masked Autoencoders Are Scalable Vision Learners, 2022, was inspired by the success of masked modeling in NLP and is based on a straightforward idea:

An image is divided into a set of non-overlapping patches, and a random subset of them is masked. The visible patches are fed into an encoder, which projects them into a latent representation space. A lightweight decoder then operates on these latent representations together with the mask tokens to reconstruct the original image.

Two key design principles of this approach are:

  • An asymmetric design: the encoder operates only on the visible tokens, while the decoder operates on the latent representations together with the mask tokens.
  • A lightweight decoder that reconstructs the image.

Reconstruction Loss: The decoder is tasked with reconstructing the input image and therefore predicts pixel values for each masked patch. A natural loss formulation thus emerges: the Mean Squared Error (MSE) between the reconstructed and original images in pixel space, computed only on the masked patches. The authors also report improved performance when the reconstruction target is the normalized pixel values of each masked patch (normalized by the patch's mean and standard deviation).

Equation: Reconstruction loss, $L = \frac{1}{|M|} \sum_{p \in M} \lVert \hat{x}_p - x_p \rVert_2^2$, where $M$ is the set of masked patches, $x_p$ are the original pixel values of patch $p$, and $\hat{x}_p$ is the reconstruction.
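
To make this concrete, below is a minimal PyTorch sketch of a masked-patch reconstruction loss with optional per-patch target normalization. The tensor names and shapes are illustrative assumptions, not the authors' implementation.

```python
import torch

def masked_reconstruction_loss(pred, target, mask, norm_targets=True, eps=1e-6):
    """MSE computed only on masked patches.

    pred, target: (batch, num_patches, patch_dim) predicted and original
        pixel values, flattened per patch.
    mask: (batch, num_patches), 1 for masked patches, 0 for visible ones.
    """
    if norm_targets:
        # Use per-patch normalized pixels as the reconstruction target,
        # which the MAE paper reports improves representation quality.
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + eps).sqrt()

    loss = (pred - target) ** 2      # per-pixel squared error
    loss = loss.mean(dim=-1)         # mean over pixels within each patch
    # Average only over the masked patches (the set M above).
    return (loss * mask).sum() / mask.sum()
```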

Moreover, based on extensive ablations, the authors find that this method works well even with high masking ratios (e.g., 75%) and makes it efficient to train high-capacity models that generalize well. The authors use Vision Transformers (ViTs) as the encoders. Because the encoder only operates on the visible patches (~25% of the image), very large encoders can be trained with only a fraction of the compute and memory.
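
To see where the efficiency comes from, here is a minimal sketch (not the authors' implementation) of random masking followed by encoding only the visible tokens. The toy encoder, patch count, and embedding size are placeholder assumptions.

```python
import torch
import torch.nn as nn

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens; return them plus a binary mask."""
    batch, num_patches, dim = tokens.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    # Random permutation per sample; the first `num_keep` indices stay visible.
    noise = torch.rand(batch, num_patches, device=tokens.device)
    ids_keep = noise.argsort(dim=1)[:, :num_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))

    # Binary mask in the original patch order: 1 = masked, 0 = visible.
    mask = torch.ones(batch, num_patches, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask

# Toy encoder stand-in: only the ~25% visible tokens pass through it,
# which is where the compute and memory savings come from.
tokens = torch.randn(2, 196, 768)   # 14x14 patches of a 224x224 image
encoder = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
visible, mask = random_masking(tokens, mask_ratio=0.75)
latent = encoder(visible)           # shape (2, 49, 768) instead of (2, 196, 768)
```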

Figure: Validation Accuracy of ImageMAE on ImageNet-1k with varying masking ratios. Source: https://arxiv.org/abs/2111.06377

Why Video Requires Special Attention

Temporal Redundancy

Videos are often densely captured at a high frame rate, so their semantics vary slowly over time. This phenomenon, termed temporal redundancy, causes two issues when applying masked modeling to videos:

  • Keeping the original frame rate for pre-training is inefficient, since consecutive frames are highly correlated and mostly redundant for representation learning.
  • Under standard masking ratios, reconstruction becomes trivial: because of temporal redundancy, the masked content is mostly the same scene repeated in every frame.

Temporal Correlation

Videos can be seen as the evolution of a scene over time, with strong correspondence between consecutive frames. This correlation leads to information leakage during the reconstruction process.

Thus, for a given masked part of the video (termed a cube), it is easy to find an unmasked, highly correlated copy in adjacent frames. This can lead the model to learn “shortcut” features that do not generalize to new scenes.

The VideoMAE Architecture

To overcome these challenges in applying masked modeling to videos, Tong et al. introduced a new method in VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, 2022. VideoMAE is a simple strategy that not only improves pre-training performance but also greatly reduces the computational cost thanks to the asymmetric encoder-decoder architecture. Models pre-trained with VideoMAE significantly outperform those trained from scratch or pre-trained with contrastive learning methods.


Figure: VideoMAE architecture. Source: https://arxiv.org/abs/2203.12602

VideoMAE is a simple extension of ImageMAE with three key design choices:

  • Temporal Downsampling: The authors use a strided temporal sampling strategy for efficient Self-Supervised Video Pre-training (SSVP). Formally, a clip of consecutive frames is first randomly sampled from the original video, and strided temporal sampling then compresses it to a much smaller number of frames.
  • Cube Embedding: The authors use a joint space-time cube embedding, where each cube is a 3D block spanning height, width, and time. Each cube is mapped to one token embedding. This decreases the spatial and temporal dimensions of the input and helps alleviate temporal redundancy.
  • Tube Masking with an Extremely High Ratio: To deal with temporal redundancy and information leakage, the authors use extremely high masking ratios (90–95%). The tube masking strategy extends each mask over the entire temporal axis, so the temporal neighbors of a masked cube are always masked as well (see the sketch after this list).
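
The sketch below illustrates how cube embedding and tube masking could be combined. The cube size (2x16x16), clip length, and embedding dimension are illustrative choices, not necessarily the exact VideoMAE configuration.

```python
import torch
import torch.nn as nn

# Cube embedding: each 2x16x16 space-time cube becomes one token.
cube_embed = nn.Conv3d(in_channels=3, out_channels=768,
                       kernel_size=(2, 16, 16), stride=(2, 16, 16))

video = torch.randn(1, 3, 16, 224, 224)     # (batch, channels, frames, H, W)
tokens = cube_embed(video)                  # (1, 768, 8, 14, 14)
t, h, w = tokens.shape[2:]
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 8*14*14, 768) token sequence

def tube_mask(t, h, w, mask_ratio=0.9):
    """Sample one spatial mask and repeat it along time, so the temporal
    neighbors of a masked cube are always masked too."""
    num_spatial = h * w
    num_masked = int(num_spatial * mask_ratio)
    spatial_mask = torch.zeros(num_spatial)
    spatial_mask[torch.randperm(num_spatial)[:num_masked]] = 1.0
    return spatial_mask.repeat(t)           # (t*h*w,), 1 = masked

mask = tube_mask(t, h, w, mask_ratio=0.9)
visible_tokens = tokens[:, mask == 0]       # only ~10% of cubes reach the encoder
```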

The following Gradio Space visualizes the masking process of VideoMAE.

Joint Space-Time Attention

The authors use joint space-time attention in the ViT backbone to learn representations across tokens: instead of restricting attention to the spatial domain within each frame, every token attends to all other tokens across both space and time, which lets the model capture temporal dependencies across frames. The downside is that self-attention is computed for all pairs of tokens, which is computationally costly given the large number of patches in a video clip; the extremely high masking ratio mitigates this, since the encoder only operates on the small set of visible tokens.

Figure: Comparison of various Space-Time attention policies. Blue patches denote the query patch while the other non-blue colors are the self-attention space-time neighborhood for each scheme. Source: https://arxiv.org/abs/2102.05095

Related works have investigated the impact of space-time attention variants on video understanding tasks. Bertasius et al. in Is Space-Time Attention All You Need for Video Understanding?, 2021 proposed a more efficient architecture for spatiotemporal attention, termed Divided Space-Time Attention, where temporal attention and spatial attention are applied separately, one after the other.
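
The following is a rough sketch of how divided space-time attention can be organized by reshaping the token grid, assuming tokens are ordered frame by frame; the layer choices and dimensions are illustrative and not the TimeSformer implementation.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Temporal attention across frames, then spatial attention within each frame."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, t, hw):
        b, n, d = x.shape                       # n == t * hw, tokens ordered frame-major
        # Temporal attention: each spatial location attends across the t frames.
        xt = x.reshape(b, t, hw, d).transpose(1, 2).reshape(b * hw, t, d)
        xt = self.temporal(xt, xt, xt)[0]
        x = xt.reshape(b, hw, t, d).transpose(1, 2).reshape(b, n, d)
        # Spatial attention: each frame attends within its own hw tokens.
        xs = x.reshape(b * t, hw, d)
        xs = self.spatial(xs, xs, xs)[0]
        return xs.reshape(b, n, d)

attn = DividedSpaceTimeAttention()
tokens = torch.randn(2, 8 * 196, 768)   # 8 frames of 14x14 tokens
out = attn(tokens, t=8, hw=196)         # two smaller attentions instead of one joint one
```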

Results From VideoMAE

The authors conduct extensive experiments and report that VideoMAE is a data-efficient learner for Self-Supervised Video Pre-training. Notably, even with only 3.5k training clips, VideoMAE achieves solid accuracy on the HMDB51 dataset, demonstrating its effectiveness with limited data.

Figure: Comparison of VideoMAE (proposed method) and MoCo v3 (leading contrastive learning method).

  • Compared to training from scratch and to leading contrastive methods, VideoMAE performs significantly better, even when pre-trained on far fewer video clips.

Figure: VideoMAE vs MoCo v3 in terms of efficiency and effectiveness on Something-Something V2

  • VideoMAE outperforms MoCo v3 in terms of both fine-tuning and linear probing accuracy with a 3.2x speedup.

Follow Up Work

VideoMAEv2

Figure: Overview of VideoMAEv2. Source: https://arxiv.org/abs/2303.16727

In a follow-up work, Wang et al. propose a dual masking strategy for VideoMAE in VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking, 2023. They further increase the efficiency of VideoMAE by applying a masking map to the decoder as well: the model then learns to reconstruct only the subset of pixel cubes selected by a running cell masking strategy.

This enables large-scale VideoMAE pre-training under a limited computational budget: the decoder mask reduces the decoder's input length for higher efficiency, while the reconstructed subset still retains information comparable to full reconstruction.
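
As a rough illustration of the idea, the sketch below draws an encoder mask and then a decoder mask over the remaining cubes; a random subset is used here as a stand-in for the paper's running cell masking, and all ratios are assumptions.

```python
import torch

def dual_masks(num_tokens, encoder_mask_ratio=0.9, decoder_keep_ratio=0.5):
    """Return indices of cubes the encoder sees and of masked cubes the
    decoder is asked to reconstruct."""
    perm = torch.randperm(num_tokens)
    num_visible = int(num_tokens * (1 - encoder_mask_ratio))
    visible_ids = perm[:num_visible]
    masked_ids = perm[num_visible:]

    # The decoder reconstructs only a subset of the masked cubes, which
    # shortens its input sequence. A random subset stands in here for the
    # paper's running cell masking.
    num_decode = int(len(masked_ids) * decoder_keep_ratio)
    decode_ids = masked_ids[torch.randperm(len(masked_ids))[:num_decode]]
    return visible_ids, decode_ids

visible_ids, decode_ids = dual_masks(num_tokens=1568)
print(len(visible_ids), len(decode_ids))    # 156 cubes encoded, 706 reconstructed
```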

As opposed to VideoMAE, which requires pre-training an individual model for each dataset, the authors aim to learn a universal pre-trained model that can be transferred to different downstream tasks.

Figure: Encoder-only masking vs Dual masking. Source: https://arxiv.org/abs/2303.16727

  • This dual masking strategy outperforms encoder-only masking in terms of both performance and training time.


MGMAE

Figure: Comparison of various masking strategies. Source: https://arxiv.org/abs/2308.10794

Huang et al. in MGMAE: Motion Guided Masking for Video Masked Autoencoding, 2023 introduced a motion-guided masking strategy that explicitly incorporates motion information to build a temporally consistent masking volume. This is based on the insight that motion is a general and unique prior in video, which should be taken into account during masked pre-training.

The optical flow representation explicitly encodes the movement of each pixel from the current frame to the next one. This is then used to align masking maps between adjacent frames to build consistent masking volumes across time. In particular, the authors use an online and lightweight optical flow estimator to capture motion information.

Figure: Overview of MGMAE. Source: https://arxiv.org/abs/2308.10794

First, a masking map is randomly generated at the base frame (by default, the middle frame). The estimated optical flow is then used to warp this initial masking map to adjacent frames. Through these successive warping operations, a temporally consistent masking volume is built for all frames in the video. Based on this masking volume, the set of tokens visible to the MAE encoder is sampled in a frame-wise manner with top-k selection.
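
A simplified sketch of warping a token-level masking map with a dense flow field is shown below, using nearest-neighbor sampling via grid_sample. This is a backward-warping approximation for illustration only; MGMAE's actual warping procedure and its online optical flow estimator differ.

```python
import torch
import torch.nn.functional as F

def warp_mask_with_flow(mask, flow):
    """Warp an (H, W) masking map with a (2, H, W) flow field of per-pixel
    (dx, dy) displacements: each output location samples the mask at
    (x + dx, y + dy)."""
    h, w = mask.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    src_x = xs + flow[0]
    src_y = ys + flow[1]
    # Normalize sampling coordinates to [-1, 1] as grid_sample expects.
    grid = torch.stack(((src_x / (w - 1)) * 2 - 1,
                        (src_y / (h - 1)) * 2 - 1), dim=-1)
    warped = F.grid_sample(mask[None, None].float(), grid[None],
                           mode="nearest", align_corners=True)
    return warped[0, 0]

# Toy example: warp a random 14x14 token mask with a uniform one-token flow.
mask = (torch.rand(14, 14) > 0.9).float()
flow = torch.ones(2, 14, 14)
warped = warp_mask_with_flow(mask, flow)
```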

Figure: MGMAE vs VideoMAE on SSV2. Source: https://arxiv.org/abs/2308.10794

With improved accuracy, MGMAE proves to be a more effective video representation learner. It benefits greatly from the harder task constructed with the motion-guided masking strategy.

ARVideo

Ren et al. in ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning, 2024 address the limitations of cube embeddings by proposing an autoregressive method termed ARVideo. Cube embeddings such as those adopted by VideoMAE and MGMAE often fail to encapsulate the rich semantics of the video. This is primarily because:

  • video tokens are dimensionally limited, and
  • video inherently lacks a sequential order in its spatial dimensions, although it retains one along the temporal axis.

Figure: Comparison of video tokens and various clusters

To address these limitations, the authors propose a novel autoregressive paradigm with two key design elements:

  1. Autoregressive video tokens are organized into spatiotemporal video clusters thus differentiating this method from conventional single-dimensional strategies like spatial video clusters or temporal video clusters. This improves semantic representation by aggregating more contextually relevant multidimensional information.
  2. They adopt a randomized spatiotemporal prediction order to facilitate learning from multi-dimensional data, addressing the limitations of a handcrafted spatial-first or temporal-first sequence order. This random sequence order empirically yields significantly stronger results, suggesting that effectively capturing the inherent multidimensionality of video data is crucial for autoregressive modeling.

Figure: ARVideo architecture

They extend the Generative Pretrained Transformer (GPT) framework, which autoregressively predicts the next element given all preceding ones by minimizing the negative log-likelihood with respect to the model parameters. However, simply extending this framework to videos faces significant challenges, primarily due to the added temporal dimension. Moreover, pixels as autoregressive elements lack the semantic richness of words in language, which further necessitates pixel-grouping strategies to enhance representation learning.
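
In its standard form, this autoregressive objective can be written as follows, where $x_i$ denotes the $i$-th element in the chosen prediction order, $N$ the sequence length, and $\theta$ the model parameters:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_1, \ldots, x_{i-1}\right)$$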

ARVideo strategically groups spatially neighboring and temporally adjacent video tokens into non-overlapping spatiotemporal clusters. A random rasterization approach then scrambles the order of these clusters during autoregressive pre-training. Such flexibility in the prediction order not only captures the inherent multidimensionality of video data more effectively but also fosters a richer, more comprehensive video representation.
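
The sketch below groups a frame-major token grid into non-overlapping spatiotemporal clusters and shuffles their prediction order. The cluster size (2x2x2 tokens) and grid dimensions are illustrative assumptions, not the exact ARVideo configuration.

```python
import torch

def spatiotemporal_clusters(tokens, t, h, w, ct=2, ch=2, cw=2):
    """Group a (t*h*w, dim) token sequence into non-overlapping ct x ch x cw
    spatiotemporal clusters and return them in a random prediction order."""
    dim = tokens.shape[-1]
    x = tokens.reshape(t, h, w, dim)
    # Carve the token grid into clusters along time, height and width.
    x = x.reshape(t // ct, ct, h // ch, ch, w // cw, cw, dim)
    x = x.permute(0, 2, 4, 1, 3, 5, 6)           # (Tc, Hc, Wc, ct, ch, cw, dim)
    clusters = x.reshape(-1, ct * ch * cw, dim)  # one row per cluster

    # Random rasterization: scramble the order in which clusters are predicted.
    order = torch.randperm(clusters.shape[0])
    return clusters[order], order

tokens = torch.randn(8 * 14 * 14, 768)           # an 8x14x14 grid of video tokens
clusters, order = spatiotemporal_clusters(tokens, t=8, h=14, w=14)
print(clusters.shape)                            # (196, 8, 768) with 2x2x2 clusters
```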

Figure: SOTA methods on Kinetics-400

When trained with a ViT-B backbone, ARVideo competitively attains 81.2% on Kinetics-400 while demonstrating higher training efficiency: it trains 14% faster and requires 58% less GPU memory than VideoMAE.

Conclusion

In conclusion, self-supervised learning for video understanding has made significant strides in recent years, addressing the unique challenges posed by the multidimensional nature of video data. From the foundational work of VideoMAE to innovative approaches like VideoMAEv2, MGMAE, and ARVideo, researchers have tackled issues such as temporal redundancy, information leakage, and the need for more efficient and effective representation learning.

The methods presented in this post demonstrate how self-supervised learning can be adapted to the video domain through strategies that exploit the spatio-temporal structure of videos. This not only yields models that generalize better but also significantly reduces the compute required to train video models.