Self-Supervised Learning for Autonomous Driving

In the past few years, Self-Supervised Learning has been shown to perform on par with, and in some instances better than, supervised learning. This has enabled its use in industries where data annotation is incredibly expensive and time-consuming, such as medical imaging and autonomous driving.

In this article, we will cover some recent methods showcasing the use of Self-Supervised Learning in various tasks in the autonomous driving domain.

Self-Supervised Learning for Monocular Depth Estimation

Monocular Depth Estimation is a fundamental problem which involves estimating the depth of various objects in a given scene using a single RGB (monocular) image. Most methods can be broken down into two fundamental classes:

  • Regression-based methods: These models regress a continuous depth value for every pixel of a frame. Such methods rely on large volumes of difficult-to-obtain ground-truth depth and pose measurements.
  • Window-based methods: These split the frame into several bins or windows to reduce the task's complexity.

On the other hand, Self-Supervised Learning doesn't rely on ground truth data. It provides a way to learn latent variables such as motion by incorporating geometric and temporal constraints to infer a given scene's structure effectively.

Tackling low-resolution images

One of the first hurdles for Self-Supervised Learning applied to depth estimation is learning from low-resolution images. This limitation was typically due to the large memory requirements of the models and the corresponding Self-Supervised loss objective. High-resolution images are essential for tasks like depth estimation for autonomous driving since they enable robust long-term perception, prediction, and planning. SuperDepth by Pillai et al. (2019) proposed a simple solution inspired by the super-resolution literature.

TL;DR: They propose to use subpixel-convolutional layers to super-resolve disparities from lower-resolution outputs, replacing the deconvolution or resize-convolution up-sampling layers.

Modelling depth estimation is challenging since it is typically an ill-posed inverse problem: many 3D scenes can correspond to the same 2D image. Because depth estimation involves determining the depth of various objects in a given scene, this inherent ambiguity of a single image leads to inaccuracies. One way to mitigate it is to impose temporal or geometric heuristic constraints.

Earlier methods used deconvolutions, resize-convolutions or simple interpolation methods from traditional image processing, such as bilinear or nearest neighbours, to perform upsampling. In SuperDepth, the authors replace interpolation layers while learning relevant low-resolution convolutional features using a sub-pixel convolutional layer for depth super-resolution that performs high-quality disparity synthesis. The final convolutional output is then re-mapped to the target depth resolution via a pixel re-arrange operation, resulting in an efficient sub-pixel convolutional operation. This super-resolved depth operates at higher resolutions and tends to reduce ambiguities in a Self-Supervised photometric loss.
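
To make the idea concrete, below is a minimal sketch of a sub-pixel convolutional up-sampling block in PyTorch, built from a standard convolution followed by a PixelShuffle (pixel re-arrange) step; the channel counts and layer sizes are illustrative assumptions, not SuperDepth's exact configuration.

```python
import torch
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    """Minimal sketch of sub-pixel convolutional up-sampling
    (illustrative sizes, not SuperDepth's exact configuration)."""

    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        # Predict scale**2 times more channels at low resolution...
        self.conv = nn.Conv2d(in_channels, out_channels * scale ** 2,
                              kernel_size=3, padding=1)
        # ...then rearrange those channels into spatial positions (pixel re-arrange).
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

# Usage: super-resolve a low-resolution disparity feature map by 2x.
feat = torch.randn(1, 64, 96, 320)
up = SubPixelUpsample(64, 1, scale=2)
print(up(feat).shape)  # torch.Size([1, 1, 192, 640])
```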

Exploiting Latent Information

In Semantically-Guided Representation Learning for Self-Supervised Monocular Depth, Guizilini et al. (ICLR 2020) leverage semantic information to improve monocular depth prediction in a Self-Supervised way. In particular, they use pre-trained semantic segmentation networks to guide geometric representation learning and pixel-adaptive convolutions to learn semantic-dependent representations.

Figure: Proposed algorithm with semantically-guided feature learning. Source: Guizilini et al. (ICLR 2020)

In 3D Packing for Self-Supervised Monocular Depth Estimation, Guizilini et al. (CVPR 2020) propose PackNet, a new CNN architecture with packing and unpacking blocks that jointly leverage 3D convolutions to learn representations that maximally propagate dense appearance and geometric information, along with a loss that can optionally leverage the camera's velocity when available. These preserve and process spatial information in the features of encoding and decoding layers. They also show that, by simply using the instantaneous velocity of the camera during training, they can learn a scale-aware depth and pose model, alleviating the impractical need to use LiDAR ground-truth depth measurements at test time.

Figure: Proposed Packing and Unpacking blocks. Source: Guizilini et al. (CVPR 2020)

  • Packing Block: This block folds the spatial dimensions of convolutional feature maps into extra feature channels via a Space2Depth (Shi et al. 2016) operation (reduced resolution but lossless and invertible). The block is then expanded using a 3D convolutional layer and flattened before a final 2D convolutional contraction layer. This module thus learns to compress key spatial details that need to be preserved for high-resolution depth decoding (see the sketch after this list).
  • Unpacking Block: This block decompresses and unfolds using a 2D convolutional layer to produce the required number of feature channels for a following 3D convolutional layer, which expands back the compressed spatial features. These unpacked features are then converted to spatial details via a reshape and Depth2Space operation (Shi et al. 2016).
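
As a rough illustration of the packing idea, the sketch below folds space into channels with a pixel-unshuffle (Space2Depth) step, expands the result with a 3D convolution, flattens it, and contracts the channels with a 2D convolution. Channel counts, kernel sizes, and the 3D expansion factor are assumptions for illustration, not PackNet's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PackingBlock(nn.Module):
    """Minimal sketch of a PackNet-style packing block (assumed sizes)."""

    def __init__(self, in_channels, out_channels, r=2, d=8):
        super().__init__()
        self.r = r
        # 3D conv expands the folded sub-pixel structure by a factor d.
        self.conv3d = nn.Conv3d(1, d, kernel_size=3, padding=1)
        # 2D conv contracts the flattened channels to the desired output width.
        self.conv2d = nn.Conv2d(in_channels * r * r * d, out_channels,
                                kernel_size=3, padding=1)

    def forward(self, x):
        x = F.pixel_unshuffle(x, self.r)   # Space2Depth: (B, C*r^2, H/r, W/r), lossless
        x = x.unsqueeze(1)                 # add a unit axis for the 3D convolution
        x = self.conv3d(x)                 # (B, d, C*r^2, H/r, W/r)
        x = x.flatten(1, 2)                # back to 2D feature maps
        return self.conv2d(x)              # channel contraction

x = torch.randn(1, 32, 128, 128)
print(PackingBlock(32, 64)(x).shape)  # torch.Size([1, 64, 64, 64])
```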

With the increasing popularity of Vision Transformers and the demand for zero-shot scale-aware models, several works involving transformers and variational models also emerged. One such approach is ZeroDepth, which is robust to the geometric domain gap and thus capable of generating metric predictions across different datasets.

Figure: Proposed ZeroDepth Framework. Source: Guizilini et al. (CVPR 2023)

  • They employ input-level geometric embeddings to jointly encode camera parameters and image features, which enables the network to reason over the physical size of objects and learn scale priors (a minimal sketch follows this list).
  • They also decouple the encoding and decoding stages via a learned global variational latent representation. Once conditioned, this latent representation can be sampled and decoded to generate multiple probabilistic predictions.
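
As a toy illustration of input-level geometric embeddings, the sketch below computes per-pixel unit viewing-ray directions from the camera intrinsics and returns them as extra channels to be concatenated with image features. ZeroDepth's actual embedding is richer than this, so treat it only as the simplest variant of the idea.

```python
import torch

def ray_embeddings(K, height, width):
    """Per-pixel unit viewing-ray directions from a 3x3 intrinsics matrix K.
    Returns a (3, H, W) tensor to concatenate with image features."""
    ys, xs = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                            torch.arange(width, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=-1)       # (H, W, 3) homogeneous pixel coords
    rays = pix @ torch.linalg.inv(K).T              # back-project through K^-1
    rays = rays / rays.norm(dim=-1, keepdim=True)   # normalize to unit directions
    return rays.permute(2, 0, 1)
```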

Another important line of work in Self-Supervised monocular depth estimation is Monodepth by Godard et al. (CVPR 2017). Moving away from ground truth data, they propose using epipolar geometry constraints to generate disparity images by training a network with a novel training loss that enforces consistency between the disparities produced relative to the left and right images. In particular, the authors propose to enforce similarities between binocular images with the intuition that if we learn to reconstruct one from the other, the model learns something about the 3D shape of the scene being imaged.

At training time, the model can access two images corresponding to the left and right colour images from a calibrated stereo pair captured at the same time. But instead of trying to predict the depth at every pixel directly, the model learns dense correspondence fields, which, when applied to either image, helps the model reconstruct the other one. Given this learnt dense correspondence field d, the baseline distance between the cameras b and the focal length f, the depth can be trivially calculated as (bf/d).
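
As a quick worked example of the depth = b·f/d relation (the baseline and focal length below are illustrative, roughly KITTI-like values, not tied to any specific dataset setup):

```python
def disparity_to_depth(disparity_px, baseline_m=0.54, focal_px=721.0, eps=1e-6):
    """Convert a predicted stereo disparity (in pixels) into metric depth
    via depth = baseline * focal_length / disparity."""
    return baseline_m * focal_px / (disparity_px + eps)

# A 10-pixel disparity with a 0.54 m baseline and 721 px focal length
# corresponds to roughly 39 m of depth.
print(disparity_to_depth(10.0))  # ~38.9
```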

The monodepth(v1) network estimates depth by inferring the disparities that warp the left image to match the right one. The model generates the predicted image with backward mapping using a bilinear sampler, resulting in a fully differentiable image formation model. They experiment with various mappings and find that naively learning to generate the right image by sampling from the left produces disparities aligned with the right image. Moreover, generating the left view by sampling from the right image creates a left-view-aligned disparity map. Thus, they propose to train the network to predict the disparity maps for both views by sampling from the opposite input images.

Figure: Various Sampling Strategies used in Monodepth. Source: Godard et al. (CVPR 2017)

Going multi-modal

In Monodepth2, Godard et al. (ICCV 2019) build on their previous work with simple modifications that yield a model capable of using monocular video, stereo pairs, or both. In particular, they propose a novel appearance-matching loss to address the problem of occluded pixels that arises with monocular supervision. When computing the reprojection error from multiple source images (monocular video or stereo pairs), existing Self-Supervised methods average the reprojection error over all available source images. This causes problems for pixels that are visible in the target image but not in some of the source images, either because they fall out of view due to ego-motion at image boundaries or because they are occluded. Instead of averaging the photometric error over all source images, they take the per-pixel minimum.

Equation: Proposed per-pixel photometric loss. Source: Godard et al.
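
Written out, the per-pixel loss takes the minimum of the photometric error over the source frames rather than the average; a commonly used photometric error pe combines SSIM and L1 terms (the exact weighting alpha is an illustrative choice here):

```latex
L_p = \min_{t'} \, pe\!\left(I_t,\; I_{t' \to t}\right),
\qquad
pe(I_a, I_b) = \frac{\alpha}{2}\bigl(1 - \mathrm{SSIM}(I_a, I_b)\bigr) + (1 - \alpha)\,\lVert I_a - I_b \rVert_1
```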

Figure: Benefit of min. reprojection loss. Source: Godard et al.

They also observe that pixels that remain the same between adjacent frames in the sequence often indicate a static camera, an object moving at equivalent relative translation to the camera, or a low-texture region. This lets the network ignore objects moving at the same velocity as the camera and even ignore whole frames in monocular videos when the camera stops moving. Thus, they employ a simple auto-masking method that filters out pixels that do not change their appearance in the sequence from one frame to the next.
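
A minimal sketch of this auto-masking step, assuming the per-pixel error maps have already been reduced with the minimum over source frames:

```python
import torch

def automask(min_reprojection_err, min_identity_err):
    """Keep only pixels where warping a source frame into the target explains
    the target better than the un-warped source frame itself (identity
    reprojection). Pixels that look the same across frames (static camera,
    objects moving with the camera, low texture) fail this test and are
    excluded from the loss. Both inputs: (B, 1, H, W) photometric errors."""
    return (min_reprojection_err < min_identity_err).float()
```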

Figure: Effect of proposed auto-masking. Source: Godard et al.

Building on Monodepth, existing models use multi-scale depth prediction and image reconstruction to counteract the gradient locality of the bilinear sampler and to prevent the training objective from getting stuck in local minima. However, this tends to create 'holes' in large low-texture regions of the intermediate lower-resolution depth maps, as well as texture-copy artefacts. The authors therefore propose decoupling the resolution of the disparity maps from that of the colour images used to compute the reprojection error. This procedure is similar to matching patches, as low-resolution disparity values will warp an entire 'patch' of pixels in the high-resolution image.
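
A sketch of that decoupling, assuming the network emits disparity maps at several intermediate resolutions: each one is upsampled to the input resolution before the photometric error is computed, instead of downsampling the colour images.

```python
import torch.nn.functional as F

def upsample_disparities(multi_scale_disps, target_hw):
    """Upsample every intermediate (low-resolution) disparity map to the full
    input resolution so the reprojection error is always computed against the
    high-resolution colour images."""
    return [F.interpolate(d, size=target_hw, mode="bilinear", align_corners=False)
            for d in multi_scale_disps]
```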

Going Foundational

With the rise of foundational models, there has been a trend of moving away from metadata, such as camera intrinsics, towards a purely zero-shot approach to depth estimation. In “Depth Pro: Sharp Monocular Metric Depth in Less Than a Second”, Bochkovskii et al. (2024) present a model capable of synthesizing high-resolution depth maps with unparalleled sharpness and high-frequency details.

Figure: Comparison of DepthPro with other SOTA work. Source: Bochkovskii et al. (2024)

The key idea in this paper is to “apply plain vision transformer (ViT) encoders on patches extracted at multiple scales and fuse the patch predictions into a single high-resolution dense prediction in an end-to-end trainable model.”

Figure: DepthPro architecture. Source: Bochkovskii et al. (2024)

They use two ViT encoders, a multi-scale patch encoder (enabling scale invariance) and an image encoder (anchoring the patch predictions in a global context). They also employ transfer learning by using pre-trained vision transformers, thereby increasing the flexibility of the model design. Moreover, using a patch-based approach increases computational efficiency compared to scaling up the ViT to higher resolutions because of the inherent quadratic scaling of self-attention. They use a mix of real and synthetic datasets, which has been shown to increase generalization as measured by zero-shot accuracy (Ranftl et al. 2019).
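
A rough sketch of the multi-scale patching idea: the image is resized to several scales and split into fixed-size patches so a plain ViT with a fixed input resolution can be applied to every patch. The patch size and scale factors below are illustrative assumptions, not Depth Pro's exact settings.

```python
import torch
import torch.nn.functional as F

def multiscale_patches(image, patch=384, scales=(0.25, 0.5, 1.0)):
    """Split a (B, C, H, W) image into fixed-size patches at several scales.
    Assumes H and W are divisible by the patch size at every scale."""
    batches = []
    for s in scales:
        h, w = int(image.shape[-2] * s), int(image.shape[-1] * s)
        resized = F.interpolate(image, size=(h, w), mode="bilinear", align_corners=False)
        b, c = resized.shape[:2]
        nh, nw = h // patch, w // patch
        tiles = resized.reshape(b, c, nh, patch, nw, patch)
        tiles = tiles.permute(0, 2, 4, 1, 3, 5).reshape(-1, c, patch, patch)
        batches.append(tiles)  # shape per scale: (B * nh * nw, C, patch, patch)
    return batches

# e.g. a 1536x1536 input yields 1, 4, and 16 patches of 384x384 at the three scales.
```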

Self-Supervised Learning for Ego-Motion Estimation

Ego-motion estimation involves modelling the motion of a camera relative to scene data and is a fundamental task in perception, navigation, and planning. Most methods learn directly from raw data, typically using the proxy photometric loss as a supervisory signal.

In Two Stream Networks for Self-Supervised Ego-Motion Estimation, Ambrus et al. (2019) develop a novel two-stream network combining images and inferred depth for accurate camera ego-motion estimation. They show that performance in the Self-Supervised Structure-from-Motion (SfM) regime critically depends on the choice of the model architecture and the specific ego-motion optimisation as opposed to prior works that engineer the loss function to handle errors.

Figure: Proposed Two Stream Network architecture for Self-Supervised depth and camera ego-motion learning. Source: Ambrus et al. (2019)

Inspired by multi-task and multi-modal network architectures, the authors treat RGB images and the predicted monocular depth as two separate input modalities and design a two-stream architecture tailored for Self-Supervised ego-motion learning. They also show that the proposed network architecture can extract and appropriately fuse RGB-D information from each branch for accurate ego-motion estimation.

Rather than relying only on a single RGB input, they use a second modality by passing the estimated depth, along with the RGB, as inputs to the network (allowing the model to learn appearance and geometry features). Using the Euler parameterization, they then output a vector representing a 6-DOF transformation between the input frames.
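
For concreteness, here is a small sketch that converts such a 6-DOF vector into a 4x4 rigid-body transformation matrix; the Z-Y-X Euler-angle order is an assumed convention for illustration, not necessarily the one used in the paper.

```python
import numpy as np

def pose_vec_to_matrix(pose):
    """Turn a 6-DOF vector (tx, ty, tz, rx, ry, rz) with Euler angles (radians)
    into a 4x4 rigid-body transform. Z-Y-X angle order is an assumed convention."""
    tx, ty, tz, rx, ry, rz = pose
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # compose rotation
    T[:3, 3] = [tx, ty, tz]    # translation
    return T
```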

Self-Supervised Learning for Camera Self-Calibration

Camera calibration is another crucial problem, which involves inferring the geometric properties of the camera from visual input streams. It typically requires specialized calibration procedures and careful tuning. However, Self-Supervised Learning can bypass explicit calibration by inferring per-scene projection models that optimize a view-synthesis objective. A learning algorithm must be flexible enough to adapt to various camera lens types and unstructured environments (analogous to zero-shot performance).

The general framework for learning self-calibration is the same as for monocular depth estimation: a depth network predicts depth maps for the target images, while a pose network predicts the relative rigid-body transformation for a given number of frames. The two networks are trained jointly to minimize the reprojection error between the actual target image and a synthesised image created by projecting pixels from a context image (usually a preceding frame) onto the target image using the predicted depth map and ego-motion.
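
A condensed sketch of this reprojection step is below; a perspective pinhole model and simple tensor shapes are assumed, and a real implementation would handle devices, intrinsics batching, and edge cases more carefully.

```python
import torch
import torch.nn.functional as F

def synthesize_target(ctx_image, target_depth, T_ctx_from_target, K):
    """Warp a context image into the target view using predicted depth and pose.
    Shapes (assumed): ctx_image (B,3,H,W), target_depth (B,1,H,W),
    T_ctx_from_target (B,4,4), K (B,3,3)."""
    b, _, h, w = ctx_image.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).view(1, 3, -1).expand(b, -1, -1)

    cam_points = torch.linalg.inv(K) @ pix * target_depth.view(b, 1, -1)  # lift to 3D
    cam_points = torch.cat([cam_points, torch.ones(b, 1, h * w)], dim=1)  # homogeneous
    ctx_points = (T_ctx_from_target @ cam_points)[:, :3]                  # rigid transform
    proj = K @ ctx_points
    uv = proj[:, :2] / (proj[:, 2:3] + 1e-7)                              # perspective divide

    # normalize pixel coordinates to [-1, 1] for grid_sample
    uv = uv.view(b, 2, h, w).permute(0, 2, 3, 1)
    uv[..., 0] = uv[..., 0] / (w - 1) * 2 - 1
    uv[..., 1] = uv[..., 1] / (h - 1) * 2 - 1
    return F.grid_sample(ctx_image, uv, padding_mode="border", align_corners=True)
```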

Figure: Proposed Self-Supervised self-calibration architecture. Source: Fang et al. (2022)

In Self-Supervised Camera Self-Calibration from Video, Fang et al. (2022) present one of the earliest Self-Supervised frameworks for camera self-calibration. They use the UCM (Geyer et al. 2000) parametric global central camera model, which uses only five parameters to represent a diverse set of camera geometries, including perspective, fisheye, and catadioptric. Using the method proposed by Usenko et al. (2018), they obtain a fully differentiable network, enabling self-calibration to be learned end-to-end from the aforementioned view-synthesis objective alone.
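
For reference, a hedged sketch of the UCM projection in its five-parameter alpha form, following the parameterization of Usenko et al. (2018); treat the exact formulation as an assumption for illustration.

```python
import numpy as np

def ucm_project(points, fx, fy, cx, cy, alpha):
    """Project 3D points (N, 3) in the camera frame with the Unified Camera Model.
    alpha = 0 reduces to a standard pinhole; larger alpha covers fisheye and
    catadioptric geometries."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    d = np.sqrt(x**2 + y**2 + z**2)
    denom = alpha * d + (1.0 - alpha) * z
    u = fx * x / denom + cx
    v = fy * y / denom + cy
    return np.stack([u, v], axis=-1)
```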

Also, unlike prior frameworks, this one enables learning from videos (sequences of frames) rather than on a per-frame basis. The context images are the previous frames in a given sequence, adding a temporal dimension to the learnt camera parameters.

Figure: Proposed framework for Self-Supervised extrinsic self-calibration. Source: Kanai et al. (2023)

Extending this paradigm to a multi-camera setup, Kanai et al. (2023), in Robust Self-Supervised Extrinsic Self-Calibration, utilize scale-aware depth networks and curriculum learning to estimate accurate, metrically scaled extrinsics from unlabeled image sequences, based solely on photometric consistency as a training objective. Moreover, the authors also report improved monocular depth estimation performance by jointly optimising for ego-motion parameters.

Conclusion

This article provides an overview of recent advancements in Self-Supervised Learning for autonomous driving tasks, focusing on three key areas: monocular depth estimation, ego-motion estimation, and camera self-calibration.

By leveraging geometric and temporal constraints, these techniques can effectively learn from unlabeled data, potentially improving the accuracy and robustness of autonomous driving systems in an industry where labelling has proven expensive.