Importance of Embeddings in Modern Deep Learning

An embedding space is a high-dimensional vector space in which data points are represented as vectors known as embeddings. Unlike raw data representations (such as pixel values for images or characters for text), embedding spaces capture meaningful semantic relationships between data points.

An embedding space is typically populated using a pre-trained model to process raw data. For example, an image model such as ResNet can convert images into embeddings. Converting an image dataset to an embedding space could reveal information about the similarity of various classes. In the text domain, related words like “king” and “queen” might be positioned near each other in the embedding space, while unrelated words like “king” and “bicycle” would be placed far apart.
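As a minimal sketch (assuming a recent PyTorch and torchvision installation), one way to turn images into embeddings is to take a pretrained ResNet and drop its classification head; the file name here is a hypothetical placeholder:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained ResNet-50 and remove its final classification layer,
# so the model outputs a 2048-dimensional embedding instead of class logits.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

# Standard ImageNet preprocessing matching the pretrained weights.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
with torch.no_grad():
    embedding = backbone(preprocess(image).unsqueeze(0))  # shape: (1, 2048)
```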

Figure: Pictorial Representation of an embedding space. Credits: @akshay_pachaar

Embedding spaces have become increasingly important as dataset sizes have grown. They compress high-dimensional raw data into more manageable representations while preserving essential relationships.

Various Distance Measures

Distance measures help us understand how “close” or “far apart” two points are in an embedding space. The choice of distance measure can significantly impact how our models learn and perform. Let’s look at some standard distance measures.

Hamming Distance

Given two equal-length strings or vectors, the Hamming Distance is the number of positions at which the corresponding symbols differ. For example, Saurav and Sourav are both common names in India; however, they differ by a single character at the second position (index 1), so the Hamming Distance between them is 1. Hamming Distance is often used in classical NLP techniques.
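A minimal sketch of computing the Hamming Distance between two equal-length strings (the same logic applies to vectors):

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which the corresponding symbols differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("Saurav", "Sourav"))  # 1 (they differ only at index 1)
```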

Manhattan Distance

The Manhattan or Taxicab Distance is a simple distance function that calculates the distance between two points as if they lay on a grid (like city blocks). Numerically, it is the sum of the absolute differences of the coordinates along each dimension.

Equation: Manhattan Distance between two points. n denotes the number of dimensions
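Written out, for two points p and q in n dimensions:

$$d_{\text{Manhattan}}(p, q) = \sum_{i=1}^{n} \lvert p_i - q_i \rvert$$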

Euclidean Distance

Figure: Calculating the Euclidean Distance. Source

The most fundamental distance function is the Euclidean Distance, which measures the distance between two points in a Euclidean space (one where the axes are mutually perpendicular).

Equation: Euclidean Distance between two points. n denotes the number of dimensions
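In the same notation:

$$d_{\text{Euclidean}}(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$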

It simply calculates the length of the shortest line segment between the two points. Euclidean distance appears in various forms throughout deep learning; for example, the L2 loss and mean squared error are based on the (squared) Euclidean distance between two given data points.

Figure: Comparison of Manhattan and Euclidean Distance. Source

In the figure above, we can see the difference between the Euclidean and Manhattan Distance functions. The green line denotes the line segment connecting the two points; its length is the Euclidean Distance. All the other lines are possible Manhattan paths, and they all have the same length, the Manhattan Distance.
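A small numerical sketch of the comparison, using NumPy:

```python
import numpy as np

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

manhattan = np.sum(np.abs(p - q))          # |3| + |4| = 7
euclidean = np.sqrt(np.sum((p - q) ** 2))  # sqrt(9 + 16) = 5

print(manhattan, euclidean)  # 7.0 5.0 — the straight line is always the shortest
```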

Cosine Similarity

While not strictly a “distance” function, cosine similarity is a widely used similarity measure: it quantifies how close two vectors are based on the angle between them. Using the Euclidean Dot Product formula, we can calculate the cosine of that angle for any two vectors.

Equation: Euclidean Dot Product
Equation: Cosine Similarity
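For two vectors a and b, the Euclidean Dot Product formula $a \cdot b = \lVert a \rVert \, \lVert b \rVert \cos\theta$ gives the cosine similarity:

$$\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$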
Cosine Similarity has seen increased adoption due to the recent surge in the use of vector indexes.

Various Ways to Enforce Similarity in the Embedding Space

Many modern methods rely on enforcing similarities between vector representations of input data in the embedding space to facilitate learning. Let’s look at two such methods.

SimCLR

For a complete example of training a model using SimCLR, refer to our docs.

Chen et al. (2020) proposed a framework for using Contrastive Learning for images in their paper “A Simple Framework for Contrastive Learning of Visual Representations”. SimCLR learns representations by maximizing the agreement between differently augmented views of the same data sample via a contrastive loss in the embedding space.

Figure: SimCLR architecture. Source: Chen et al. (2020).

Given a random image, we generate two “views” of it using a set of image augmentations. These augmented views are encoded into representations by an image backbone model (such as a ResNet). The representations are then passed through a small projector network to produce embeddings, which are enforced to be similar during training by a contrastive loss function that uses cosine similarity to measure the distance between them.

Equation: SimCLR Loss Function
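For reference, the loss (often called NT-Xent) for a positive pair (i, j) takes the following form, where sim denotes cosine similarity, τ is a temperature hyperparameter, and 2N is the number of augmented views in the batch:

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$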

Refer to our article Brief Introduction to Contrastive Learning for a better understanding of this loss function.

VICReg

For a complete example of training a model using VICReg, refer to our docs.

Figure: VICReg Framework. Source: Bardes et al. (2022)

The VICReg framework by Bardes et al. (2022) for training joint embedding architectures is also based on a similar principle of preserving the information content of the embeddings. However, it enforces similarity on multiple levels through three terms: an invariance term that minimizes the mean-squared (Euclidean) distance between the embedding vectors, a variance term that forces the embedding vectors of samples within a batch to be different, and a covariance term that de-correlates the variables of each embedding, preventing an informational collapse in which the variables would vary together or become highly correlated.
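Schematically, the three terms are combined as a weighted sum, where λ, μ, and ν are hyperparameters and s, v, and c denote the invariance, variance, and covariance terms (see Bardes et al. (2022) for their exact definitions):

$$\ell(Z, Z') = \lambda\, s(Z, Z') + \mu\, [v(Z) + v(Z')] + \nu\, [c(Z) + c(Z')]$$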

Multi-modal Models that Enforce Similarity Between Embeddings

CLIP 🌆 + 💬

While most vision models jointly train an image feature extractor and a linear classifier to predict some label, CLIP (Radford et al., 2021) trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. Thus, the model learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the real pairs while minimizing the cosine similarity of the embeddings of the incorrect pairings.
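A minimal sketch of this symmetric contrastive objective, assuming the image and text embeddings for a batch have already been computed (in the actual model the logits are scaled by a learned temperature; the fixed value here is a placeholder):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_embeds: torch.Tensor, text_embeds: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over the cosine-similarity matrix of a batch of (image, text) pairs."""
    # L2-normalize so that dot products equal cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise cosine similarities; entry (i, j) compares image i with text j.
    logits = image_embeds @ text_embeds.t() / temperature

    # The correct pairing for each image/text is the matching index on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # images -> texts
    loss_texts = F.cross_entropy(logits.t(), targets)   # texts -> images
    return (loss_images + loss_texts) / 2
```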

Figure: Overview of the CLIP Architecture. Source: CLIP Radford et al. (2021)

An important point is that CLIP uses a contrastive objective, not a predictive one. This choice of learning paradigm is primarily based on the success of contrastive representation learning over predictive objectives.

ImageBind

Girdhar et al. (2023) proposed an approach to learn a joint embedding across six different modalities (images, text, audio, depth, thermal, and IMU) in IMAGEBIND: One Embedding Space To Bind Them All.

Figure: ImageBind Overview. Source: Girdhar et al. (2023)

The authors aim to learn a single joint embedding space for all modalities by binding them with images, i.e. each modality’s embeddings are aligned to image embeddings. This has the added benefit that the resulting embedding space exhibits a robust emergent zero-shot behaviour that automatically associates pairs of modalities without seeing any paired training data for them: the embedding space aligns a pair of modalities (p, q) even though the model was only trained using the pairs (image, p) and (image, q).

They use the InfoNCE Loss (refer to our article on contrastive learning) to align image embeddings (denoted by a) with other modalities (denoted by b).

Equation: Loss Function for ImageBind. Source: Girdhar et al. (2023)
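Using the notation above, with a_i an image embedding, b_i the corresponding embedding from the other modality, and τ a temperature hyperparameter, the InfoNCE loss takes the form:

$$L_{a,b} = -\log \frac{\exp(a_i^{\top} b_i / \tau)}{\exp(a_i^{\top} b_i / \tau) + \sum_{j \neq i} \exp(a_i^{\top} b_j / \tau)}$$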

DeCUR


Figure: Decoupled common and unique representations across two modalities. Source: Wang et al. (2023)

Wang et al. (2023) explore how to decouple common and unique representations in multi-modal self-supervised learning. Their work is based on the observation that, while aligning different modalities in a shared embedding space has shown success in various multi-modal scenarios, such methods ignore that one modality may hold unique information that cannot be extracted from the other modalities. This forces the model to put potentially orthogonal representations into a joint feature embedding, limiting its capacity to understand different modalities in detail.

Figure: DeCUR Framework. Source: Wang et al. (2023)


Concretely, during training, DeCUR computes normalized cross-correlation matrices between the two modalities’ embeddings: one over the common dimensions and one over the unique dimensions. The common-dimension matrix is driven towards the identity, while the unique-dimension matrix is driven towards zero. This forces the common embeddings to be aligned across modalities while the modality-unique embeddings are pushed apart.

Simply pushing embeddings from different modalities apart would lead to collapse; thus, DeCUR also employs intra-modal learning, which utilizes all embedding dimensions and drives the cross-correlation matrix between two augmented views of the same modality to the identity.
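An illustrative sketch of the cross-modal objective described above; the dimension split, loss weighting, and normalization details are assumptions, so refer to Wang et al. (2023) for the authors' implementation:

```python
import torch

def cross_correlation(za: torch.Tensor, zb: torch.Tensor) -> torch.Tensor:
    """Normalized cross-correlation matrix between two batches of embeddings."""
    za = (za - za.mean(0)) / (za.std(0) + 1e-6)
    zb = (zb - zb.mean(0)) / (zb.std(0) + 1e-6)
    return (za.t() @ zb) / za.size(0)

def decur_cross_modal_loss(z1: torch.Tensor, z2: torch.Tensor, num_common: int) -> torch.Tensor:
    """Drive common dimensions towards identity correlation and unique dimensions towards zero."""
    c_common = cross_correlation(z1[:, :num_common], z2[:, :num_common])
    c_unique = cross_correlation(z1[:, num_common:], z2[:, num_common:])

    # Common dimensions: on-diagonal entries -> 1, off-diagonal entries -> 0.
    on_diag = (torch.diagonal(c_common) - 1).pow(2).sum()
    off_diag = (c_common - torch.diag(torch.diagonal(c_common))).pow(2).sum()

    # Unique dimensions: every entry -> 0, pushing modality-specific parts apart.
    unique = c_unique.pow(2).sum()
    return on_diag + off_diag + unique
```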

Conclusion

Embedding spaces provide a powerful mathematical framework to understand relationships in data through geometric principles. In this article, we saw some of the most commonly used distance measures and how they are used in vision and multi-modal deep learning systems.