Importance of Embeddings in Modern Deep Learning

An embedding space is a high-dimensional mathematical space where data points are represented as vectors known as embeddings. Unlike raw data representations (such as pixel values for images or characters for text), embedding spaces capture meaningful semantic relationships between data points.

An embedding space is typically populated using a pre-trained model to process raw data. For example, an image model such as ResNet can convert images into embeddings. Converting an image dataset to an embedding space could reveal information about the similarity of various classes. In the text domain, related words like “king” and “queen” might be positioned near each other in the embedding space, while unrelated words like “king” and “bicycle” would be placed far apart.

Figure: Pictorial Representation of an embedding space. Credits: @akshay_pachaar

Embedding spaces have become increasingly important as dataset sizes have grown. They compress high-dimensional raw data into more manageable representations while preserving essential relationships.

Various Distance Measures

Distance measures help us understand how “close” or “far apart” two points are in an embedding space. The choice of distance measure can significantly impact how our models learn and perform. Let’s look at some standard distance measures.

Hamming Distance

Given two equal-length strings or vectors, we can find the number of positions at which a given symbol differs. This metric is known as the Hamming Distance. For example, Saurav and Sourav are both common names in India; however, they have a single character difference at the second position (index 1); therefore, the Hamming Distance between them is 1. Hamming Distance is often used in classical NLP techniques.
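As a minimal sketch of the definition above, the function below counts mismatched positions. The name `hamming_distance` is only illustrative and not taken from any particular library.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("Saurav", "Sourav"))          # 1
print(hamming_distance([1, 0, 1, 1], [0, 0, 1, 0]))  # 2
```

The same function works for strings and for binary vectors, since it only compares elements position by position.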

Manhattan Distance

The Manhattan or Taxicab Distance is a simple distance function that calculates the distance between two points as if they lie on a grid (like the street grid of a city). Numerically, it is the sum of the absolute differences between the two points along each dimension.

Equation: Manhattan Distance between two points. n denotes the number of dimensions
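A short sketch of this sum of absolute differences, written in plain Python; the function name is illustrative.

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two points."""
    if len(p) != len(q):
        raise ValueError("Points must have the same number of dimensions")
    return sum(abs(a - b) for a, b in zip(p, q))


print(manhattan_distance([1, 2, 3], [4, 0, 3]))  # 3 + 2 + 0 = 5
```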

Euclidean Distance

Figure: Calculating the Euclidean Distance. Source

The most fundamental distance function is the Euclidean Distance function, which can be used to find the distance between two points in a Euclidean space (with mutually perpendicular coordinate axes).

Equation: Euclidean Distance between two points. n denotes the number of dimensions

It simply calculates the length of the shortest line segment between the points. Euclidean distance appears in various forms throughout deep learning; for example, the L2 loss and mean squared error are both built on the squared Euclidean distance between a prediction and its target.
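To make the link to mean squared error concrete, here is a small sketch with illustrative function names. It shows that the MSE between two vectors is the squared Euclidean distance divided by the number of dimensions.

```python
import math


def euclidean_distance(p, q):
    """Length of the straight line segment between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_squared_error(p, q):
    """Average of the squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


p, q = [0.0, 0.0], [3.0, 4.0]
d = euclidean_distance(p, q)
mse = mean_squared_error(p, q)
print(d)    # 5.0
print(mse)  # 12.5, i.e. d ** 2 / 2
```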

Figure: Comparison of Manhattan and Euclidean Distance. Source

In the figure above, we can see the difference between the Euclidean and Manhattan Distance functions. The green line is the line segment connecting the two points; its length is the Euclidean Distance. The other lines each follow the grid between the same two points; they all have the same length, which is the Manhattan Distance.

Mahalanobis Distance

The Mahalanobis distance is worth a mention even though it doesn't calculate the distance between two points of a given vector space. Instead, it is used to determine the distance between a point and a distribution. It is a more general form of the standard score (z-score) from statistics and, unlike Euclidean distance, it is not misled by strong correlation between dimensions. It can be particularly helpful in determining whether a given point is "out-of-distribution" (OOD). However, while it is more robust at detecting OOD points, it is not straightforward to implement and is compute-intensive: inverting the covariance matrix is cubic in the number of dimensions.

Figure: Demonstration of relevance of Mahalanobis Distance. Source: @leopd on Twitter/X

As the above figure demonstrates, the point is clearly OOD; however, if Euclidean distance were used, it would not appear any different from the other outer points.

Equation: Mahalanobis Distance between a point and a distribution Q with covariance matrix S.
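The following NumPy sketch illustrates the idea under one assumption: the distribution is summarised by a mean and covariance estimated from a set of samples. The function name and the toy data are illustrative. Two query points have the same Euclidean norm, but one lies along the direction of correlation and the other against it.

```python
import numpy as np


def mahalanobis_distance(x, samples):
    """Distance from point x to the distribution estimated from `samples`.

    `samples` has shape (num_observations, num_dimensions).
    """
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    diff = x - mean
    # Solve S z = diff rather than forming the inverse of S explicitly.
    z = np.linalg.solve(cov, diff)
    return float(np.sqrt(diff @ z))


rng = np.random.default_rng(0)
# Strongly correlated 2-D data, stretched along the y = x diagonal.
samples = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=2000)

on_axis = np.array([2.0, 2.0])    # lies along the correlation direction
off_axis = np.array([2.0, -2.0])  # same Euclidean norm, against the correlation

print(mahalanobis_distance(on_axis, samples))
print(mahalanobis_distance(off_axis, samples))  # noticeably larger
```

Both points are equally far from the origin in Euclidean terms, yet the second one is far less typical of the data, which is what the Mahalanobis distance picks up.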

Cosine Similarity

While not necessarily a “distance” function, cosine similarity is a widely used “measure” function (which determines the separation between two points). Using the Euclidean Dot Product formula, we can calculate the cosine of the angle between any two vectors. Because it depends only on direction, it ignores differences in vector length.

Equation: Euclidean Dot Product

Equation: Cosine Similarity

Cosine Similarity has seen increased adoption due to the recent surge in the use of vector indexes.
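A minimal sketch of cosine similarity as the dot product divided by the product of the vector norms; the function name is illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0: same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0: orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0: opposite direction
```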

Negative Arc Length

A simple extension of the cosine similarity function was proposed in Geometric Contrastive Learning by Koishekenov et al. (2023), wherein instead of simply calculating the cosine between two given embeddings, we calculate the distance between them when projected onto a hypersphere. This arises naturally since recent methods normalise the embeddings to unit L2 norm. Thus, all embeddings are mapped onto a unit hypersphere, and the geodesic distance (arc length) between them can be used. Moreover, subtracting the arc length, scaled by π, from 1 converts the distance metric into a similarity function with range [0, 1]. This forms a key distinction from the cosine similarity function, which is bounded in [-1, 1].

Equation: Negative Arc Length
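A small sketch of this similarity, assuming it takes the form 1 - arccos(u · v) / π for unit-normalised vectors. The function name and the explicit normalisation step are illustrative choices rather than the authors' implementation.

```python
import math


def negative_arc_length(u, v):
    """Similarity in [0, 1] from the geodesic angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    cos = max(-1.0, min(1.0, cos))  # clamp against floating-point rounding
    return 1.0 - math.acos(cos) / math.pi


print(negative_arc_length([1.0, 0.0], [1.0, 0.0]))   # 1.0: identical direction
print(negative_arc_length([1.0, 0.0], [0.0, 1.0]))   # 0.5: orthogonal
print(negative_arc_length([1.0, 0.0], [-1.0, 0.0]))  # 0.0: opposite direction
```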

Various Ways to Enforce Similarity in the Embedding Space

Many modern methods rely on enforcing similarities between vector representations of input data in the embedding space to facilitate learning. Let’s look at two such methods.

SimCLR

For a complete example of training a model using SimCLR, refer to our docs.

Chen et al. (2020) proposed a framework for using Contrastive Learning for images in their paper “A Simple Framework for Contrastive Learning of Visual Representations”. SimCLR learns representations by maximizing the agreement between differently augmented views of the same data sample via a contrastive loss in the embedding space.

Figure: SimCLR architecture. Source: Chen et al. (2020).

Given a random image, we generate two “views” of the image using a set of image augmentations. These augmented views are then encoded into representations using an image backbone model that is trained jointly with the rest of the framework. These representations are then passed through a small projector network to generate embeddings. During training, a contrastive loss function that uses cosine similarity as its measure pulls the two embeddings of the same image together while pushing them away from the embeddings of the other images in the batch.

Equation: SimCLR Loss Function
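The loss can be sketched in NumPy as follows. This is a simplified illustration of the normalised temperature-scaled cross-entropy (NT-Xent) loss rather than a reference implementation; the function name and the default `temperature` are assumptions made for the example.

```python
import numpy as np


def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss for a batch of N positive pairs (z1[i], z2[i])."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit norm: dot product = cosine
    sim = z @ z.T / temperature                       # shape (2N, 2N)
    np.fill_diagonal(sim, -np.inf)                    # a view is never compared with itself
    # The positive for row i is row i + N, and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Row-wise log-softmax, computed in a numerically stable way.
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())


rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
aligned = nt_xent_loss(z1, z1 + 0.01 * rng.normal(size=z1.shape))
unrelated = nt_xent_loss(z1, rng.normal(size=z1.shape))
print(aligned, unrelated)  # the aligned pairs give the lower loss
```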

Refer to our article, Brief Introduction to Contrastive Learning, for a better understanding of this loss function.

VICReg

For a complete example of training a model using VICReg, refer to our docs.

Figure: VICReg Framework. Source: Bardes et al. (2022)

The VICReg framework by Bardes et al. (2022) for training joint embedding architectures is also based on a similar principle of preserving the information content of the embeddings. However, they enforce the similarity on multiple levels by an invariance term that minimizes the mean-squared distance (Euclidean) between the embedding vectors, a variance term that forces the embedding vectors of samples within a batch to be different, and a covariance term that de-correlates the variables of each embedding and prevents an informational collapse in which the variables would vary together or be highly correlated.
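The three terms can be sketched in NumPy as below. This is an illustrative simplification, not the authors' code: the function name is hypothetical, the default weights (25, 25, 1) follow the values reported in the paper, and the variance target of 1 and the small `eps` are assumptions of this sketch.

```python
import numpy as np


def vicreg_loss(za, zb, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Weighted sum of invariance, variance and covariance terms."""
    n, d = za.shape
    # Invariance: mean squared Euclidean distance between paired embeddings.
    invariance = np.mean(np.sum((za - zb) ** 2, axis=1))

    def variance_term(z):
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        # Hinge: penalise any dimension whose batch standard deviation is below 1.
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance_term(z):
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        # Penalise correlation between different embedding dimensions.
        return np.sum(off_diag ** 2) / d

    variance = variance_term(za) + variance_term(zb)
    covariance = covariance_term(za) + covariance_term(zb)
    return float(sim_w * invariance + var_w * variance + cov_w * covariance)


rng = np.random.default_rng(0)
za = rng.normal(size=(64, 8))
zb = za + 0.05 * rng.normal(size=za.shape)
collapsed = np.ones((64, 8))

print(vicreg_loss(za, zb))
print(vicreg_loss(collapsed, collapsed))  # invariance is 0, but the variance hinge is active
```

The collapsed batch shows why the variance term matters: identical embeddings satisfy the invariance term perfectly, yet they are still heavily penalised.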

Multi-modal Models that Enforce Similarity Between Embeddings

CLIP 🌆 + 💬

While most vision models jointly train an image feature extractor and a linear classifier to predict some label, CLIP (Radford et al., 2021) trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. Thus, the model learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the real pairs while minimizing the cosine similarity of the embeddings of the incorrect pairings.

Figure: Overview of the CLIP Architecture. Source: CLIP Radford et al. (2021)

An important point is that CLIP uses a contrastive objective, not a predictive one. This choice of learning paradigm is primarily based on the success of contrastive representation learning over predictive objectives.
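The symmetric contrastive objective described above can be sketched in NumPy as follows. This is a simplified illustration: the function names are hypothetical, and the temperature is fixed here as an assumption, whereas in CLIP it is a learned parameter.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text cosine similarity matrix."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # logits[i, j]: image i vs text j
    idx = np.arange(logits.shape[0])               # real pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # pick the text for each image
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # pick the image for each text
    return float((loss_i2t + loss_t2i) / 2)


rng = np.random.default_rng(0)
img = rng.normal(size=(6, 32))
txt = img + 0.1 * rng.normal(size=img.shape)
matched = clip_style_loss(img, txt)
mismatched = clip_style_loss(img, np.roll(txt, 1, axis=0))
print(matched, mismatched)  # correctly paired batches give the lower loss
```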

ImageBind

Girdhar et al. (2023) proposed an approach to learn a joint embedding across six different modalities (images, text, audio, depth, thermal, and IMU) in IMAGEBIND: One Embedding Space To Bind Them All.

Figure: ImageBind Overview. Source: Girdhar et al. (2023)

The authors aim to learn a single joint embedding space for all modalities by binding them with images: each modality’s embedding is aligned to image embeddings. This has the added benefit that the resulting embedding space shows robust emergent zero-shot behaviour, automatically associating pairs of modalities that were never observed together during training. That is, the embedding space aligns a pair of modalities (M1, M2) even though the model was only trained using the pairs (image, M1) and (image, M2).

They use the InfoNCE Loss (refer to our article on contrastive learning) to align image embeddings (denoted by a) with other modalities (denoted by b).

Equation: Loss Function for ImageBind. Source: Girdhar et al. (2023)
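For reference, the InfoNCE loss can be written in the notation used here as follows. This is a reconstruction under stated assumptions rather than a copy of the paper's typesetting: a_i and b_i are the normalised embeddings of the i-th paired example in a batch, τ is a temperature, and the other examples j ≠ i in the batch act as negatives.

```latex
L_{a,b} = -\log
  \frac{\exp\left(a_i^{\top} b_i / \tau\right)}
       {\exp\left(a_i^{\top} b_i / \tau\right)
        + \sum_{j \neq i} \exp\left(a_i^{\top} b_j / \tau\right)}
```

In practice the loss is applied symmetrically, as L_{a,b} + L_{b,a}.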

DeCUR


Figure: Decoupled common and unique representations across two modalities. Source: Wang et al. (2023)

Wang et al. (2023) explore how to decouple common and unique representations in multi-modal self-supervised learning. Their work is based on the observation that, while aligning different modalities in a shared embedding space has shown success in various multi-modal scenarios, such approaches ignore that one modality may hold unique information that cannot be extracted from the other modalities. This forces the model to put potentially orthogonal representations into joint feature embeddings, limiting the model’s capacity to understand different modalities in detail.

Figure: DeCUR Framework. Source: Wang et al. (2023)


Concretely, during training, DeCUR calculates the normalized cross-correlation matrix of the common dimensions and the unique dimensions between two modalities and drives the matrix of the common dimensions to identity while driving the matrix of the unique dimensions to zero. This forces common embeddings to be aligned across modalities while modality-unique embeddings are pushed away.

Simply pushing embeddings from different modalities apart would lead to collapse; thus, DeCUR also employs intra-modal learning, which utilizes all embedding dimensions and drives the cross-correlation matrix between two augmented views of the same modality to the identity.
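The cross-modal part of this objective can be sketched in NumPy as below. This is a simplified illustration and not the authors' implementation: the function names, the assumption that the first `n_common` dimensions are the common ones, and the off-diagonal weight `lam` are all choices made for the example. The intra-modal term is omitted.

```python
import numpy as np


def cross_correlation(za, zb):
    """Normalised cross-correlation matrix between two (N, D) embedding batches."""
    za = (za - za.mean(axis=0)) / za.std(axis=0)
    zb = (zb - zb.mean(axis=0)) / zb.std(axis=0)
    return za.T @ zb / za.shape[0]


def off_diagonal_sq_sum(m):
    """Sum of squared off-diagonal entries of a square matrix."""
    return np.sum(m ** 2) - np.sum(np.diag(m) ** 2)


def decur_cross_modal_loss(za, zb, n_common, lam=0.005):
    c = cross_correlation(za, zb)
    c_common = c[:n_common, :n_common]
    c_unique = c[n_common:, n_common:]
    # Common dimensions: drive the matrix towards the identity.
    loss_common = np.sum((1.0 - np.diag(c_common)) ** 2) + lam * off_diagonal_sq_sum(c_common)
    # Unique dimensions: drive the matrix towards zero.
    loss_unique = np.sum(np.diag(c_unique) ** 2) + lam * off_diagonal_sq_sum(c_unique)
    return float(loss_common + loss_unique)


rng = np.random.default_rng(0)
shared = rng.normal(size=(512, 4))
za = np.hstack([shared, rng.normal(size=(512, 4))])
zb = np.hstack([shared, rng.normal(size=(512, 4))])  # same common part, independent unique part

decoupled = decur_cross_modal_loss(za, zb, n_common=4)
fully_aligned = decur_cross_modal_loss(za, za, n_common=4)
print(decoupled, fully_aligned)  # aligning the unique dimensions is penalised
```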

PaLM-E: Multimodal Language Model for Embodied Reasoning

While multimodal models have been performing increasingly well on benchmarks, their real-life use is often hindered by their performance on reasoning tasks such as robot planning. With PaLM-E, Driess et al. (2023) attempt to create a model that handles such tasks by operating on multimodal sentences, i.e. sequences of tokens in which inputs from arbitrary modalities (images or neural 3D representations) are inserted alongside text. This allows for a direct integration of the rich semantic knowledge stored in pre-trained LLMs into the planning process.

Figure: Capabilities of PaLM-E. Source: Driess et al. (2023)

The main idea is to inject tokens from images and sensor modalities into the language embedding space of a pre-trained language model. These multimodal sentences can then be processed by the self-attention layers of a Transformer-based LLM in the same way as text. This allows the model to be used for planning tasks and robotic manipulation.
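A toy NumPy sketch of this injection step is shown below. Everything in it is a stand-in chosen for illustration: the random matrices play the role of learned parameters, the single linear projection is an assumption about how encoder features are mapped into the language embedding space, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, d_image = 100, 16, 32

# Hypothetical stand-ins for learned parameters.
token_embedding = rng.normal(size=(vocab_size, d_model))  # the LLM's word-embedding table
image_projection = rng.normal(size=(d_image, d_model))    # maps image features into the LLM space


def build_multimodal_sentence(prefix_ids, image_features, suffix_ids):
    """Interleave text-token embeddings with projected image features."""
    prefix = token_embedding[prefix_ids]              # (P, d_model)
    image_tokens = image_features @ image_projection  # (K, d_model)
    suffix = token_embedding[suffix_ids]              # (S, d_model)
    return np.concatenate([prefix, image_tokens, suffix], axis=0)


image_features = rng.normal(size=(4, d_image))  # e.g. 4 patch features from a vision encoder
sequence = build_multimodal_sentence([5, 17], image_features, [42, 8, 3])
print(sequence.shape)  # (9, 16)
```

The resulting sequence has the same shape as a purely textual one, so the Transformer layers that follow do not need to distinguish between text tokens and injected tokens.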

Figure: PaLM-E being used for determining low-level policies of robots. Source: Driess et al. (2023)

FIND: Interfacing Foundation Models’ Embeddings

Zou et al. (NeurIPS 2024) attempt to use foundation model embeddings to expand the output space of LLMs and unlock their potential for interleaved understanding and reasoning. Their method processes embeddings from vision and language foundation models and outputs segmentation, grounding, and retrieval results. This requires an interleaved shared embedding space where vision and language references can be interchanged and augmented.

Figure: Multimodal vs Interleave. Source: Zou et al. (NeurIPS 2024)

Conclusion

Embedding spaces have emerged as a fundamental concept in modern deep learning, transforming how we represent and process information across multiple modalities. Their importance extends far beyond simple dimensionality reduction or feature representation – they serve as the bridge between raw data and meaningful semantic understanding. The evolution of embedding spaces has been particularly remarkable in the context of multi-modal learning, from CLIP's paired image-text embeddings to ImageBind's six-modality unified space.