CLIP and Friends: How Vision-Language Models Evolved
If you’ve been following the rapid pace of AI advancements, you’ve likely noticed the growing prevalence of language models with multimodal capabilities—models that can process multiple input types, such as images and audio (with PDFs often parsed as text). While multimodal models might feel omnipresent today, their roots are relatively recent. One pivotal development in this space is the CLIP (Contrastive Language-Image Pre-training) family of models, introduced in Radford et al.’s 2021 paper “Learning Transferable Visual Models From Natural Language Supervision.” These models fundamentally reshaped how we approach image representation learning and have been instrumental in OpenAI’s rise as a leader in AI research.
Before CLIP, traditional computer vision models were typically task-specific, often limited to producing outputs from predefined classes. This constrained their usability and generalization. Fine-tuning required carefully curated datasets, making open-source adoption more challenging. CLIP, however, took an entirely different approach: it demonstrated that learning which captions match which images is a scalable, effective way to generate high-quality image representations.
This innovation aligned with breakthroughs in natural language processing, particularly the transformer architecture, which had already shown that models trained on task-agnostic objectives like Masked Language Modelling (MLM) could outperform supervised techniques. CLIP leveraged this principle to show that natural language supervision could serve as a robust signal for image representation, unlocking zero-shot capabilities without requiring specialised output heads.
This also highlights how much early vision models were limited by static softmax classifiers with fixed output heads. Moreover, if models trained on raw, unlabelled text can outperform plainly supervised models, the same idea unlocks significant scaling through web-crawled data: it is far easier to collect natural language paired with images than to annotate image datasets in a rigid “1-of-N” label format.
After extensive scaling studies, CLIP was released as a series of models spanning roughly two orders of magnitude of compute, and transfer performance was observed to be a smoothly predictable function of that compute. Also, like transformer models pre-trained with masked language modelling, CLIP models learn to perform a wide set of tasks during pre-training without explicit instructions.
CLIP: Contrastive Language-Image Pre-training
“Learning perception from supervision contained in natural language”
CLIP’s key innovation lies in leveraging natural language supervision, which facilitates the learning of image representations and establishes a direct connection between image and language. This enables flexible zero-shot transfer to a wide range of downstream tasks—a major advantage over traditional supervised, unsupervised, or self-supervised learning approaches. Unlike earlier supervised models, which were constrained to classifying images into fixed categories (e.g., 1,000 classes in ImageNet), CLIP learns an open set of visual concepts through its connection to natural language.
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a5449300f4_67934fb02b27add3380eeb94_Screenshot%25202025-01-19%2520at%252022.39.19.png)
Instead of directly predicting captions or predefined labels for images, CLIP is trained on batches of N image-text pairs to identify which of the N×N possible pairings actually occurred. The approach scales efficiently: the model maximizes the cosine similarity between the embeddings of the N matching (positive) pairs while minimizing the similarity of the $N^2 - N$ mismatched (negative) pairs, using a symmetric cross-entropy loss over the similarity scores. This creates a shared multimodal embedding space optimized for alignment between visual and textual representations.
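To make the objective concrete, here is a minimal sketch of that symmetric contrastive loss in PyTorch, assuming the two encoders have already produced batch-aligned embeddings in the shared space; the function name and the fixed temperature are illustrative (CLIP actually learns the temperature as a parameter).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over the N x N similarity matrix.

    image_emb, text_emb: [N, d] batch-aligned embeddings from the two
    encoders, already projected into the shared space.
    """
    # L2-normalise so dot products equal cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [N, N] pairwise similarities; CLIP learns the temperature, fixed here
    logits = image_emb @ text_emb.t() / temperature

    # The N matching pairs sit on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_img = F.cross_entropy(logits, targets)       # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_img + loss_txt) / 2
```

Each row and each column of the similarity matrix is treated as an N-way classification problem whose correct answer is the diagonal entry, which is exactly the symmetric objective described above.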
The architecture consists of separate image and text encoders, which are connected by a simple linear projection layer (affine transformation) to map their outputs into the shared embedding space. For image encoding, the authors experimented with modified ResNet and Vision Transformer (ViT) architectures, while the text encoder used a standard transformer model with Byte Pair Encoding (BPE). Interestingly, the authors noted that CLIP’s performance is relatively insensitive to the capacity of the text encoder.
During training, image augmentation is minimal and limited to random square cropping. The overall objective is clear: CLIP is pre-trained to determine whether an image and a text snippet belong together within the dataset. This task-agnostic framework makes it highly versatile for downstream applications.
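In practice, zero-shot classification with a pre-trained CLIP reduces to comparing an image embedding against the embeddings of a handful of candidate captions. The snippet below is a small usage sketch built on the Hugging Face transformers implementation; the checkpoint name, input image, and candidate labels are just examples.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, turned into a distribution over the labels
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```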
BLIP: Bootstrapping Pre-training for Understanding and Generation
CLIP inspired a wave of follow-up models, but many focused exclusively on either understanding or generation tasks. Most relied on dataset scaling with noisy image-text pairs. To address these limitations, Salesforce introduced the BLIP family of models in the paper “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation” (Li et al., 2022). BLIP employs a novel Multimodal Mixture of Encoder-Decoder (MED) architecture designed for efficient multi-task pre-training using a combination of loss functions. This architecture supports three distinct modes:
- Unimodal Encoder (Understanding): Encodes a single modality (image or text) into intermediate representations. Like CLIP, it uses an image-text contrastive (ITC) loss to align vision and language embeddings.
- Image-Grounded Text Encoder (Understanding): This encoder incorporates cross-attention layers to model interactions between modalities and is trained using an image-text matching (ITM) loss. This binary classification task determines whether a given image-text pair is a positive match.
- Image-Grounded Text Decoder (Generation): Replaces bidirectional self-attention with causal self-attention, enabling it to generate text (e.g., captions). This model is trained with a language modelling (LM) loss.
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a5449300fa_67934fd2368238a683656c8b_Screenshot%25202025-01-19%2520at%252023.15.01.png)
During pre-training, all three objectives are jointly optimised. A forward pass consists of a single pass through the image encoder (a vision transformer) and three separate passes through the text model, one for each loss. To keep this efficient, the text encoder and decoder share all parameters except the self-attention layers.
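The sketch below shows how the three objectives might be combined in a single training step, assuming the shared MED backbone has already produced the relevant unimodal, fused, and decoder outputs; the tensor names and shapes are illustrative, not BLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def blip_losses(image_emb, text_emb, fused_emb, match_labels,
                lm_logits, caption_ids, itm_head, temperature=0.07):
    """Combine BLIP's three objectives in one step (heavily simplified).

    Assumed inputs from the shared MED backbone:
      image_emb, text_emb    : [N, d] unimodal embeddings (ITC)
      fused_emb              : [M, d] image-grounded encoder outputs (ITM)
      match_labels           : [M]    1 = matched pair, 0 = mismatched
      lm_logits, caption_ids : [N, L, V] and [N, L] for the decoder (LM)
      itm_head               : a small classifier, e.g. nn.Linear(d, 2)
    """
    # 1) Image-text contrastive (ITC): symmetric cross-entropy, as in CLIP
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t() / temperature
    targets = torch.arange(sims.size(0), device=sims.device)
    itc = (F.cross_entropy(sims, targets) + F.cross_entropy(sims.t(), targets)) / 2

    # 2) Image-text matching (ITM): binary classification on fused features
    itm = F.cross_entropy(itm_head(fused_emb), match_labels)

    # 3) Language modelling (LM): next-token prediction for caption generation
    lm = F.cross_entropy(lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
                         caption_ids[:, 1:].reshape(-1))

    # All three losses are optimised jointly
    return itc + itm + lm
```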
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a5449300f7_67934ff2b8785d5f76cf6240_Screenshot%25202025-01-19%2520at%252023.27.05.png)
To reduce the number of noisy image-text pairs from web-crawled datasets, the authors propose a captioning + filtering (CapFilt) system. A captioner generates synthetic captions for web images, and a filter removes noisy image-text pairs (both the original web texts and the synthetically generated ones). The captioner and the filter are fine-tuned variants of the MED (the image-grounded text decoder and the image-grounded text encoder, respectively).
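As a point of reference, the fine-tuned captioner behaves much like the publicly released BLIP captioning checkpoints. The snippet below is a usage sketch with the Hugging Face transformers implementation; the checkpoint name and the input image are placeholders.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative checkpoint; other BLIP captioning checkpoints work the same way
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("web_image.jpg")  # a (possibly noisily captioned) web image

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))  # synthetic caption
```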
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a5449300fd_6793500faf76dce347ee1f22_Screenshot%25202025-01-19%2520at%252023.28.30.png)
As you may have noticed, both CLIP and BLIP train their encoders end-to-end during pre-training. As we move into the era of billion-parameter models, this is not always feasible. Wouldn't it be better to leverage off-the-shelf pre-trained vision and language models?
BLIP-2: Leveraging Off-the-Shelf Frozen Models
BLIP-2, the follow-up work by Li et al., aims to create a generic training framework that builds on off-the-shelf models for vision-language pre-training. Since the encoders themselves are kept frozen, the crux is aligning their output representations with each other, which is difficult because the LLM has never seen images during its own pre-training.
To alleviate this issue, they propose using a lightweight transformer called the Querying Transformer (Q-Former). This lightweight model is trained using a two-stage strategy and acts as an information bottleneck between the frozen image encoder and the frozen language model. In the first pre-training phase, Q-Former is trained to perform vision-language representation learning to learn which visual concepts are the most relevant for a given text. In the second phase, the model learns to perform vision-language generation by connecting the output of the Q-Former to the frozen LLM.
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a544930128_6793504b296615ddce65ed62_Screenshot%25202025-01-20%2520at%252000.08.18.png)
The Q-Former is trained to extract a fixed number of output features from the image encoder, independent of the input image resolution. It consists of two transformer sub-modules that share the same self-attention layers:
- An Image Transformer that interacts with the frozen image encoder for visual feature extraction.
- A Text Transformer that acts as both a text encoder and a text decoder.
In total, the Q-Former consists of only 188M parameters, and this bottleneck is enough to force the learned queries to extract the visual information most relevant to the text.
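To give a feel for the mechanism, here is a minimal sketch of a single Q-Former-style block: a small set of learned query tokens self-attends and then cross-attends to the frozen image features, producing a fixed number of outputs regardless of how many image patches come in. Dimensions and layer choices are illustrative, not the exact BLIP-2 configuration.

```python
import torch
import torch.nn as nn

class TinyQFormerBlock(nn.Module):
    """A single Q-Former-style block: learned queries cross-attend to frozen
    image features. Sizes are illustrative; the real Q-Former is a BERT-sized
    stack of such blocks totalling ~188M parameters."""

    def __init__(self, num_queries=32, dim=768, image_dim=1024, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=image_dim,
                                                vdim=image_dim, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, image_feats):                       # image_feats: [B, P, image_dim]
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # [B, num_queries, dim]
        q = q + self.self_attn(q, q, q)[0]                # queries interact with each other
        q = q + self.cross_attn(q, image_feats, image_feats)[0]  # pull in visual information
        return q + self.ffn(q)                            # fixed-size output: [B, num_queries, dim]

# e.g. 257 frozen ViT patch features of width 1024 -> 32 query features of width 768
out = TinyQFormerBlock()(torch.randn(2, 257, 1024))
```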
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a54493012b_67935065c469ef90ac34f87b_Screenshot%25202025-01-20%2520at%252000.11.06.png)
SigLIP: Optimising the loss function for better scaling
The loss employed during CLIP pre-training is a symmetric, softmax-based contrastive objective: it encourages the image and text embeddings of matching pairs to align while pushing apart the embeddings of mismatched pairs. The softmax is applied twice over the batch (image-to-text and text-to-image) to normalise the pairwise similarity scores. A naive implementation is numerically unstable, and the usual stabilisation requires a second pass over the whole batch, which makes the loss hard to implement in a distributed setting and couples it tightly to the global batch size. Zhai et al. 2023, in their paper “Sigmoid Loss for Language Image Pre-Training”, propose a simpler alternative that avoids computing batch-wide normalisation factors. The sigmoid-based loss turns every image-text pair into an independent binary classification problem, yielding a symmetric loss that is more stable, requires only a single pass, and uses less memory than the softmax loss. A learnable bias term counteracts the heavy imbalance between positives and negatives early in training, and the computation can be parallelised efficiently with a “chunked” approach.
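Below is a minimal sketch of the pairwise sigmoid loss, assuming embeddings from the two towers; the temperature and bias are learnable in the paper and are shown here as constants matching its suggested initialisation.

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_emb: torch.Tensor,
                text_emb: torch.Tensor,
                t: float = 10.0,
                b: float = -10.0) -> torch.Tensor:
    """Pairwise sigmoid loss (simplified). t and b are learnable in the paper;
    the defaults mirror its suggested initialisation (t = 10, b = -10)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() * t + b              # [N, N]
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1

    # Every pair is an independent binary classification; no batch-wide softmax
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```

Because each pair contributes an independent term, the loss can be computed chunk by chunk across devices without a global normalisation step.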
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a54493010c_67935082bbeedb30aededb75_Screenshot%25202025-01-20%2520at%252000.33.05.png)
SigLIP has since replaced CLIP as the most common choice of pre-trained vision-language backbone.
Domain-Specific and Data-Efficient Training
If BLIP needs three forward passes through expensive modules to deal with noisy image-text pairs, and the reliance on paired data limits CLIP, can't we eliminate the need for image-text pairs altogether? Yes, as long as some relationship between images and texts can still be established.
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a54493014a_679350ac375e4b96e2f0b30d_Screenshot%25202025-01-20%2520at%252001.27.44.png)
Wang et al. 2022 attempted to apply the CLIP framework to vision-language pre-training on medical datasets. In their paper “MedCLIP: Contrastive Learning from Unpaired Medical Images and Text”, they criticise the CLIP framework for its dependence on image-text pairs. Medical image-text datasets are orders of magnitude smaller than general-domain ones because of the extremely high annotation costs in the medical industry. Moreover, in the medical domain, many pairs treated as negatives under the standard CLIP objective are in fact false negatives that still share semantic meaning. For example, a different view of the same patient's X-ray might not share the same caption but still carries the same underlying semantics (it might show the same artefact).
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a54493010f_679350c6b8785d5f76d05a21_Screenshot%25202025-01-20%2520at%252001.26.56.png)
MedCLIP decouples the dependency on image-text pairs, allowing the pre-training process to cover unpaired image and text datasets. This lets us use all existing medical image-text, image-only and text-only datasets. But how do we use image-only and text-only datasets within the CLIP paradigm?
The authors use traditional entity extraction techniques to obtain medical-entity labels for both the image-only and text-only datasets. These labels define soft targets that are used in a semantic matching loss to unify all the datasets.
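The sketch below illustrates the idea of soft targets, assuming each image and each text has been tagged with a multi-hot vector of extracted medical entities; the names, shapes, and exact form of the loss are illustrative rather than MedCLIP's precise implementation.

```python
import torch
import torch.nn.functional as F

def semantic_matching_loss(image_emb, text_emb, image_labels, text_labels, t=0.07):
    """Semantic matching with soft targets (simplified sketch).

    image_labels / text_labels: [N, K] multi-hot vectors of extracted medical
    entities (e.g. findings in a radiology report).
    """
    # Soft targets: semantic similarity between the two sets of entity labels
    sim = F.normalize(image_labels.float(), dim=-1) @ F.normalize(text_labels.float(), dim=-1).t()
    soft_targets = sim.softmax(dim=-1)                     # each row sums to 1

    # Predicted image-text similarity distribution in embedding space
    logits = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t() / t
    log_probs = logits.log_softmax(dim=-1)

    # Cross-entropy against soft targets instead of a hard diagonal
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```

Pairs that share entities (for example, two X-rays with the same finding) receive non-zero target weight instead of being pushed apart as hard negatives.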
Conclusion
In this article, we covered some of the early advancements in multimodal learning, particularly the CLIP family of models. These models used natural language supervision at scale to learn open-ended visual representations, allowing us to move away from predefined classification models toward more flexible, zero-shot capable models that understand visual concepts through natural language descriptions.
BLIP demonstrated that understanding and generation can be combined in a single architecture with superior results, while BLIP-2 efficiently leveraged existing off-the-shelf pre-trained (frozen) models through a new lightweight transformer, the Q-Former.
References
- Learning Transferable Visual Models From Natural Language Supervision. Radford et al., 2021
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. Li et al., 2022
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Li et al., 2023
- Sigmoid Loss for Language Image Pre-Training. Zhai et al., 2023
- MedCLIP: Contrastive Learning from Unpaired Medical Images and Text. Wang et al., 2022