CLIP and Friends: How Vision-Language Models Evolved
If you’ve been following the rapid pace of AI advancements, you’ve likely noticed the growing prevalence of language models with multimodal capabilities—models that can process multiple input types, such as images and audio (with PDFs often parsed as text). While multimodal models might feel omnipresent today, their roots are relatively recent. One pivotal development in this space is the CLIP (Contrastive Language-Image Pre-training) family of models, introduced in Radford et al.’s 2021 paper “Learning Transferable Visual Models From Natural Language Supervision.” These models fundamentally reshaped how we approach image representation learning and have been instrumental in OpenAI’s rise as a leader in AI research.
Before CLIP, traditional computer vision models were typically task-specific, often limited to producing outputs from predefined classes. This constrained their usability and generalization. Fine-tuning required carefully curated datasets, making open-source adoption more challenging. CLIP, however, took an entirely different approach: it demonstrated that learning which captions match which images is a scalable, effective way to generate high-quality image representations.
This innovation aligned with breakthroughs in natural language processing, particularly the transformer architecture, which had already shown that models trained on task-agnostic objectives like Masked Language Modelling (MLM) could outperform supervised techniques. CLIP leveraged this principle to show that natural language supervision could serve as a robust signal for image representation, unlocking zero-shot capabilities without requiring specialised output heads.
This also highlights how much early vision models were limited by static softmax classifiers with fixed output heads. Moreover, if models trained on raw, unlabelled text can outperform plainly supervised models, the same idea unlocks significant scaling through web-crawled data: it is far easier to collect natural language paired with images than to annotate image datasets in a rigid “1-of-N” label format.
After extensive scaling studies, CLIP was released as a series of models spanning roughly two orders of magnitude of compute, and transfer performance was observed to be a smoothly predictable function of that compute. Also, like transformer models pre-trained with masked language modelling, CLIP models learn to perform a wide set of tasks during pre-training without explicit instructions.
CLIP: Contrastive Language-Image Pre-training
“Learning perception from supervision contained in natural language”
CLIP’s key innovation lies in leveraging natural language supervision, which facilitates the learning of image representations and establishes a direct connection between image and language. This enables flexible zero-shot transfer to a wide range of downstream tasks—a major advantage over traditional supervised, unsupervised, or self-supervised learning approaches. Unlike earlier supervised models, which were constrained to classifying images into fixed categories (e.g., 1,000 classes in ImageNet), CLIP learns an open set of visual concepts through its connection to natural language.
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a5449300f4_67934fb02b27add3380eeb94_Screenshot%25202025-01-19%2520at%252022.39.19.png)
Instead of directly predicting captions or predefined labels for images, CLIP is trained on batches of N image-text pairs to identify which of the N×N possible pairings actually occurred. The approach scales efficiently: the model maximizes the cosine similarity between the embeddings of the N matching (positive) pairs while minimizing the similarity of the $N^2 - N$ mismatched (negative) pairs, using a symmetric cross-entropy loss over the similarity scores. This creates a shared multimodal embedding space optimized for alignment between visual and textual representations.
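To make the objective concrete, here is a minimal sketch of that symmetric contrastive loss in PyTorch, assuming the two encoders have already produced batch-aligned embeddings in the shared space; the function name and the fixed temperature are illustrative (CLIP actually learns the temperature as a parameter).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over the N x N similarity matrix.

    image_emb, text_emb: [N, d] batch-aligned embeddings from the two
    encoders, already projected into the shared space.
    """
    # L2-normalise so dot products equal cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [N, N] pairwise similarities; CLIP learns the temperature, fixed here
    logits = image_emb @ text_emb.t() / temperature

    # The N matching pairs sit on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_img = F.cross_entropy(logits, targets)       # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_img + loss_txt) / 2
```

Each row and each column of the similarity matrix is treated as an N-way classification problem whose correct answer is the diagonal entry, which is exactly the symmetric objective described above.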
The architecture consists of separate image and text encoders, which are connected by a simple linear projection layer (affine transformation) to map their outputs into the shared embedding space. For image encoding, the authors experimented with modified ResNet and Vision Transformer (ViT) architectures, while the text encoder used a standard transformer model with Byte Pair Encoding (BPE). Interestingly, the authors noted that CLIP’s performance is relatively insensitive to the capacity of the text encoder.
During training, image augmentation is minimal and limited to random square cropping. The overall objective is clear: CLIP is pre-trained to determine whether an image and a text snippet belong together within the dataset. This task-agnostic framework makes it highly versatile for downstream applications.
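In practice, zero-shot classification with a pre-trained CLIP reduces to comparing an image embedding against the embeddings of a handful of candidate captions. The snippet below is a small usage sketch built on the Hugging Face transformers implementation; the checkpoint name, input image, and candidate labels are just examples.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, turned into a distribution over the labels
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```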
BLIP: Bootstrapping Pre-training for Understanding and Generation
CLIP inspired a wave of follow-up models, but many focused exclusively on either understanding or generation tasks. Most relied on dataset scaling with noisy image-text pairs. To address these limitations, Salesforce introduced the BLIP family of models in the paper “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation” (Li et al., 2022). BLIP employs a novel Multimodal Mixture of Encoder-Decoder (MED) architecture designed for efficient multi-task pre-training using a combination of loss functions. This architecture supports three distinct modes:
- Unimodal Encoder (Understanding): Encodes a single modality (image or text) into intermediate representations. Like CLIP, it uses an image-text contrastive (ITC) loss to align vision and language embeddings.
- Image-Grounded Text Encoder (Understanding): This encoder incorporates cross-attention layers to model interactions between modalities and is trained using an image-text matching (ITM) loss. This binary classification task determines whether a given image-text pair is a positive match.
- Image-Grounded Text Decoder (Generation): Replaces bidirectional self-attention with causal self-attention, enabling it to generate text (e.g., captions). This model is trained with a language modelling (LM) loss.
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a5449300fa_67934fd2368238a683656c8b_Screenshot%25202025-01-19%2520at%252023.15.01.png)
During pre-training, all three objectives are jointly optimised. A forward pass consists of a single pass through the image encoder (a vision transformer) and three separate passes through the text model, one for each loss. To keep this efficient, the text encoder and decoder share all parameters except the self-attention layers.
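The sketch below shows how the three objectives might be combined in a single training step, assuming the shared MED backbone has already produced the relevant unimodal, fused, and decoder outputs; the tensor names and shapes are illustrative, not BLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def blip_losses(image_emb, text_emb, fused_emb, match_labels,
                lm_logits, caption_ids, itm_head, temperature=0.07):
    """Combine BLIP's three objectives in one step (heavily simplified).

    Assumed inputs from the shared MED backbone:
      image_emb, text_emb    : [N, d] unimodal embeddings (ITC)
      fused_emb              : [M, d] image-grounded encoder outputs (ITM)
      match_labels           : [M]    1 = matched pair, 0 = mismatched
      lm_logits, caption_ids : [N, L, V] and [N, L] for the decoder (LM)
      itm_head               : a small classifier, e.g. nn.Linear(d, 2)
    """
    # 1) Image-text contrastive (ITC): symmetric cross-entropy, as in CLIP
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t() / temperature
    targets = torch.arange(sims.size(0), device=sims.device)
    itc = (F.cross_entropy(sims, targets) + F.cross_entropy(sims.t(), targets)) / 2

    # 2) Image-text matching (ITM): binary classification on fused features
    itm = F.cross_entropy(itm_head(fused_emb), match_labels)

    # 3) Language modelling (LM): next-token prediction for caption generation
    lm = F.cross_entropy(lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
                         caption_ids[:, 1:].reshape(-1))

    # All three losses are optimised jointly
    return itc + itm + lm
```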
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a5449300f7_67934ff2b8785d5f76cf6240_Screenshot%25202025-01-19%2520at%252023.27.05.png)
To reduce the number of noisy image-text pairs from web-crawled datasets, the authors propose a captioning + filtering (CapFilt) system. A captioner generates synthetic captions for web images, and a filter removes noisy image-text pairs (both the original web texts and the synthetically generated ones). The captioner and the filter are fine-tuned variants of the MED (the image-grounded text decoder and the image-grounded text encoder, respectively).
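As a point of reference, the fine-tuned captioner behaves much like the publicly released BLIP captioning checkpoints. The snippet below is a usage sketch with the Hugging Face transformers implementation; the checkpoint name and the input image are placeholders.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative checkpoint; other BLIP captioning checkpoints work the same way
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("web_image.jpg")  # a (possibly noisily captioned) web image

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))  # synthetic caption
```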
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a5449300fd_6793500faf76dce347ee1f22_Screenshot%25202025-01-19%2520at%252023.28.30.png)
As you may have noticed, both CLIP and BLIP train their encoders end-to-end during pre-training. As we move into the era of billion-parameter models, this is not always feasible. Wouldn't it be better to leverage off-the-shelf pre-trained vision and language models?
BLIP-2: Leveraging Off-the-Shelf Frozen Models
BLIP-2, the follow-up work by Li et al., aims to create a generic training framework that builds on off-the-shelf models for vision-language pre-training. Since the encoders themselves are kept frozen, the crux is aligning their output representations with each other, which is difficult because the LLM has never seen images during its own pre-training.
To alleviate this issue, they propose using a lightweight transformer called the Querying Transformer (Q-Former). This lightweight model is trained using a two-stage strategy and acts as an information bottleneck between the frozen image encoder and the frozen language model. In the first pre-training phase, Q-Former is trained to perform vision-language representation learning to learn which visual concepts are the most relevant for a given text. In the second phase, the model learns to perform vision-language generation by connecting the output of the Q-Former to the frozen LLM.
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a544930128_6793504b296615ddce65ed62_Screenshot%25202025-01-20%2520at%252000.08.18.png)
The Q-Former is trained to extract a fixed number of output features from the image encoder, independent of the input image resolution. It consists of two transformer sub-modules that share the same self-attention layers:
- An Image Transformer that interacts with the frozen image encoder for visual feature extraction.
- A Text Transformer that acts as both a text encoder and a text decoder.
In total, the Q-Former consists of only 188M parameters, and this bottleneck is enough to force the learned queries to extract the visual information most relevant to the text.
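To give a feel for the mechanism, here is a minimal sketch of a single Q-Former-style block: a small set of learned query tokens self-attends and then cross-attends to the frozen image features, producing a fixed number of outputs regardless of how many image patches come in. Dimensions and layer choices are illustrative, not the exact BLIP-2 configuration.

```python
import torch
import torch.nn as nn

class TinyQFormerBlock(nn.Module):
    """A single Q-Former-style block: learned queries cross-attend to frozen
    image features. Sizes are illustrative; the real Q-Former is a BERT-sized
    stack of such blocks totalling ~188M parameters."""

    def __init__(self, num_queries=32, dim=768, image_dim=1024, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=image_dim,
                                                vdim=image_dim, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, image_feats):                       # image_feats: [B, P, image_dim]
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # [B, num_queries, dim]
        q = q + self.self_attn(q, q, q)[0]                # queries interact with each other
        q = q + self.cross_attn(q, image_feats, image_feats)[0]  # pull in visual information
        return q + self.ffn(q)                            # fixed-size output: [B, num_queries, dim]

# e.g. 257 frozen ViT patch features of width 1024 -> 32 query features of width 768
out = TinyQFormerBlock()(torch.randn(2, 257, 1024))
```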
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a54493012b_67935065c469ef90ac34f87b_Screenshot%25202025-01-20%2520at%252000.11.06.png)
SigLIP: Optimising the loss function for better scaling
The loss employed during CLIP pre-training is a symmetric, softmax-based contrastive objective: it encourages the image and text embeddings of matching pairs to align while pushing apart the embeddings of mismatched pairs. The softmax is applied twice over the batch (image-to-text and text-to-image) to normalise the pairwise similarity scores. A naive implementation is numerically unstable, and the usual stabilisation requires a second pass over the whole batch, which makes the loss hard to implement in a distributed setting and couples it tightly to the global batch size. Zhai et al. 2023, in their paper “Sigmoid Loss for Language Image Pre-Training”, propose a simpler alternative that avoids computing batch-wide normalisation factors. The sigmoid-based loss turns every image-text pair into an independent binary classification problem, yielding a symmetric loss that is more stable, requires only a single pass, and uses less memory than the softmax loss. A learnable bias term counteracts the heavy imbalance between positives and negatives early in training, and the computation can be parallelised efficiently with a “chunked” approach.
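Below is a minimal sketch of the pairwise sigmoid loss, assuming embeddings from the two towers; the temperature and bias are learnable in the paper and are shown here as constants matching its suggested initialisation.

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_emb: torch.Tensor,
                text_emb: torch.Tensor,
                t: float = 10.0,
                b: float = -10.0) -> torch.Tensor:
    """Pairwise sigmoid loss (simplified). t and b are learnable in the paper;
    the defaults mirror its suggested initialisation (t = 10, b = -10)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() * t + b              # [N, N]
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1

    # Every pair is an independent binary classification; no batch-wide softmax
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```

Because each pair contributes an independent term, the loss can be computed chunk by chunk across devices without a global normalisation step.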
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a54493010c_67935082bbeedb30aededb75_Screenshot%25202025-01-20%2520at%252000.33.05.png)
SigLIP has since replaced CLIP as the most common choice of pre-trained vision-language backbone.
Domain-Specific and Data-Efficient Training
If BLIP needs three forward passes through expensive modules to deal with noisy image-text pairs, and the reliance on paired data limits CLIP, can't we eliminate the need for image-text pairs altogether? Yes, as long as some relationship between images and texts can still be established.
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a54493014a_679350ac375e4b96e2f0b30d_Screenshot%25202025-01-20%2520at%252001.27.44.png)
Wang et al. 2022 attempted to apply the CLIP framework to vision-language pre-training on medical datasets. In their paper “MedCLIP: Contrastive Learning from Unpaired Medical Images and Text”, they criticise the CLIP framework for its dependence on image-text pairs. Medical image-text datasets are orders of magnitude smaller than general-domain ones because of the extremely high annotation costs in the medical industry. Moreover, in the medical domain, many pairs treated as negatives under the standard CLIP objective are in fact false negatives that still share semantic meaning. For example, a different view of the same patient's X-ray might not share the same caption but still carries the same underlying semantics (it might show the same artefact).
![](https://cdn.prod.website-files.com/62cd5ce03261cb3e98188470/679350f5323df6a54493010f_679350c6b8785d5f76d05a21_Screenshot%25202025-01-20%2520at%252001.26.56.png)
MedCLIP decouples the dependency on image-text pairs, allowing the pre-training process to cover unpaired image and text datasets. This lets us use all existing medical image-text, image-only and text-only datasets. But how do we use image-only and text-only datasets within the CLIP paradigm?
The authors use traditional entity extraction techniques to obtain medical-entity labels for both the image-only and text-only datasets. These labels define soft targets that are used in a semantic matching loss to unify all the datasets.
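The sketch below illustrates the idea of soft targets, assuming each image and each text has been tagged with a multi-hot vector of extracted medical entities; the names, shapes, and exact form of the loss are illustrative rather than MedCLIP's precise implementation.

```python
import torch
import torch.nn.functional as F

def semantic_matching_loss(image_emb, text_emb, image_labels, text_labels, t=0.07):
    """Semantic matching with soft targets (simplified sketch).

    image_labels / text_labels: [N, K] multi-hot vectors of extracted medical
    entities (e.g. findings in a radiology report).
    """
    # Soft targets: semantic similarity between the two sets of entity labels
    sim = F.normalize(image_labels.float(), dim=-1) @ F.normalize(text_labels.float(), dim=-1).t()
    soft_targets = sim.softmax(dim=-1)                     # each row sums to 1

    # Predicted image-text similarity distribution in embedding space
    logits = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t() / t
    log_probs = logits.log_softmax(dim=-1)

    # Cross-entropy against soft targets instead of a hard diagonal
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```

Pairs that share entities (for example, two X-rays with the same finding) receive non-zero target weight instead of being pushed apart as hard negatives.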
Conclusion
In this article, we covered some of the early advancements in multimodal learning, particularly the CLIP family of models. These models used natural language supervision at scale to learn open-ended visual representations, allowing us to move away from predefined classification models toward more flexible, zero-shot capable models that understand visual concepts through natural language descriptions.
BLIP demonstrated that understanding and generation can be combined in a single architecture with superior results, while BLIP-2 efficiently leveraged existing off-the-shelf pre-trained (frozen) models through a new lightweight transformer, the Q-Former.
References
- Learning Transferable Visual Models From Natural Language Supervision. Radford et al., 2021
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. Li et al., 2022
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Li et al., 2023
- Sigmoid Loss for Language Image Pre-Training. Zhai et al., 2023
- MedCLIP: Contrastive Learning from Unpaired Medical Images and Text. Wang et al., 2022