A Brief Introduction to Vision Language Models

Text foundation models pre-trained on large, unlabeled, web-scraped datasets have recently become popular. At the same time, strong vision models have emerged that perform well on tasks such as segmentation, detection, and classification while generalising well and adapting to new datasets and tasks. However, there is also great value in jointly training a model on both vision and text data. The success of LLMs and vision models led to a flurry of research in Vision Language Models (VLMs), exemplified by DALL-E 2 (Ramesh et al., 2022) and Flamingo (Alayrac et al., 2022), and to the development and release of new tasks and datasets for evaluation.

As of August 2024, breakthroughs like GPT-4 (OpenAI, 2023), PaLM 2 (Google, 2023), and LLaVA (Liu et al., 2023) continue to push boundaries, spurring the development of novel evaluation tasks and datasets. This convergence of vision and language promises transformative applications across industries, from advanced image generation to intuitive human-computer interaction. This article covers some fundamental VLM architectures leading up to current SOTA techniques and ideas.

Figure: Overview of Recent VLM Advancements

Quick Overview

Vision Language Models (VLMs) take images and text as inputs and output text. The success of VLMs relies on two prior developments:

  • Performant Large Language Models
  • Performant Vision Encoders

By smartly combining these unimodal pre-trained models, we can leverage each model's representation learning capabilities to create performant VLMs on multimodal tasks.

✂️ CLIP and BLIP: Contrastive Training

One of the first ideas in Vision-Language Modeling was proposed in CLIP (Contrastive Language-Image Pre-training) by Radford et al. (2021). The authors questioned whether scalable pre-training methods that learn directly from web text could result in a breakthrough for vision akin to language modelling.

CLIP aimed to study the behaviours of image classifiers trained with natural language supervision at a large scale.

Figure: Overview of the CLIP Architecture. Source: CLIP Radford et al. (2021)

The core idea behind CLIP is to learn perception from the supervision contained in natural language. Natural language supervision has an added advantage over most unsupervised or self-supervised learning approaches: the model doesn't "just" learn a representation but also connects that representation to language, enabling flexible zero-shot transfer.

While most vision models jointly train an image feature extractor and a linear classifier to predict some label, CLIP trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. The model learns a multimodal embedding space by jointly training an image encoder and text encoder to maximise the cosine similarity of the image and text embeddings of the real pairs in the batch while minimising the cosine similarity of the embeddings of the incorrect pairings.

A key point to note here is that CLIP uses a contrastive objective, not a predictive one. That is, the model doesn't try to predict the exact words of the text accompanying each image. This choice of learning paradigm is based on the success of contrastive representation learning over equivalent predictive objectives.
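The CLIP paper itself includes numpy-style pseudocode for this objective; below is a minimal PyTorch sketch of a symmetric contrastive loss in the same spirit. The function name, embedding dimensionality, and fixed temperature are illustrative choices (CLIP actually learns its temperature during training).

```python
# A minimal sketch of a CLIP-style symmetric contrastive objective,
# operating on pre-computed image and text embeddings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    # L2-normalise so the dot product equals cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise cosine similarities, scaled by a temperature.
    logits = image_embeds @ text_embeds.t() / temperature  # [N, N]

    # The matching pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random "embeddings" for a batch of 8 pairs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```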

The authors constructed a dataset of 400M image-text pairs and benchmarked CLIP's zero-shot transfer performance on over 30 existing datasets, finding it competitive with prior task-specific supervised models.

CLIP has had a significant impact on the field of Vision Language Models. It has been used to curate datasets for text-to-image generation models and to rank generated images. It's also a critical building block for Vision Language Models, since many of them use frozen CLIP encoders to generate latent representations for images.
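Because zero-shot classification is the most common entry point to CLIP, here is a hedged usage example with the Hugging Face transformers library; the checkpoint id is an assumption (it must be available for download), and the blank image stands in for a real photo to keep the snippet self-contained.

```python
# Hedged example of CLIP-style zero-shot classification via the
# `transformers` library (assumes it is installed and the checkpoint
# name below can be fetched).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"  # assumed checkpoint id
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.new("RGB", (224, 224))  # stand-in for a real photo
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-to-text similarity scores; softmax turns them into label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```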

Figure: Architecture and Pre-Training Objectives of BLIP. Source: Li et al. 2022

Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (BLIP) by Li et al. (2022) proposed a multimodal mixture of encoder-decoder (MED) architecture for effective multi-task pre-training and flexible transfer learning. The MED can operate as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder. Moreover, BLIP was jointly pre-trained with three vision-language objectives: image-text contrastive learning, image-text matching, and image-conditioned language modelling.

Jointly optimising these objectives (a mixture of understanding and generative tasks) enables the model to achieve SOTA performance on a variety of vision-language tasks.

  • Image-Text Contrastive Loss (Unimodal Encoder): Aligns the feature spaces of the vision encoder and the text encoder by encouraging positive image-text pairs to have similar representations and negative pairs to have dissimilar ones.
  • Image-Text Matching Loss (Image-Grounded Text Encoder): Learns image-text multimodal representation that captures the fine-grained alignment between vision and language. This is a binary classification task wherein the model predicts whether the given image and text pair match.
  • Language Modeling Loss (Image-Grounded Text Decoder): The model learns to generate textual descriptions given an image autoregressively. This enables the model to generalise and convert visual information into coherent captions.

The text encoder and text decoder share all parameters except the self-attention layers, enabling efficient pre-training while leveraging multi-task learning.
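To make the multi-objective setup concrete, here is a toy sketch that jointly optimises contrastive, matching, and language-modelling losses. The module names, the fusion by addition, and the shapes are stand-ins, not BLIP's actual MED implementation.

```python
# A minimal sketch of BLIP-style multi-objective training with stand-in modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBLIP(nn.Module):
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.img_proj = nn.Linear(512, dim)     # stand-in image encoder head
        self.txt_proj = nn.Linear(512, dim)     # stand-in text encoder head
        self.itm_head = nn.Linear(dim, 2)       # match / no-match classifier
        self.lm_head = nn.Linear(dim, vocab)    # caption token predictor

    def forward(self, img_feats, txt_feats, caption_ids):
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)

        # 1) Image-text contrastive (ITC): align the two embedding spaces.
        logits = img @ txt.t() / 0.07
        targets = torch.arange(len(img))
        itc = (F.cross_entropy(logits, targets) +
               F.cross_entropy(logits.t(), targets)) / 2

        # 2) Image-text matching (ITM): binary classification on fused features
        #    (crudely fused by addition here; BLIP fuses via cross-attention
        #    and also mines hard negatives, omitted in this sketch).
        fused = self.itm_head(img + txt)
        itm = F.cross_entropy(fused, torch.ones(len(img), dtype=torch.long))

        # 3) Language modelling (LM): predict caption tokens given the image
        #    (a single-step toy prediction instead of autoregression).
        lm_logits = self.lm_head(img + txt)
        lm = F.cross_entropy(lm_logits, caption_ids)

        return itc + itm + lm  # jointly optimised

model = ToyBLIP()
loss = model(torch.randn(4, 512), torch.randn(4, 512), torch.randint(0, 1000, (4,)))
loss.backward()
print(loss.item())
```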

Moving away from Joint Pre-Training

Since Vision Language Models combine vision and language models, it's natural to assume that they could leverage pre-trained unimodal models from each domain without requiring a joint pre-training strategy. This would make pre-training more efficient, since we could focus solely on vision-language alignment.

A problem with this approach is that LLMs have not seen image data during their pre-training. Several methods have been proposed to bridge this "modality gap". In this article, we'll discuss BLIP-2, Frozen, and Flamingo.

BLIP-2: Bootstrapping Pre-trained Vision and Language Models

Li et al. (2023), in their paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, propose to solve this problem by efficiently learning relevant visual tokens using an attention-based approach.

Figure: Overview of the BLIP 2 Framework. Source: Li et al. (2023)

In particular, they break VLM Pre-Training into two steps:

  • Vision-Language Representation Learning: The authors propose a lightweight "Querying" Transformer (Q-Former) with a set of learnable query vectors that extract visual features from the frozen image encoder. This module serves as an information bottleneck between the frozen image encoder and the frozen LLM, forwarding only the most relevant visual information to the LLM. The queries interact with each other through shared self-attention layers and with the frozen image features through cross-attention, producing a fixed number of output features regardless of the input image resolution.

Figure: Overview of the Querying Transformer Module from BLIP-2. Source: Li et al. (2023)

  • Vision-Language Generative Learning: Similar to the BLIP framework above, the overall model is trained with multiple pre-training objectives, each suited to a different mode of operation and intended use.

Since the Querying Transformer has been pre-trained to extract only the most relevant information from images, it reduces the burden on the LLM to learn vision-language alignment, mitigating the catastrophic forgetting problem.
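A rough sketch of the querying idea is shown below: a fixed set of learnable queries cross-attends to frozen image features and always emits the same number of visual tokens. All names, dimensions, and layer counts are illustrative; the real Q-Former is initialised from BERT weights and also shares its self-attention layers with a text stream.

```python
# A hedged sketch of BLIP-2-style learnable queries acting as an
# information bottleneck over frozen image features.
import torch
import torch.nn as nn

class ToyQueryingTransformer(nn.Module):
    def __init__(self, num_queries=32, dim=768, llm_dim=4096, num_layers=2):
        super().__init__()
        # A fixed set of learnable query embeddings.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(num_layers)
        ])
        # Projects the query outputs into the frozen LLM's embedding space.
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, image_features):             # [B, num_patches, dim]
        x = self.queries.expand(image_features.size(0), -1, -1)
        for layer in self.layers:
            # Queries self-attend to each other and cross-attend to the
            # (frozen) image features, however many patches there are.
            x = layer(tgt=x, memory=image_features)
        return self.to_llm(x)                       # [B, num_queries, llm_dim]

# Any number of image patches in, always 32 visual tokens out.
qformer = ToyQueryingTransformer()
feats = torch.randn(2, 257, 768)   # e.g. a frozen ViT's patch features
print(qformer(feats).shape)        # torch.Size([2, 32, 4096])
```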

🧊 Frozen: Extending LLMs with Visual Prompting

Tsimpoukelli et al. introduce Frozen in Multimodal Few-Shot Learning with Frozen Language Models (2021), a method for giving a pre-trained language model access to visual information in a way that extends its few-shot learning capabilities to a multimodal setting without changing its weights.

Frozen consists of a neural network trained to encode images into the word embedding space of a large pre-trained language model such that the language model generates captions for those images. The weights of the language model are kept frozen, but gradients are back-propagated through it to train the image encoder from scratch.

Since it builds on a pre-trained language model, Frozen exhibits non-trivial zero-shot performance on multimodal tasks it was not trained on, such as visual question answering (VQA). Frozen is thus a multimodal few-shot learner, bringing the rapid task adaptation enabled by prompting from the language-only setting to a multimodal one.

Figure: Showcasing the ability of Frozen to generate open-ended outputs that adapt to both images and text and use facts that it learned during language-only pre-training. Source: Tsimpoukelli et al. (2021)

The authors refer to Frozen as a system for genuinely open-ended and unconstrained linguistic interpretation of images that often produces compelling output.

Figure: During Training, gradients are propagated through a frozen language model's self-attention layers and then used to train the vision encoder. Source: Tsimpoukelli et al. (2021)

The Frozen architecture consists of a pre-trained language model and a vision encoder (a variant of NF-ResNet-50). A raw image is turned into a sequence the transformer can consume by linearly mapping the vision encoder's output and reshaping the result into a sequence of embeddings, each with the same dimensionality as the language model's token embeddings. The authors call this sequence a visual prefix, since it plays the same functional role in the transformer architecture as (part of) an embedding sequence of prefix tokens. During training, only the parameters of the vision encoder are updated, using paired image-caption data. This makes the system modular, since it reuses an existing language model, and simple, since only a visual encoder needs to be trained on top of the existing language model's capabilities.
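A minimal sketch of the visual-prefix mapping might look like the following; the feature dimensions and the two-token prefix length are illustrative, and the frozen language model itself is omitted.

```python
# A minimal sketch of a Frozen-style "visual prefix": vision features are
# linearly mapped and reshaped into a short sequence of embeddings in the
# language model's embedding space.
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    def __init__(self, vision_dim=2048, lm_dim=4096, prefix_len=2):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        # Trainable linear map; the language model itself stays frozen.
        self.proj = nn.Linear(vision_dim, prefix_len * lm_dim)

    def forward(self, vision_features):               # [B, vision_dim]
        x = self.proj(vision_features)                 # [B, prefix_len * lm_dim]
        return x.view(-1, self.prefix_len, self.lm_dim)

prefix = VisualPrefix()
img_feat = torch.randn(4, 2048)                        # pooled CNN features
visual_prefix = prefix(img_feat)                       # [4, 2, 4096]
# These embeddings are prepended to the caption's token embeddings; gradients
# flow through the frozen LM to train only this projection and the encoder.
print(visual_prefix.shape)
```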

While Frozen provides an excellent framework for prompting VLMs, its performance alone doesn't surpass SOTA. It nevertheless serves as a proof of concept that the knowledge in transformer language models can transfer to non-linguistic tasks.

🦩 Flamingo: Few-Shot Learning VLMs

Prompting has emerged as a critical technique for using foundation models. Alayrac et al. (2022) introduced Flamingo for few-shot learning on a wide range of open-ended vision and language tasks, simply by prompting the model with a few input/output examples.

Flamingo models ingest a multimodal prompt containing images and/or videos interleaved with text. They are visually conditioned autoregressive text generation models that produce text as output.

Figure: Overview of the Flamingo architecture. Source: Alayrac et al. (2022)

  • Pixel Space to Feature Space: A vision encoder produces a latent representation of a given image.
  • Feature Maps to Visual Tokens: These features are fed into a "Perceiver Resampler" that bridges the vision encoder and the language model. This module takes a variable number of visual features and outputs a fixed number of visual tokens for the language model.
  • Frozen, pre-trained text-only LM blocks are interleaved with gated cross-attention dense blocks, trained from scratch, that cross-attend to the visual tokens produced by the Perceiver Resampler.

This leads to an image-causal modelling scheme: the full text-to-image cross-attention matrix is masked to control which visual tokens the model sees at each text token. At a given text token, the model attends to the visual tokens of the image that appeared just before it in the interleaved sequence, rather than to all previous images. Though the model only directly attends to one image at a time, a dependency on all previous images remains via self-attention in the language model.
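The sketch below illustrates the gated cross-attention idea, with tanh gates initialised at zero so the frozen LM's original behaviour is preserved at the start of training. The per-image attention masking described above is omitted, and the dimensions and module layout are illustrative rather than Flamingo's exact implementation.

```python
# A hedged sketch of a Flamingo-style gated cross-attention block inserted
# before a frozen LM block.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim=1024, nhead=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Gates initialised to zero: tanh(0) = 0, so at the start of training
        # the block is a no-op and the frozen LM's behaviour is unchanged.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_tokens):
        # Text hidden states attend to the visual tokens coming from the
        # Perceiver Resampler (per-image masking omitted in this sketch).
        attn_out, _ = self.cross_attn(query=text_hidden, key=visual_tokens,
                                      value=visual_tokens)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x  # then passed to the frozen LM self-attention block

block = GatedCrossAttentionBlock()
text = torch.randn(2, 16, 1024)      # token hidden states
visual = torch.randn(2, 64, 1024)    # fixed-size output of the resampler
print(block(text, visual).shape)     # torch.Size([2, 16, 1024])
```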

Florence: Building Foundational Models for Vision

Bommasani et al. (2021) introduced the term foundation model to refer to any model trained on broad data at scale that can be adapted to a wide range of downstream tasks.

Figure: The Florence Project. Source: Yuan et al. (2021)

Florence: A New Foundation Model for Computer Vision by Yuan et al. (2021) aimed to create a foundation model for vision spanning Space (from coarse, scene-level understanding to fine-grained tasks such as object detection), Time (from static images to dynamic video), and Modality (from simple RGB images to videos and multi-channel images), with transferability to downstream settings such as zero-/few-shot learning and full fine-tuning.

Florence drops an assumption made by traditional methods like CLIP: that each image-text pair has its own unique caption, which allows all other captions to be treated as negative examples. This assumption becomes limiting when scaling the pre-training dataset, since in web-scale data multiple images can be associated with identical captions.

To achieve this, they employ a unified image-text contrastive learning (UniCL) paradigm in which the model is trained in an image-label-description space. In particular, given an image-text pair, they generate a triplet (x, t, y) via a text hash table, where x is the image, t is the language description (the hash value), and y is the language label (the hash key) indicating the index of the unique language description in the dataset. Identical language descriptions therefore map to the same hash key, i.e., language label, and all image-text pairs mapped to the same label y are regarded as positives in the unified image-text contrastive learning.

This allows them to unify two fundamental paradigms:

  • Supervised learning, which maps images to labels to learn discriminative representations, and
  • Contrastive learning, which assigns each unique description its own label for language-image pre-training.
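A minimal sketch of how such (image, description, label) triplets could be constructed follows; the hash-table logic and the toy caption data are illustrative assumptions.

```python
# A minimal sketch of UniCL-style (x, t, y) triplet construction: identical
# captions hash to the same label, so all images sharing a caption count as
# positives for one another.
from collections import OrderedDict

def build_triplets(image_text_pairs):
    """Map each unique description to an integer label y, yielding (x, t, y)."""
    hash_table = OrderedDict()   # description -> label index
    triplets = []
    for image, text in image_text_pairs:
        if text not in hash_table:
            hash_table[text] = len(hash_table)
        triplets.append((image, text, hash_table[text]))
    return triplets, hash_table

pairs = [
    ("img_001.jpg", "a dog playing in the park"),
    ("img_002.jpg", "a dog playing in the park"),   # duplicate caption
    ("img_003.jpg", "a red sports car"),
]
triplets, table = build_triplets(pairs)
for x, t, y in triplets:
    print(x, "->", y)
# img_001.jpg and img_002.jpg share label 0, so they are mutual positives
# in the unified image-text contrastive loss.
```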

However, there are two fundamental limitations to the Florence Framework.

  • The scarcity of comprehensive visual annotations across spatial hierarchy (coarse vs. fine-grained, static vs. dynamic) and semantic granularity (varying levels of text description).
  • The lack of a unified framework capable of adapting to various vision tasks in a task-agnostic manner, and even of accommodating new tasks with minimal or no task-specific fine-tuning.

To address these issues, Xiao et al. (2023) released Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks, which proposes a universal backbone achieved through multitask learning with extensive visual annotations.

Figure: Overview of Florence 2. Source: Xiao et al. (2023)

To develop a “universal” model capable of performing a range of tasks, the authors pre-train on several tasks across multiple granularity levels, such as Image-level understanding via image classification, captioning, and visual question answering; Region/pixel-level recognition via object detection, segmentation, and referring expression comprehension; and Fine-grained visual-semantic alignment.

Moreover, Florence-2 unifies all the above-mentioned tasks under a single sequence-to-sequence language modelling objective: a vision encoder converts images into visual token embeddings, which are concatenated with text embeddings and processed by a transformer-based multi-modal encoder-decoder to generate the response.

During training and inference, the model is prompted with task descriptions. If the prompt is plain text, such as "What does the image describe?", no special formatting is applied. However, for region-specific tasks, localisation tokens representing quantized coordinates are added to the prompt. This lets the model handle region-specific tasks under a language modelling paradigm and eliminates the need for task-specific adapters.
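For intuition, here is a hedged sketch of how a bounding box might be turned into quantized location tokens appended to a text prompt; the token format, bin count, and prompt wording are assumptions for illustration, not Florence-2's exact vocabulary.

```python
# A hedged sketch of expressing a region prompt as quantized "location tokens".
def box_to_location_tokens(box, image_size, num_bins=1000):
    """Quantize an (x1, y1, x2, y2) box into discrete location tokens."""
    width, height = image_size
    tokens = []
    for value, extent in zip(box, (width, height, width, height)):
        bin_index = min(int(value / extent * num_bins), num_bins - 1)
        tokens.append(f"<loc_{bin_index}>")
    return "".join(tokens)

prompt = "What is in this region?"                 # plain-text task prompt
region = box_to_location_tokens((120, 48, 640, 512), image_size=(1024, 768))
print(prompt + region)
# -> What is in this region?<loc_117><loc_62><loc_625><loc_666>
```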

LLaVA: Instruction Tuning to generate Multimodal Models

Most of the models we have discussed so far limit the use of language to describe only the image content. While this allows us to map visual signals to language semantics, it leads to models that usually have a fixed interface with limited interactivity. More importantly, they can’t adapt to the user’s instructions.

On the other hand, Large Language Models (LLMs) have shown that language can be a universal interface for a general-purpose assistant. Moreover, recent works have used machine-generated high-quality instruction-following samples to improve the LLM’s alignment ability, reporting impressive performance compared with proprietary LLMs.

Following this, the authors of Visual Instruction Tuning (Liu et al., 2023) extend instruction tuning to the language-image multimodal space, a technique they call visual instruction tuning.

  • Visual Instruction Data Generation: The authors leverage language-only GPT-4 or ChatGPT to create instruction-following data involving visual content. In particular, they represent each image symbolically through its captions (describing the visual scene from various perspectives) and bounding boxes (localizing the objects in the scene, with each box encoding the object concept and its spatial location), and prompt the text-only LLM with these representations to generate instruction-following conversations.

Figure: LLaVA Model Architecture. Source: Liu et al. (2023)

  • LLaVA Model Architecture: A key idea in all Vision Language Models is to effectively leverage the capabilities of both the pre-trained LLM and the visual model. The authors use Vicuna as the LLM because it has the best instruction-following capabilities on language tasks among publicly available checkpoints. They use a pre-trained CLIP visual encoder and a simple linear layer to project image features into the word embedding space, as sketched below.
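A minimal sketch of this connector follows, covering both the original linear projection and the two-layer MLP later used in LLaVA-1.5; the feature dimensions and patch count are illustrative.

```python
# A minimal sketch of a LLaVA-style connector: CLIP patch features are
# projected into the LLM's word-embedding space and concatenated with the
# text embeddings.
import torch
import torch.nn as nn

clip_dim, llm_dim = 1024, 4096

# LLaVA:     a single trainable linear projection.
linear_connector = nn.Linear(clip_dim, llm_dim)

# LLaVA-1.5: a two-layer MLP with a GELU in between.
mlp_connector = nn.Sequential(
    nn.Linear(clip_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
)

image_patches = torch.randn(1, 576, clip_dim)   # e.g. CLIP ViT patch features
text_embeds = torch.randn(1, 32, llm_dim)       # embedded prompt tokens

visual_tokens = mlp_connector(image_patches)    # [1, 576, 4096]
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_input.shape)                          # torch.Size([1, 608, 4096])
```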

Although LLaVA is trained with a small multimodal instruction-following dataset (∼80K unique images), it demonstrates reasoning results similar to multimodal GPT-4. The LLaVA framework is also highly efficient: empirically, pre-training on the CC-595K dataset completes within 4 hours, finetuning on Instruct-158K within 10 hours, and finetuning on the ScienceQA dataset within 4 hours.

In their follow-up work, LLaVA-1.5, Liu et al. (2023) show that the fully connected vision-language connector is surprisingly powerful with simple modifications. Their newer 13B checkpoint uses merely 1.2M publicly available samples and finishes full training in ∼1 day on a single 8xA100 node. These advancements can be attributed to the use of:

  • a two-layer MLP as the vision-language connector, and
  • additional academic-task-oriented VQA datasets.

To accommodate bigger images, thereby allowing the LLM to clearly "see" image details, they swap the encoder for the higher-resolution CLIP ViT-L-336px encoder.

Recently, Zhang et al. (2024) also released LLaVA-NeXT, which handles even bigger images, with an input resolution increased to ~4x more pixels, allowing it to grasp more visual detail. LLaVA-NeXT supports three aspect ratios, up to 672x672, 336x1344, and 1344x336 resolution. It also offers better visual reasoning and OCR capability, since it was trained with an improved visual instruction-tuning data mixture. LLaVA-NeXT reuses the pre-trained connector of LLaVA-1.5 and uses less than 1M visual instruction tuning samples during training.

Figure: “AnyRes” technique. Source: Liu et al.

The authors use a dynamic strategy to accommodate images of various resolutions: the image is divided into smaller patches at the resolution the vision encoder was originally trained on, and each patch is encoded independently. The resulting features are then combined into a single large feature map of the target resolution and fed to the LLM.
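Below is a hedged sketch of this kind of tiling; the grid choice, base resolution, and helper name are assumptions, and details such as aspect-ratio selection and the downsampled overview image that LLaVA-NeXT also appends are omitted.

```python
# A hedged sketch of an "AnyRes"-style strategy: split a high-resolution
# image into tiles at the vision encoder's base resolution and encode each
# tile independently.
from PIL import Image

def split_into_tiles(image, base_resolution=336, grid=(2, 2)):
    """Resize to a grid of base-resolution tiles and return them."""
    cols, rows = grid
    resized = image.resize((base_resolution * cols, base_resolution * rows))
    tiles = []
    for r in range(rows):
        for c in range(cols):
            left, top = c * base_resolution, r * base_resolution
            tiles.append(resized.crop(
                (left, top, left + base_resolution, top + base_resolution)))
    return tiles

image = Image.new("RGB", (1344, 1344))          # stand-in high-res input
tiles = split_into_tiles(image)                  # four 336x336 tiles
print([t.size for t in tiles])
# Each tile is encoded by the vision encoder separately, then the features
# are merged into one long visual token sequence for the LLM.
```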

Figure: The “AnyRes” technique can take images as a sequence of concatenated visual tokens, allowing unified image and video input. This naturally supports the evolution from multi-image to multi-frame. Source: Zhang et al.

Their latest release, LLaVA-OneVision (Li et al., 2024), is a family of open large multimodal models (LMMs) that push the performance boundaries of open-source LMMs across three crucial vision settings: single-image, multi-image, and video.

Figure: Overview of the LLaVA-OneVision Framework. Source: Li et al. (2024)

The authors show that simply scaling up the LLM achieves performance comparable to GPT-4V on selected benchmarks. LLaVA-OneVision also employs a new Higher AnyRes strategy as a flexible visual representation framework adaptable to multi-image and video inputs.

Conclusion

This article overviewed the recent developments in the field of Vision Language Models. From early contrastive learning approaches like CLIP to more advanced models like Flamingo and LLaVA, these systems are increasingly capable of tasks like image captioning, visual question answering, and following complex instructions involving visual content.