Efficient Training for Multimodal Vision Models: Techniques and Trade-offs
Recently, Vision Language Models (VLMs) have grown in popularity and industry adoption. This has largely been driven by the availability of open LLMs, which let researchers build VLMs on top of unimodal pre-trained models. However, the literature reveals a wide range of disparate design choices.
This article examines the key techniques and trade-offs in effectively training multimodal vision models, focusing on the trade-offs between model performance, training time, and resource utilisation.
Multimodal Design Choices
Most early works on multimodal models were inspired by LLMs, which have been shown to be good few-shot learners through prompting. Flamingo (Alayrac et al., 2022) was one of the first methods to show that image and video tasks can be cast as text prediction problems conditioned on visual input, provided the model can ingest a multimodal prompt in which images and/or videos are interleaved with text. Flamingo could handle high-resolution images or videos thanks to a Perceiver-based architecture that produces a small, fixed number of visual tokens per image or video from a large and variable number of visual input features. This made Flamingo a natural fit for in-context few-shot learning.
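To make the resampling idea concrete, here is a minimal sketch of a Perceiver-style resampler in PyTorch. It is a simplified, single-layer approximation rather than Flamingo's actual implementation: a fixed set of learned latent queries cross-attends to a variable-length sequence of visual features and always returns the same number of visual tokens.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compress a variable number of visual features into a fixed set of tokens (simplified)."""
    def __init__(self, dim: int = 1024, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        # Learned latent queries: the number of output tokens is fixed by num_latents.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, n_patches, dim), where n_patches can vary per image/video.
        b = visual_feats.size(0)
        latents = self.latents.unsqueeze(0).repeat(b, 1, 1)
        # Latents act as queries; the visual features supply keys and values.
        attended, _ = self.cross_attn(latents, visual_feats, visual_feats)
        return attended + self.ff(attended)  # (batch, num_latents, dim): fixed-size output

# Example: 728 patch features in, 64 visual tokens out.
tokens = PerceiverResampler()(torch.randn(2, 728, 1024))
print(tokens.shape)  # torch.Size([2, 64, 1024])
```

Because the output length is constant, interleaving many images with text in a single prompt stays cheap, which is what makes the in-context few-shot setting practical.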
Since the introduction of Frozen (Tsimpoukelli et al., 2021) and Flamingo (Alayrac et al., 2022), most VLMs have been built on top of unimodal pre-trained backbones rather than trained entirely from scratch. Training a VLM therefore usually involves initialising new parameters that connect a pre-trained text backbone and a pre-trained vision backbone; these parameters are then tuned during the pre-training phase.
However, because these models are built on top of pre-trained LMs, they also inherit their weaknesses, such as hallucinations and poor generalisation to long sequence lengths.
Cross-Attention
Alayrac et al. (2022) introduced the cross-attention architecture, in which the image hidden states encoded by the vision backbone condition a frozen language model through freshly initialised cross-attention layers interleaved between the pre-trained language model layers. The keys and values in these layers are obtained from the vision features, while the queries are derived from the language inputs. Because the language model stays frozen and is only conditioned on visual features, the resulting model lends itself to in-context learning, which has significant practical advantages over gradient-based few-shot learning methods.
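The sketch below shows, in simplified PyTorch, how such an interleaved layer can be wired (Flamingo's actual gated xattn-dense blocks are more elaborate): queries come from the language hidden states, keys and values come from the vision features, and a tanh gate initialised at zero keeps the frozen language model's behaviour unchanged at the start of training.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Freshly initialised layer inserted between frozen LM layers (simplified)."""
    def __init__(self, dim: int = 2048, num_heads: int = 16):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Tanh gate initialised at 0 so the block is a no-op before training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, dim) from the frozen language model.
        # vision_feats: (batch, n_visual_tokens, dim) from the vision backbone / resampler.
        q = self.norm(text_hidden)
        attended, _ = self.cross_attn(q, vision_feats, vision_feats)  # Q = text, K = V = vision
        return text_hidden + torch.tanh(self.gate) * attended

block = GatedCrossAttentionBlock()
out = block(torch.randn(2, 32, 2048), torch.randn(2, 64, 2048))
print(out.shape)  # torch.Size([2, 32, 2048])
```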
In practice, for cross-attention-based models, replacing either backbone with a better one of its own modality (at a fixed backbone size) boosts performance. However, under a given parameter budget, upgrading the language model yields the largest improvement for the combined system.
Self-Attention
In the self-attention (fully autoregressive) architecture introduced by FROMAGe (Koh et al., 2023) and BLIP-2 (Li et al., 2023), the output of the vision encoder is treated as a sequence of tokens and concatenated with the text tokens, and the combined sequence is passed as input to the language model. The layers that map the vision hidden space to the text hidden space are known as modality projection layers.
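A minimal sketch of this wiring, assuming a single linear layer as the modality projection (real models often use an MLP or a resampler instead):

```python
import torch
import torch.nn as nn

class ModalityProjection(nn.Module):
    """Map vision-encoder features into the language model's embedding space."""
    def __init__(self, vision_dim: int = 1024, text_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_feats)  # (batch, n_visual_tokens, text_dim)

# Toy dimensions; in practice the backbones define these sizes.
projector = ModalityProjection()
vision_feats = torch.randn(2, 64, 1024)   # output of the vision encoder
text_embeds = torch.randn(2, 32, 2048)    # embedded text tokens

# The language model sees one interleaved sequence of visual and text tokens.
inputs_embeds = torch.cat([projector(vision_feats), text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([2, 96, 2048])
```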
The Idefics model family is a prominent class of VLMs that use a fully autoregressive architecture. Laurençon et al. (2024) report that when the unimodal backbones and the new parameters are trained together, the loss often diverges, leading to unstable training runs. Using LoRA to adapt the parameters of the unimodal backbones while applying standard fine-tuning to the new parameters yields more stable training. They also find that with frozen backbones, cross-attention-based models outperform fully autoregressive ones, but the fully autoregressive architecture pulls ahead once the backbones are given more degrees of freedom. Crucially, LoRA adaptation costs a fraction of the GPU budget of full pre-training and can be merged back into the backbone weights at no additional inference cost.
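The following is a minimal, hand-rolled sketch of the LoRA idea (not the implementation used for Idefics): the pre-trained linear weight stays frozen and only a low-rank update is learned on top of it, while newly initialised modules such as the modality projection are trained with standard fine-tuning. Because the update is itself a linear map, it can be merged into the frozen weight after training, which is why there is no extra inference cost.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # backbone weight stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)           # the update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Only the low-rank factors are trained; newly initialised parameters
# (e.g. a modality projection) would remain fully trainable alongside them.
layer = LoRALinear(nn.Linear(2048, 2048))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")
```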
Boosting Model Performance
Over the years, many papers have analysed model training and identified tips and tricks that improve model performance. Let's look at some of them:
- Because vision encoders are often trained on different datasets and optimised for various tasks, some models, like SPHINX (Lin et al., 2023), combine representations from multiple encoders, such as DINOv2 (Oquab et al., 2023) and CLIP (Radford et al., 2021), to create a richer sequence of visual embeddings. However, this comes at the expense of computational efficiency.
- Li et al. (2022), inspired by the sparse computation of Masked Auto-encoders (MAEs), propose to randomly remove a large portion of image patches during CLIP-based contrastive image-text pre-training. This allows models to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with a similar memory footprint.
- Sun et al. (2023) use pre-trained EVA models, which combine the high-level semantics of image-text contrastive learning with the geometric and structural information captured by masked image modelling, to improve feature representations and speed up the convergence of CLIP models.
- Chen and Wang (2022) report a stronger increase in performance from scaling the size of the vision encoder than from scaling the size of the language model, even though scaling the vision encoder adds fewer parameters.
- Vision encoders are typically trained on fixed-size square images. Resizing an image before encoding changes its aspect ratio and resolution, which degrades downstream performance (for example, distorting the aspect ratio of an image containing text hurts VQA). Therefore, Laurençon et al. (2024) interpolate the pre-trained positional embeddings to support higher resolutions (sketched below) and adapt the vision encoder to these changes with LoRA parameters.
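As a concrete illustration of the last point, here is a minimal sketch of interpolating pre-trained positional embeddings to a larger patch grid, assuming a ViT-style encoder with a square grid and no class token (a class-token embedding would be handled separately):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize (1, old_grid**2, dim) positional embeddings to a new patch grid."""
    _, n, dim = pos_embed.shape
    old_grid = int(n ** 0.5)
    # (1, N, dim) -> (1, dim, old_grid, old_grid) so spatial interpolation can be used.
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# e.g. 224px images with 14px patches give a 16x16 grid; 448px images give 32x32.
pos = torch.randn(1, 16 * 16, 1024)
print(interpolate_pos_embed(pos, 32).shape)  # torch.Size([1, 1024, 1024]): 32*32 tokens, dim 1024
```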
Training Multimodal Models
Multimodal training typically occurs in multiple stages, for the following reasons:
- Limited availability of high-quality data
- Memory constraints for efficient training
- Stability issues
During these stages, progressively higher-quality data is introduced, the maximum image resolution is gradually increased, and more model parts are unfrozen.
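The exact recipe differs between models, but the staged schedule can be captured in a simple configuration. The stage names, resolutions, data mixes, and unfreezing choices below are illustrative assumptions, not a published training recipe.

```python
from dataclasses import dataclass

@dataclass
class TrainingStage:
    name: str
    image_resolution: int          # maximum image side length in pixels
    data_mix: list[str]            # dataset families introduced at this stage
    trainable_modules: list[str]   # parts of the model that are unfrozen

# Illustrative schedule: resolution, data quality, and the number of
# unfrozen modules all increase as training progresses.
stages = [
    TrainingStage("pretrain_low_res", 384,
                  ["image-text pairs", "interleaved web documents"],
                  ["modality_projection"]),
    TrainingStage("pretrain_high_res", 768,
                  ["image-text pairs", "PDFs and document images"],
                  ["modality_projection", "vision_lora", "llm_lora"]),
    TrainingStage("sft", 980,
                  ["task-specific instruction mixtures"],
                  ["modality_projection", "vision_lora", "llm_lora"]),
]

for stage in stages:
    print(f"{stage.name}: {stage.image_resolution}px, unfrozen={stage.trainable_modules}")
```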
1. Pre-Training
The primary goal of pre-training is to align the backbone models and train the newly initialised parameters in the model. To efficiently train on a large number of images, the image resolution is typically kept low at the start of training and gradually increased over time. Once the resolution is sufficiently high, datasets containing large images, such as PDFs, can be incorporated into the training data.
2. Supervised Fine-Tuning (SFT)
Having trained a general-purpose vision-language representation during the pre-training phase, we then perform supervised fine-tuning to adapt the model to a range of downstream tasks.
3. Alignment
But why align the model further after SFT? Alignment:
- Aligns the model's output with human preferences, making it more intuitive and better at following complex instructions.
- Effectively reduces hallucinations, where the model might describe objects or details not actually present in the image.
- Enhances model safety by minimising the risk of generating harmful content.
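No specific alignment algorithm is prescribed here; one common choice for this stage is Direct Preference Optimisation (DPO). The sketch below shows only the core DPO loss on a batch of preference pairs, assuming per-sequence log-probabilities from the policy and a frozen reference model have already been computed.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Push the policy to prefer the chosen response over the rejected one,
    measured relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```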
Best Practices for Efficient VLM Training
Training a VLM is a multi-stage process that requires large amounts of training data and is computationally expensive. Moreover, even fine-tuning on a specific domain is memory-intensive and typically only feasible on high-resource systems. Previous work has identified several tips for training VLMs efficiently:
- Flamingo (Alayrac et al. 2022) was the first to use a small fixed number of visual tokens per image.
- Laurençon et al. (2024) propose LoRA (Hu et al., 2021) fine-tuning rather than full fine-tuning. This decreases GPU memory consumption and the number of parameters that need to be tuned, with no additional inference latency.
- More recently, models such as SmolVLM (Marafioti et al. 2024) have aggressively used the pixel shuffle strategy (Chen et al. 2024) to compress the patched visual information (sketched below). This enables the model to handle varying input resolutions and aspect ratios effectively.
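A minimal sketch of this kind of pixel-shuffle (space-to-depth) token compression, assuming a square grid of patch features whose side is divisible by the ratio: each ratio x ratio block of neighbouring patch tokens is merged into a single token, shrinking the token count by ratio^2 while growing the channel dimension by the same factor.

```python
import torch

def pixel_shuffle_compress(x: torch.Tensor, ratio: int = 2) -> torch.Tensor:
    """Merge each ratio x ratio block of patch features into one higher-dimensional token.
    x: (batch, grid*grid, dim) -> (batch, (grid//ratio)**2, dim*ratio**2)."""
    b, n, d = x.shape
    grid = int(n ** 0.5)
    x = x.reshape(b, grid, grid, d)
    x = x.reshape(b, grid // ratio, ratio, grid // ratio, ratio, d)
    x = x.permute(0, 1, 3, 2, 4, 5)  # group each ratio x ratio block together
    return x.reshape(b, (grid // ratio) ** 2, d * ratio ** 2)

# 32x32 = 1024 patch tokens -> 16x16 = 256 tokens with 4x the channel dimension.
tokens = pixel_shuffle_compress(torch.randn(1, 1024, 1152), ratio=2)
print(tokens.shape)  # torch.Size([1, 256, 4608])
```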
Pairing an LLM that has a long context window with a vision backbone that has an efficient resizing method also makes basic video analysis tasks feasible.
Conclusion
This article explores the evolution of multimodal vision language models (VLMs) and the key design choices involved in training them. We examine two main architectural approaches: cross-attention (pioneered by Flamingo) and self-attention (used in FROMAGe and BLIP-2). We highlight how most modern VLMs build upon pre-trained unimodal backbones rather than training from scratch, and discuss techniques to boost performance, including masked training and resolution adaptation. Finally, we outline the typical three-stage training process of pre-training, supervised fine-tuning, and alignment, with each stage serving a distinct purpose in model development.
Saurav
Developer Advocate, LightlyAI