Segment Anything Model and Friends

Recent years have witnessed the emergence of language models pre-trained on enormous corpora of unlabeled data. These systems demonstrate the ability to perform new tasks with minimal or no task-specific training, and research shows that this behavior improves with model scale, dataset size, and training compute. These capabilities are often elicited through engineered text prompts that steer the system’s output toward the desired result.

Their vision counterparts, however, have been explored to a lesser extent (Segment Anything, 2023). Models such as CLIP (2021) and ALIGN (2021) provide strong baselines for aligning images and text with separate encoders pre-trained using a contrastive objective. While vision-language models have seen significant advancement, core computer vision tasks such as segmentation have yet to progress at the same pace. This discrepancy is likely due to the challenges in obtaining large-scale, high-quality annotated datasets for tasks like segmentation.

Figure: Summary of recent SAM variants

This article will dive into the Segment Anything (SAM) family of models, their architectures, applications, and performance.

Segment Anything (SAM 1): Where It All Started

Kirillov et al. introduced the Segment Anything Model (SAM) and its corresponding dataset in Segment Anything, 2023 to build a foundation model for image segmentation, with the aim:

To develop a promptable model pre-trained on a broad dataset using a task that enables powerful generalization.

Notably, the success of the SAM project can be attributed to its three core components:

  1. A task formulation that enables zero-shot generalization.
  2. A model architecture that allows for such flexibility.
  3. A comprehensive dataset to allow the model to perform well at the task.

Figure: The three fundamental components of the Segment Anything project: the Promptable Segmentation Task, the Segment Anything Model, and the Segment Anything Dataset. Source: Figure 1 from Segment Anything, 2023

Let’s look into each component in detail.

Segment Anything Task

Inspired by the success of next-token prediction in NLP, the authors translate the idea of a “prompt” to segmentation. A prompt in this case can be a set of foreground/background points, a rough box or mask, free-form text, or, in general, any source of information indicating what to segment in an image.

The promptable segmentation task then is to return a valid segmentation mask given any “prompt”.

This notion of a valid segmentation mask is important because it helps deal with ambiguity, where a single prompt could refer to any number of objects. In particular, it forces the model to output a reasonable mask for at least one of those objects. This is illustrated by the following image:

Figure: Each of these columns represents valid masks generated by SAM based on a single ambiguous point. Source: Figure 3 from Segment Anything, 2023

Since the objective is to always predict a valid mask even when the prompt is ambiguous, the pre-trained model remains effective in use cases that involve ambiguity; this proves useful for automatic annotation in the SAM data engine. Moreover, it leads to a general method for zero-shot transfer to downstream segmentation tasks via prompting. For example, one can feed the model bounding boxes output by an object detector as the input prompt.
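As a rough illustration of this detector-to-SAM handoff, here is a minimal sketch using the publicly released segment_anything package; the checkpoint path, the image, and the detector box are all placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (path is a placeholder) and wrap it in a predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The heavy image encoder runs once per image, before any prompting.
image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)

# A bounding box from any object detector can serve as the prompt (XYXY pixels).
detector_box = np.array([100, 150, 300, 400])
masks, scores, _ = predictor.predict(box=detector_box, multimask_output=False)
```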

An important clarification is that SAM is inherently different from prior multi-task segmentation systems, where a model is pre-trained and evaluated on a fixed set of tasks. A model trained for promptable segmentation can perform new, different tasks at inference time by acting as a component in a larger system (much like how the DALL·E image generation system uses CLIP as a component).

Segment Anything Model

The promptable segmentation task requires a model that can support flexible prompting and can output segmentation masks in amortized real-time to allow for interactive use. The authors build on top of recent work in Transformer vision models while making careful adjustments to allow for speed and efficiency.

Figure: Segment Anything Model architecture. Source: Figure 4 from Segment Anything, 2023

The Segment Anything Model has 3 key components:

  1. Image Encoder: Owing to the success of self-supervised pre-training methods, the authors use an MAE pre-trained Vision Transformer (ViT), minimally adapted to process high-resolution images. This encoder runs once per image and can be applied before prompting the model. You can read more about MAEs on our blog.
  2. Flexible Prompt Encoder: We can broadly classify possible prompts into sparse (points, boxes, and text) and dense (masks).
    Sparse Prompts: The authors represent points and boxes using positional embeddings summed with learned embeddings for each prompt type, while text prompts are processed using an off-the-shelf text encoder from CLIP.
    Dense Prompts: Mask prompts are embedded using convolutions and summed element-wise with image embeddings.
  3. Fast Mask Decoder: The decoder efficiently maps the image embedding, the prompt embeddings, and an output mask token to a valid segmentation mask. The authors base their decoder on the Transformer decoder block with a modified dynamic mask prediction head that uses prompt self-attention and cross-attention to update all embeddings. An MLP then maps the output token to a dynamic linear classifier that computes the mask foreground probability at each image location (a toy illustration of this dynamic prediction follows the list).
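To make that last step concrete, here is a toy illustration of the dynamic-linear-classifier idea; the shapes and modules are illustrative placeholders, not SAM’s actual code.

```python
import torch
import torch.nn as nn

# Toy sketch: an MLP turns the mask output token into per-mask weights, which are
# then dotted with the upscaled image features at every spatial location.
embed_dim, h, w = 256, 64, 64
mask_token = torch.randn(1, embed_dim)             # output token after the decoder
image_features = torch.randn(1, embed_dim, h, w)   # upscaled image embedding

hypernet_mlp = nn.Sequential(
    nn.Linear(embed_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
)
dynamic_weights = hypernet_mlp(mask_token)          # (1, embed_dim)

# Per-location foreground logit: dot product of the dynamic weights with the
# feature vector at each spatial position.
mask_logits = torch.einsum("bc,bchw->bhw", dynamic_weights, image_features)
mask_probs = mask_logits.sigmoid()                  # foreground probability map
```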

To deal with ambiguity, the authors modify the model to predict multiple output segmentation masks for a single prompt. The outputs are accompanied by the model’s confidence score (a predicted IoU) for each mask. The model is trained with a linear combination of focal loss and dice loss, backpropagating only the minimum loss over the predicted masks.
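Sketching the training objective (a standard focal/dice formulation with the 20:1 focal-to-dice weighting reported in the paper; not the authors’ exact code):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    # Standard binary focal loss, averaged over pixels.
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    prob = logits.sigmoid()
    p_t = prob * target + (1 - prob) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    prob, target = logits.sigmoid().reshape(-1), target.reshape(-1)
    inter = (prob * target).sum()
    return 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)

# Three candidate masks for one ambiguous prompt; backpropagate only the best one.
pred_logits = torch.randn(3, 64, 64, requires_grad=True)
gt_mask = (torch.rand(64, 64) > 0.5).float()
per_mask_loss = torch.stack(
    [20.0 * focal_loss(m, gt_mask) + dice_loss(m, gt_mask) for m in pred_logits]
)
per_mask_loss.min().backward()
```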

Segment Anything Data Engine and Dataset

To achieve strong generalization, it was essential to train the model on a large and diverse set of masks. Since no such dataset was publicly available, the authors developed a “data engine” that allowed them to co-develop the model and the dataset using model-in-the-loop annotation. They could thus iterate between using the model to assist in data collection and using the newly collected data to improve the model. The data engine has three stages:

  1. Assisted Manual: In this stage, SAM assists annotators in annotating masks similar to a classic interactive segmentation setting.
  2. Semi-automatic: In this stage, SAM automatically generates masks for a subset of objects by prompting it with likely object locations, while annotators focus on annotating the remaining objects, which helps increase mask diversity.
  3. Fully automatic: In this stage, SAM is prompted with a regular grid of foreground points, from which it outputs high-quality masks (see the sketch after this list).
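The released segment_anything package exposes a similar grid-prompting pipeline through SamAutomaticMaskGenerator; a minimal usage sketch (placeholder checkpoint path and image) might look like this:

```python
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder path

# Prompt SAM with a regular grid of points and keep only confident, stable masks.
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # 32x32 grid of foreground point prompts
    pred_iou_thresh=0.88,         # filter by the model's own IoU prediction
    stability_score_thresh=0.95,  # filter by mask stability under thresholding
)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a real RGB image
masks = mask_generator.generate(image)            # list of dicts: "segmentation", "predicted_iou", ...
```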

Using this data engine, the authors released the Segment Anything 1B (SA-1B) dataset, containing more than 1B masks from 11M licensed and privacy-preserving images, roughly 400x more masks than any existing segmentation dataset.

Segment Anything Model v1: Results

The authors report zero-shot transfer performance on five tasks, evaluating SAM on datasets and tasks not seen during training, including novel image distributions.

Zero-Shot Single Point Valid Mask Evaluation


The authors evaluated the model on the task of segmenting an object from a single foreground point. This task is particularly ill-posed, as one point can refer to multiple objects. Since SAM is capable of predicting multiple masks, they evaluate only the most confident mask by default. SAM yields higher results than the strongest baseline on 16 of the 23 datasets, by a considerable margin. Moreover, since ground-truth masks in most datasets do not enumerate all possible masks, they supplement the standard mIoU metric with a human study. The annotators consistently rate the quality of SAM’s masks substantially higher than those of the strongest baseline.

Figure: Mask quality ratings for zero-shot single-point valid mask evaluation

These results indicate that SAM has learned to segment valid masks from a single point.

Zero-Shot Edge Detection

The authors evaluated SAM on the classic low-level task of edge detection using a simplified version of the automatic mask generation pipeline from the data engine. Compared to the ground truth, SAM predicts more edges; qualitatively, even though SAM was not trained for edge detection, it produces reasonable edge maps.

Figure: Zero-shot transfer to edge detection
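A rough sketch of the post-processing idea described in the paper (Sobel-filter the mask probability maps and combine them into an edge map); this mirrors the description rather than the exact pipeline:

```python
import numpy as np
from scipy import ndimage

def masks_to_edges(prob_maps, threshold=0.5):
    # Sobel-filter each mask probability map and merge the gradient magnitudes.
    edge_map = np.zeros_like(prob_maps[0], dtype=np.float32)
    for prob in prob_maps:
        gx = ndimage.sobel(prob, axis=0)
        gy = ndimage.sobel(prob, axis=1)
        edge_map = np.maximum(edge_map, np.hypot(gx, gy))
    return edge_map > threshold

# Toy example with two random "probability maps".
toy_maps = [np.random.rand(64, 64).astype(np.float32) for _ in range(2)]
edges = masks_to_edges(toy_maps)
```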

Zero-Shot Object Proposals

When evaluated on the mid-level task of object proposal generation, SAM does remarkably well on several metrics. It outperforms ViTDet-H on medium and large objects as well as on rare and common objects. SAM only underperforms ViTDet-H on small and frequent objects, likely because ViTDet-H was trained on LVIS, unlike SAM.

Figure: Object Proposal Generation on LVIS

Zero-Shot Instance Segmentation

SAM can also be used as the segmentation module of an instance segmenter. SAM performs reasonably close to ViTDet-H (even outperforming it in human studies) and qualitatively produces better, crisper masks.

Figure: Instance Segmentation results when performing zero-shot transfer

Zero-Shot Text-to-Mask

When evaluated on segmenting objects from free-form text, SAM is able to segment objects based on simple text prompts. When it fails to pick the right object, an additional point prompt often fixes the prediction.

Figure: SAM’s ability to produce segmentation masks based on simple text prompts

🏃 Making SAM faster

FastSAM

SAM was regarded as a milestone vision foundation model with its ability to segment any object within the image, guided by various possible user interaction prompts. However, its industrial impact was limited by its computational cost.

SAM uses a Transformer backbone with worst-case quadratic complexity in the number of tokens, so at higher input resolutions it demands heavy computational resources, which presents a hurdle to practical deployment, especially in real-time applications. Zhao et al., in Fast Segment Anything, 2023, decoupled the segment anything task introduced by SAM into two sequential stages built around a CNN-based detector.

By directly training a CNN detector on only 2% (1/50) of the SA-1B dataset, the authors achieved comparable performance to SAM, but with drastically reduced computational and resource demands, enabling real-time application.

Figure: FastSAM framework. Source: Figure 2 from Fast Segment Anything, 2023

The authors propose to break down the Segment Anything Task into All-instance segmentation and Prompt-guided selection. The first stage produces the segmentation masks of all instances in the image using a YOLOv8-based CNN backbone, while the second stage outputs the region of interest corresponding to the prompt.
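To make the second stage concrete, here is a rough sketch of prompt-guided selection over the all-instance masks, in plain NumPy; this is illustrative logic, not FastSAM’s actual implementation (which also handles text prompts via CLIP).

```python
import numpy as np

def select_by_point(masks, point):
    # Keep the mask(s) whose foreground contains the clicked (x, y) point.
    x, y = point
    return [m for m in masks if m[y, x]]

def select_by_box(masks, box):
    # Keep the mask with the highest IoU against the prompt box (XYXY pixels).
    x1, y1, x2, y2 = box
    box_mask = np.zeros_like(masks[0], dtype=bool)
    box_mask[y1:y2, x1:x2] = True
    ious = [np.logical_and(m, box_mask).sum() / np.logical_or(m, box_mask).sum()
            for m in masks]
    return masks[int(np.argmax(ious))]

# Toy example: two binary instance masks from the all-instance stage.
masks = [np.zeros((100, 100), dtype=bool) for _ in range(2)]
masks[0][10:40, 10:40] = True
masks[1][50:90, 50:90] = True
print(select_by_point(masks, (20, 20))[0].sum())     # picks the first mask
print(select_by_box(masks, (45, 45, 95, 95)).sum())  # picks the second mask
```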

Having been trained on only 2% of the SA-1B dataset, FastSAM offers competitive performance to SAM with faster inference times.

Figure: Comparison of running speeds FastSAM vs SAM. Source: Table 1 from Fast Segment Anything, 2023

Figure: FastSAM performs comparable to SAM on zero-shot transfer to edge detection. Source: Table 2 from Fast Segment Anything, 2023

MobileSAM

Improving on FastSAM, the authors of MobileSAM, 2023 propose distilling the knowledge from SAM’s heavy image encoder into a lightweight image encoder, making MobileSAM about 5x faster than the concurrent FastSAM and 7x smaller.

Performing knowledge distillation on SAM proves to be a challenge since optimization of the image encoder depends on the quality of the mask decoder, and vice versa; when both modules are in a bad state, it is harder to train them to a good state. The authors therefore propose to distill the small image encoder directly from the image encoder of the original SAM, without involving the mask decoder. This brings the benefit of a readily usable mask decoder for finetuning instead of training one from scratch.

Figure: Proposed Decoupled Knowledge Distillation paradigm as introduced in MobileSAM, 2023.
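A minimal sketch of the decoupled distillation idea follows; the modules here are toy stand-ins (the real teacher is SAM’s ViT-H encoder and the real student a much smaller encoder such as TinyViT).

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch stays runnable: both "encoders" map a 1024x1024
# image to a 256x64x64 embedding, matching the shape SAM's encoder produces.
teacher_encoder = nn.Conv2d(3, 256, kernel_size=16, stride=16).eval()  # frozen teacher
student_encoder = nn.Conv2d(3, 256, kernel_size=16, stride=16)         # trainable student

optimizer = torch.optim.AdamW(student_encoder.parameters(), lr=1e-4)
images = torch.randn(2, 3, 1024, 1024)

# Decoupled distillation: match the student's image embeddings to the frozen
# teacher's embeddings directly, without involving the mask decoder at all.
with torch.no_grad():
    target = teacher_encoder(images)
loss = nn.functional.mse_loss(student_encoder(images), target)
loss.backward()
optimizer.step()
```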

Compared to SAM, MobileSAM offers comparable performance at a much smaller size and faster inference time.

Figure: SAM vs MobileSAM in terms of model size and inference speed as measured on a single GPU. Source: Table 3 from MobileSAM, 2023

The authors also provide qualitative results comparing SAM, MobileSAM, and FastSAM.

Figure: Comparison of generated masks. Source: Figure 6 from MobileSAM, 2023.

EfficientSAM

Xiong et al. propose an interesting framework in EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything, 2023, exploiting recent successes in masked pre-training and knowledge distillation.

Figure: EfficientSAM framework. Source: Figure 2 from EfficientSAM, 2023.

They propose SAM-leveraged masked image pre-training (SAMI), which uses the SAM encoder (ViT-H) to generate feature embeddings and trains a masked image model with a lightweight encoder to reconstruct the ViT-H features of SAM instead of raw image patches. This leads to generalized ViT backbones that can be used for downstream tasks such as image classification, object detection, and the segment anything task.

At each training iteration, SAMI consists of a feedforward feature-extraction pass through the SAM image encoder, plus a feedforward and backpropagation pass through the MAE. The outputs of the SAM image encoder and of the MAE’s linear projection head are compared to compute the reconstruction loss.
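A toy sketch of this objective under those assumptions (placeholder shapes and modules; the real student is an MAE-style lightweight ViT whose decoder output is projected to the teacher’s feature width):

```python
import torch
import torch.nn as nn

batch, num_patches, teacher_dim, student_dim = 2, 196, 1280, 384

# Stand-in for frozen SAM ViT-H patch features (the reconstruction target).
with torch.no_grad():
    teacher_features = torch.randn(batch, num_patches, teacher_dim)

# Stand-in for the lightweight encoder + MAE decoder output, one token per patch
# (in the real method most patches are masked on the encoder side).
student_tokens = torch.randn(batch, num_patches, student_dim, requires_grad=True)

# A linear projection head aligns the student width with the teacher feature width;
# the reconstruction loss compares features, not raw pixels.
projection_head = nn.Linear(student_dim, teacher_dim)
loss = nn.functional.mse_loss(projection_head(student_tokens), teacher_features)
loss.backward()
```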

This can be seen as a form of knowledge distillation; to read more about knowledge distillation and its connections with masked pre-training, please refer to our article.

This SAMI pre-trained lightweight encoder then serves as the image encoder of EfficientSAM, which is finetuned on SA-1B.

The authors evaluate performance on point-based and box-based prompt segmentation and report comparisons with SAM, FastSAM, and MobileSAM.

Figure: Comparison on Zero-shot instance segmentation. Source: Table 5 from EfficientSAM, 2023.

Figure: Qualitative analysis of point prompts. Source: Figure 3 from EfficientSAM, 2023.

SAM 2

Ravi et al. generalize the notion of the Segment Anything task to videos as the Promptable Visual Segmentation (PVS) task in the SAM 2, 2024 paper. Following the recent push toward multi-modal models, the authors aim to create a unified model for video and image segmentation.

Figure: SAM 2 Framework. Source: Figure 1 from SAM 2, 2024

Similar to SAM, this version also includes a task, a model, and a data engine + dataset.

You can play around with a CPU version of the model in this app.

Promptable Visual Segmentation (PVS) Task

This generalization of the Segment Anything task allows us to provide prompts to the model on any video frame. Prompts can be positive/negative clicks, bounding boxes, or masks, either to define an object to segment or to refine a model-predicted one.

Moreover, to work in an interactive setting, the model must respond immediately with a valid segmentation mask of the object on that frame upon receiving a prompt.

Figure: Propagation of input prompts in SAM 2. Source: Figure 2 from SAM 2, 2024

Upon receiving input, the model propagates these prompts to obtain the masklet (a spatio-temporal mask) of the target object across the entire video. Moreover, additional prompts can be provided on any frame to refine the masklet throughout the video.
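As an illustration of interactive prompting and propagation, here is a sketch following the public sam2 repository’s video predictor; the config and checkpoint names are placeholders, and the exact function names and signatures should be treated as assumptions to verify against the repository.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder config/checkpoint names; call names follow the public sam2 repo
# at the time of writing and should be double-checked.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode():
    # The video is given as a directory of frames; the state caches per-frame features.
    state = predictor.init_state(video_path="./video_frames")

    # One positive click on frame 0 defines the object to track (object id 1).
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[320, 240]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt through the video to obtain the object's masklet.
    masklet = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masklet[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```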

This allows them to segment objects across videos with a good interactive experience and build a strong model with a large and diverse dataset.

Figure: Previously studied tasks such as Segment Anything (SA) and semi-supervised Video Object Segmentation (VOS) can be seen as special cases of the PVS task. Source: Figure 8 from SAM 2, 2024

SAM 2 Model

The authors extend the model architecture of SAM and adapt it to make it work with videos. SAM 2 takes as input a stream of video frames along with point, box, or mask prompts and outputs segmentation masks.

Figure: SAM 2 model architecture. Source: Figure 3 from SAM 2, 2024

  • A Hiera image encoder, pre-trained with MAE, generates embeddings for each frame.
  • A transformer-based memory attention block conditions the current frame’s features on past frames’ features and predictions, as well as on any new prompts, allowing the model to learn spatio-temporal features (a toy sketch of this memory attention follows the list).
  • SAM 2 uses the same prompt encoder as SAM.
  • Unlike SAM, however, for the PVS task there might be frames where no valid mask exists due to occlusion. To account for this, the authors add a head that predicts whether the object of interest is present in the current frame.
  • A memory encoder and memory bank retain information about predictions for past frames.
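A toy sketch of the memory-attention idea (illustrative modules and shapes, not SAM 2’s actual implementation): current-frame tokens cross-attend to a rolling bank of memory tokens from earlier frames.

```python
import torch
import torch.nn as nn

dim, frame_tokens, mem_tokens_per_frame = 256, 64 * 64, 16
cross_attention = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

memory_bank = []  # tensors of shape (1, mem_tokens_per_frame, dim) from past frames

def condition_on_memory(frame_features):
    # Return current-frame features conditioned on the memory bank (residual update).
    if not memory_bank:
        return frame_features
    memories = torch.cat(memory_bank, dim=1)
    attended, _ = cross_attention(frame_features, memories, memories)
    return frame_features + attended

for t in range(3):  # pretend video of 3 frames
    frame_features = torch.randn(1, frame_tokens, dim)   # stand-in Hiera encoder output
    conditioned = condition_on_memory(frame_features)
    # ... the mask decoder would run on `conditioned` here ...
    # Store a compact "memory" of this frame and keep only a small rolling window.
    memory_bank.append(conditioned[:, :mem_tokens_per_frame, :].detach())
    memory_bank[:] = memory_bank[-6:]
```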

SAM 2 Model Performance

The authors report significant performance improvements for zero-shot video tasks and image tasks.

  • Human annotators using SAM 2 in the loop are roughly 8x faster at labeling a frame than with the original SAM.
  • SAM 2 shows significant improvement over the best existing methods in both accuracy and inference speed.

Figure: Comparison to prior work on Video Object Segmentation. Source: Table 7 from SAM 2, 2024.

  • Compared to SAM, SAM 2 also performs better on the Segment Anything Task.

Figure: Zero-shot accuracy on the Segment Anything (SA) task. Source: Table 6 from SAM 2, 2024.

Conclusion

The Segment Anything Model (SAM) and its successors mark a significant leap forward in computer vision, particularly in image and video segmentation. Following SAM’s innovative approach to promptable segmentation, the literature evolved rapidly to address key challenges such as computational efficiency and real-time performance. FastSAM, MobileSAM, and EfficientSAM each brought unique optimizations, dramatically reducing model size and inference time while maintaining competitive performance.

The latest iteration, SAM 2, extends these capabilities to video, introducing the Promptable Visual Segmentation task and demonstrating impressive results in both image and video domains. These advancements showcase the power of foundation models in computer vision, mirroring the success of large language models in NLP.