Segment Anything Model (SAM)

Sam Naji, Joseph Tekriti
Computer Vision
October 16, 2023

The Segment Anything Model (SAM) is a novel artificial intelligence (AI) model from Meta AI that can "cut out" any object in an image with a single click. SAM is a promptable segmentation system with zero-shot generalization to unfamiliar images and objects, which makes it a powerful tool for a wide range of applications such as image editing, photo manipulation, and robotics. SAM is trained on a massive dataset of images and segmentation masks covering a wide variety of objects, scenes, and backgrounds, allowing it to learn to segment objects under many different conditions. Meta AI built the largest segmentation dataset to date: over 1 billion masks on 11 million licensed and privacy-respecting images.

SAM can segment any object in an image or video with high quality and efficiency. Segmentation is the separation of an object from its background or from other objects, producing a mask that outlines the object's shape and boundaries. The SAM model facilitates editing, compositing, tracking, recognition, and analysis workflows.

The authors' focus was on developing a foundation model for image segmentation, with the intent to solve a range of downstream segmentation problems on new data distributions using prompt engineering. The success of this plan hinges on three components: the task, the model, and the data. To develop them, they address the following questions about image segmentation:

  1. What task will enable zero-shot generalization?
  2. What is the corresponding model architecture?
  3. What data can power this task and model?

In this article we describe the SAM model. It requires a model that supports flexible prompting and can output segmentation masks in real time when prompted. The authors' aim was to always predict a valid mask for any prompt, even when the prompt is ambiguous.

Segment Anything Model (SAM)

SAM has three components, illustrated in the figure below: an image encoder, a flexible prompt encoder, and a fast mask decoder. These three components work together to return a valid segmentation mask:

(Figure source: SAM paper)

Image encoder

The image encoder runs once per image and can be applied before the model is prompted, producing a one-time image embedding. An MAE pre-trained Vision Transformer (ViT) is used to obtain this embedding.
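The key design point here is that the expensive encoder runs once per image, while the cheap decoder can then be queried many times with different prompts. A minimal sketch of that caching pattern, using stub functions in place of the real ViT encoder and mask decoder (shapes and the 16× downsampling factor are illustrative assumptions):

```python
import numpy as np

def encode_image(image):
    """Stand-in for SAM's ViT image encoder: an expensive call that
    maps an image to a spatial embedding. Runs once per image."""
    h, w = image.shape[:2]
    return np.random.default_rng(0).standard_normal((256, h // 16, w // 16))

def decode_mask(image_embedding, point_prompt):
    """Stand-in for the lightweight mask decoder: cheap, runs per prompt."""
    _, h, w = image_embedding.shape
    return np.zeros((h * 16, w * 16), dtype=bool)

image = np.zeros((64, 64, 3))
embedding = encode_image(image)  # expensive step, done exactly once
# Many prompts reuse the same cached embedding:
masks = [decode_mask(embedding, p) for p in [(10, 10), (32, 40), (50, 5)]]
```

This is why SAM's web demo feels interactive: after the one-time encoding, each click only triggers the fast decoder.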

Prompt encoder

The prompt encoder handles two sets of prompts: sparse (points, boxes, text) and dense (masks). Points and boxes are represented by positional encodings summed with learned embeddings for each prompt type. Free-form text prompts are represented with an off-the-shelf text encoder from CLIP.
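The idea of "positional encodings combined with learned embeddings" can be sketched as follows. This is a simplified illustration, not SAM's exact implementation: the Fourier-feature projection matrix, embedding size, and prompt-type vectors are all assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 256

# Fixed random Gaussian matrix used for the positional (Fourier) encoding.
pe_gaussian = rng.standard_normal((2, embed_dim // 2))

def positional_encoding(points, image_size):
    """Map (x, y) pixel coordinates to Fourier features of size embed_dim."""
    coords = points / image_size             # normalize to [0, 1]
    coords = (2 * coords - 1) @ pe_gaussian  # project to embed_dim // 2 dims
    return np.concatenate([np.sin(2 * np.pi * coords),
                           np.cos(2 * np.pi * coords)], axis=-1)

# Learned embeddings, one per prompt type (here: foreground vs. background point).
type_embeddings = {"foreground": rng.standard_normal(embed_dim),
                   "background": rng.standard_normal(embed_dim)}

points = np.array([[100.0, 200.0], [50.0, 60.0]])
sparse_embeddings = positional_encoding(points, 1024.0) + type_embeddings["foreground"]
```

Each click thus becomes a 256-dimensional token that tells the decoder both *where* the prompt is and *what kind* of prompt it is.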

Mask decoder

The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. To update all embeddings, their modified decoder block employs prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice versa).
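The "two-way" attention pattern can be illustrated with a bare-bones single-head block. This is a structural sketch only (no layer norms, MLPs, or multiple heads), with assumed token counts and dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention (single head, no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def two_way_block(tokens, image_emb):
    """One decoder block: token self-attention, then cross-attention in
    both directions, so prompts and image features update each other."""
    tokens = tokens + attention(tokens, tokens, tokens)            # self-attention
    tokens = tokens + attention(tokens, image_emb, image_emb)      # tokens -> image
    image_emb = image_emb + attention(image_emb, tokens, tokens)   # image -> tokens
    return tokens, image_emb

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 256))           # output token + prompt tokens
image_emb = rng.standard_normal((64 * 64, 256))  # flattened image embedding
tokens, image_emb = two_way_block(tokens, image_emb)
```

Updating the image embedding from the prompts (and not just the reverse) is what lets the mask head localize the object implied by the prompt.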

To address ambiguity in image segmentation, they modify the model to predict multiple output masks for a single prompt. This allows the model to capture different possible interpretations of the prompt. During training, they backpropagate only the minimum loss over the masks. To rank the masks, the model predicts a confidence score for each one. They found that 3 mask outputs are sufficient for most common cases, as shown in the figure below.

(Figure source: SAM paper)
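The "backpropagate only the minimum loss" trick can be sketched with a toy IoU loss. A single click on, say, a shirt could mean the shirt, the torso, or the whole person; the model is only penalized for its best guess. The loss function and candidate masks here are illustrative assumptions:

```python
import numpy as np

def iou_loss(pred, target):
    """1 - IoU between two boolean masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return 1.0 - inter / union

target = np.zeros((32, 32), dtype=bool)
target[8:24, 8:24] = True  # ground-truth mask

# Three candidate masks for one ambiguous prompt (e.g. part / whole object).
candidates = [np.zeros((32, 32), dtype=bool) for _ in range(3)]
candidates[0][8:24, 8:24] = True   # matches the target exactly
candidates[1][:16, :] = True       # a different plausible region
candidates[2][:, :] = True         # the whole image

losses = [iou_loss(m, target) for m in candidates]
best = int(np.argmin(losses))  # only this mask's loss would be backpropagated
```

At inference time, the per-mask confidence scores play the role that the ground-truth comparison plays here: they pick which of the 3 interpretations to surface.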

At the start of the data-collection stage, SAM was trained on common public segmentation datasets. Once enough data had been annotated, SAM was retrained using only the newly annotated masks. As more masks were collected, the image encoder was scaled from ViT-B to ViT-H and other architectural details evolved; in total they retrained the model 6 times. Average annotation time per mask decreased from 34 to 14 seconds as the model improved, and the average number of masks per image increased from 20 to 44. Overall, they collected 4.3M masks from 120k images in this stage.

Segment Anything RAI Analysis

A Responsible AI (RAI) analysis was conducted to examine potential fairness concerns and biases associated with the use of SA-1B and SAM. The focus was on the fairness of SAM across protected attributes of people.

Experiment Implementation Details

Using a collection of 23 different segmentation datasets from earlier work, they created a new segmentation benchmark to assess the model's zero-shot transfer capabilities. For zero-shot edge prediction, SAM is prompted with a 16×16 regular grid of foreground points, resulting in 768 predicted masks (3 per point). Redundant masks are removed by NMS, and edge maps are then computed using Sobel filtering. Although SAM was not trained to predict edge maps and had no access to BSDS images or annotations during training, it produces reasonable edge maps with high performance.

(Figure source: SAM paper)
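Two pieces of that edge-prediction pipeline are easy to sketch: the 16×16 point grid (16 × 16 = 256 prompts, hence 256 × 3 = 768 masks) and the edge extraction from a mask probability map. The gradient filter below is a simplified stand-in for a true Sobel kernel (it omits the smoothing component), and all sizes are assumptions:

```python
import numpy as np

def point_grid(n, size):
    """An n x n regular grid of foreground point prompts over the image."""
    step = size / n
    coords = step / 2 + step * np.arange(n)
    return np.stack(np.meshgrid(coords, coords), axis=-1).reshape(-1, 2)

def edge_magnitude(prob_map):
    """Gradient magnitude of a mask probability map (simplified Sobel)."""
    gx = np.zeros_like(prob_map)
    gy = np.zeros_like(prob_map)
    gx[:, 1:-1] = prob_map[:, 2:] - prob_map[:, :-2]  # horizontal gradient
    gy[1:-1, :] = prob_map[2:, :] - prob_map[:-2, :]  # vertical gradient
    return np.hypot(gx, gy)

points = point_grid(16, 1024)  # 256 prompts -> up to 768 masks (3 per point)
prob = np.zeros((64, 64))
prob[16:48, 16:48] = 1.0       # a toy mask probability map
edges = edge_magnitude(prob)   # responds only at the mask boundary
```

The edge response is zero inside and outside the mask and nonzero only on its boundary, which is exactly why unioned, NMS-filtered mask outlines yield a usable edge map.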

For zero-shot instance segmentation, they prompt SAM with the boxes produced by a fully supervised ViTDet-H on the COCO and LVIS v1 validation splits. They then perform an additional iteration of mask refinement by passing the most confident predicted mask, together with the box prompt, back to the mask decoder to produce the final prediction; the results are shown in the figure below.

(Figure source: SAM paper)

They also consider an even higher-level task: segmenting objects from free-form text. The zero-shot text-to-mask experiment was a proof of concept of SAM's ability to process text. As the figure below shows, SAM can segment objects based on simple text prompts like "a wheel" as well as phrases like "beaver tooth grille".

(Figure source: SAM paper)

An online demo is available on the Segment Anything website; follow the steps described there, and make sure that uploaded images do not violate any intellectual property rights or Facebook's Community Standards. Moreover, Meta specifies that it is a research demo and may not be used for any commercial purpose. The rest of this article shows how you can clone and deploy SAM on your own machine.

Installation on local machine

The code requires python>=3.8, as well as pytorch>=1.7 and torchvision>=0.8. Follow the instructions on the PyTorch website to install the PyTorch and TorchVision dependencies. Installing both PyTorch and TorchVision with CUDA support is strongly recommended.

Clone the repository locally and install with:

git clone git@github.com:facebookresearch/segment-anything.git
cd segment-anything; pip install -e .

The following optional dependencies are necessary for mask post-processing, saving masks in COCO format, the example notebooks, and exporting the model in ONNX format. jupyter is also required to run the example notebooks.

pip install opencv-python pycocotools matplotlib onnxruntime onnx

Download a model checkpoint. Then the model can be used in just a few lines to get masks from a given prompt:

from segment_anything import SamPredictor, sam_model_registry
sam = sam_model_registry["model_type"](checkpoint="path/to/checkpoint")
predictor = SamPredictor(sam)
predictor.set_image(your_image)
masks, _, _ = predictor.predict(input_prompts)

or generate masks for an entire image:

from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
sam = sam_model_registry["model_type"](checkpoint="path/to/checkpoint")
mask_generator = SamAutomaticMaskGenerator(sam)
masks = mask_generator.generate(your_image)
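The generator returns a list of dicts, one per mask, with fields such as `segmentation`, `area`, `bbox`, `predicted_iou`, and `stability_score`. A common post-processing step is to sort and filter them; the records below are made-up sample data in that shape:

```python
# Sample records shaped like SamAutomaticMaskGenerator output (values invented).
masks = [
    {"segmentation": [[True, False]], "area": 5120,
     "bbox": [10, 20, 80, 64], "predicted_iou": 0.97, "stability_score": 0.95},
    {"segmentation": [[False, True]], "area": 240,
     "bbox": [0, 0, 16, 15], "predicted_iou": 0.88, "stability_score": 0.91},
]

# Largest masks first, then drop low-confidence predictions.
masks = sorted(masks, key=lambda m: m["area"], reverse=True)
keep = [m for m in masks if m["predicted_iou"] > 0.9]
```

Sorting by area before drawing overlays ensures small objects are not hidden underneath large ones, and the `predicted_iou` threshold is a simple quality gate.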

Additionally, masks can be generated for images from the command line:

python scripts/amg.py --checkpoint path/to/checkpoint --model-type model_type --input image_or_folder --output path/to/output

See the example notebooks on using SAM with prompts and on automatically generating masks for more details.

(Source: SAM repository)


The Segment Anything Model (SAM) introduced by Meta AI represents a significant advancement in image segmentation technology. By combining a large-scale dataset with a three-component architecture, SAM segments objects in images and video efficiently, even under zero-shot conditions. Its ability to interpret both visual cues and free-form text prompts extends its application across domains from image editing to robotics. The model's flexibility, combined with a responsible AI analysis of its fairness, makes it a comprehensive tool in the AI landscape. With straightforward installation and deployment instructions, SAM makes advanced image segmentation more accessible to developers and researchers alike.
