SAM 3

What It Is

  • SAM 3 is Meta's Segment Anything Model 3.
  • The paper title is "SAM 3: Segment Anything with Concepts".
  • It extends promptable segmentation from visual prompts to Promptable Concept Segmentation.
  • Concept prompts can be short noun phrases, image exemplars, or combinations of the two.
  • The model detects, segments, and tracks all matching object instances in images and videos.
  • It is a foundation segmentation model, not an autonomous-driving-specific detector.
  • Meta released code, checkpoints, examples, and the SA-Co benchmark.

Core Technical Idea

  • Decouple concept recognition from precise mask localization.
  • Use a shared backbone for image-level detection and memory-based video tracking.
  • Add a presence head to improve decisions about whether a queried concept exists in the scene (a minimal sketch follows this list).
  • Train with hard negatives so visually similar but wrong concepts are rejected.
  • Scale supervision with a data engine that annotates millions of open-vocabulary concepts.
  • Return masks and stable identities for all instances matching the prompt.
  • Unify image segmentation and video tracking under one promptable concept interface.
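
The presence-head idea can be made concrete with a minimal PyTorch sketch. Everything here is an assumption for illustration (module names, dimensions, and the multiplicative gating); it is not Meta's implementation, only the decoupling the list describes: one score for "is the concept in the image", separate scores for "which instances match".

```python
# Illustrative sketch of a presence-gated scoring head, assuming a
# DETR-style detector with per-query scores. Not Meta's code.
import torch
import torch.nn as nn

class PresenceGatedScoring(nn.Module):
    """Decouple 'is the concept in the image at all?' (presence) from
    'which detection queries match it?' (per-instance recognition)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.presence_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.query_head = nn.Linear(dim, 1)

    def forward(self, global_token: torch.Tensor, queries: torch.Tensor):
        # global_token: (B, dim) pooled image/prompt representation
        # queries:      (B, Q, dim) per-instance detection queries
        presence = torch.sigmoid(self.presence_head(global_token))       # (B, 1)
        per_query = torch.sigmoid(self.query_head(queries)).squeeze(-1)  # (B, Q)
        # One low presence score suppresses every instance at once
        # when the queried concept is absent from the scene.
        return presence * per_query

scores = PresenceGatedScoring()(torch.randn(2, 256), torch.randn(2, 100, 256))
print(scores.shape)  # torch.Size([2, 100])
```

The design point is that rejecting an absent concept is learned once, at the image level, rather than independently by every detection query.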

Inputs and Outputs

  • Inputs can be still images or videos.
  • Prompt inputs in the released code path include text concepts, image exemplars, points, boxes, and masks (see the usage sketch after this list).
  • Text prompts are short object or concept phrases rather than full task instructions.
  • Outputs are instance masks for all matching objects.
  • Video outputs include object identities tracked over time.
  • The model can also support classic visual segmentation prompts from the SAM lineage.
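
A hypothetical call pattern for the input/output contract above. The released repository's actual entry points, argument names, and result fields may differ; treat `segment_concepts`, `ConceptResult`, and every field here as an assumption.

```python
# Hypothetical usage sketch; not the released SAM 3 API.
from dataclasses import dataclass
import numpy as np

@dataclass
class ConceptResult:
    masks: np.ndarray        # (N, H, W) boolean instance masks
    scores: np.ndarray       # (N,) confidence per instance
    ids: np.ndarray | None   # (N,) track identities, video mode only

def segment_concepts(image: np.ndarray,
                     text: str | None = None,
                     exemplars: list[np.ndarray] | None = None,
                     threshold: float = 0.5) -> ConceptResult:
    """Placeholder for a concept-prompted call: a short noun phrase and/or
    image exemplars go in, masks for every matching instance come out."""
    raise NotImplementedError("wire this to the released SAM 3 checkpoint")

# Intended call pattern (exemplar prompts would be passed the same way):
# result = segment_concepts(frame, text="baggage cart", threshold=0.6)
# for mask, score in zip(result.masks, result.scores): ...
```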

Architecture and Evaluation Protocol

  • SAM 3 combines an image-level detector with a memory-based tracker (a toy identity sketch follows this list).
  • Recognition and localization are separated to reduce interference between concept matching and mask quality.
  • The presence head first decides whether the queried concept appears in the scene at all, so instance scores do not rest on mask confidence alone.
  • The SA-Co benchmark evaluates Promptable Concept Segmentation at large vocabulary scale.
  • The official GitHub describes SA-Co as containing 270K unique concepts.
  • The data engine is reported to annotate over 4M unique concepts.
  • The paper reports about a 2x gain over existing systems in image and video PCS.
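
A toy sketch of the identity bookkeeping a memory-based tracker performs: each frame's instance embeddings are greedily matched to stored track embeddings by cosine similarity. SAM 3's transformer memory tracker is substantially more involved; this only illustrates how instances keep stable identities across frames.

```python
# Toy identity-association sketch. Illustrative only; greedy matching,
# no one-to-one constraint, no occlusion handling.
import numpy as np

def associate(memory: dict[int, np.ndarray], embeddings: np.ndarray,
              sim_threshold: float = 0.7) -> list[int]:
    """Assign a track id to each new instance embedding, creating
    new ids for unmatched instances."""
    next_id = max(memory, default=-1) + 1
    assigned = []
    for emb in embeddings:
        best_id, best_sim = None, sim_threshold
        for tid, mem in memory.items():
            sim = float(emb @ mem) / (np.linalg.norm(emb) * np.linalg.norm(mem))
            if sim > best_sim:
                best_id, best_sim = tid, sim
        if best_id is None:
            best_id, next_id = next_id, next_id + 1
        memory[best_id] = emb  # refresh memory with the latest appearance
        assigned.append(best_id)
    return assigned

# Two frames of 4-dim embeddings: new tracks, then re-identification.
mem: dict[int, np.ndarray] = {}
print(associate(mem, np.eye(4)[:2]))         # [0, 1]
print(associate(mem, np.eye(4)[:2] + 0.01))  # [0, 1]
```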

Training and Evaluation

  • SAM 3 is trained with large-scale image and video segmentation data plus concept labels.
  • Hard negatives are included to improve fine-grained text discrimination (a loss-level sketch follows this list).
  • The released repository includes inference and finetuning code.
  • The official Meta page reports improvements over previous SAM capabilities on visual segmentation tasks.
  • The GitHub page reports 75 to 80 percent of human performance on SA-Co.
  • Evaluation includes image PCS, video PCS, and prior SAM-style promptable segmentation tasks.
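
One way hard-negative supervision could look at the loss level, assuming one presence logit per (image, concept-prompt) pair with a binary label. The weighting scheme and function names are illustrative assumptions, not the paper's recipe.

```python
# Sketch of hard-negative presence supervision. Illustrative only.
import torch
import torch.nn.functional as F

def presence_loss(presence_logits: torch.Tensor, labels: torch.Tensor,
                  hard_negative_weight: float = 2.0) -> torch.Tensor:
    """presence_logits: (B,) one logit per (image, concept-prompt) pair.
    labels: (B,) 1.0 if the prompt names something in the image, else 0.0.
    Negative pairs (hard negatives such as a visually similar but wrong
    concept) are up-weighted so near-miss prompts get pushed apart."""
    weights = torch.where(labels > 0.5,
                          torch.ones_like(labels),
                          torch.full_like(labels, hard_negative_weight))
    return F.binary_cross_entropy_with_logits(presence_logits, labels,
                                              weight=weights)

loss = presence_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())
```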

Strengths

  • Text and exemplar prompts make segmentation usable without drawing precise boxes for every object.
  • Returning all matching instances is more useful than single-object click segmentation for scene parsing.
  • Video identity tracking helps with temporal review and annotation.
  • The model is broadly useful for label generation, dataset triage, and open-vocabulary mask extraction.
  • Official code and checkpoints make it a practical foundation component.
  • Hard negatives and a presence head directly address prompt confusion.

Failure Modes

  • It outputs 2D or video masks, not metric 3D boxes, velocity, or occupancy.
  • Text prompts can be under-specified; "cart" or "loader" may mean different airside objects.
  • It can still miss small, thin, distant, reflective, or heavily occluded items.
  • Video identities can drift under long occlusion or camera cuts.
  • Compute and memory requirements may be high for embedded deployment.
  • Licensing and checkpoint terms need review before commercial use.

Airside AV Fit

  • SAM 3 is highly useful for airside data annotation and open-vocabulary mask mining.
  • It can segment all instances of prompts such as "traffic cone", "baggage cart", or "tow bar" in images and video.
  • Exemplar prompts are useful when airport-specific objects lack stable public names.
  • Runtime use needs fusion with depth or LiDAR before masks become drivable-space constraints (see the back-projection sketch after this list).
  • It should be tested on night operations, glare, rain, aircraft reflections, and apron paint.
  • Best near-term fit is human-in-the-loop labeling, perception QA, and offline dataset expansion.
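
A minimal version of the mask-to-3D step, assuming a pinhole camera model and a registered metric depth map; the intrinsics (`fx`, `fy`, `cx`, `cy`) come from calibration. This is the simplest possible lift, not a production fusion pipeline.

```python
# Sketch: lift a 2D instance mask into camera-frame 3D points using a
# depth map and pinhole intrinsics. Illustrative only.
import numpy as np

def mask_to_points(mask: np.ndarray, depth_m: np.ndarray,
                   fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """mask: (H, W) bool; depth_m: (H, W) metric depth in metres.
    Returns (N, 3) points in the camera frame."""
    v, u = np.nonzero(mask & (depth_m > 0))  # valid masked pixels only
    z = depth_m[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```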

Implementation Notes

  • Build prompt templates and exemplar banks for airport equipment to reduce text ambiguity.
  • Save the exact prompt, model checkpoint, and post-processing threshold with each generated mask (a provenance sketch follows this list).
  • Use video mode for annotation consistency, but still audit identity switches.
  • Pair masks with camera calibration and depth to create 3D training labels.
  • Add negative prompts or hard-negative review sets for aircraft parts versus ground equipment.
  • Do not treat SAM 3 masks as safety-certified obstacle detections without independent validation.
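
One way to make the provenance bullet concrete: a small record stored alongside every generated mask, so any label can be traced back to the prompt, checkpoint, and threshold that produced it. Field names are illustrative assumptions.

```python
# Sketch of per-mask provenance metadata. Field names are illustrative.
import json
from dataclasses import dataclass, asdict

@dataclass
class MaskProvenance:
    source_frame: str         # image or video-frame identifier
    prompt_text: str | None   # e.g. "baggage cart"
    exemplar_ids: list[str]   # exemplar-bank entries used, if any
    checkpoint: str           # model checkpoint name or hash
    score_threshold: float    # post-processing threshold applied
    reviewer: str | None      # human-in-the-loop sign-off, if any

record = MaskProvenance("cam3/000142.png", "baggage cart", [],
                        "sam3_checkpoint_v1", 0.6, None)
print(json.dumps(asdict(record), indent=2))
```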

Sources

Notes compiled from public sources, including the SAM 3 paper, Meta's official announcement, and the official GitHub repository.