SAM 3

What It Is

  • SAM 3 is Meta's Segment Anything Model 3.
  • The paper title is "SAM 3: Segment Anything with Concepts".
  • It extends promptable segmentation from visual prompts to Promptable Concept Segmentation.
  • Concept prompts can be short noun phrases, image exemplars, or combinations of the two.
  • The model detects, segments, and tracks all matching object instances in images and videos.
  • It is a foundation segmentation model, not an autonomous-driving-specific detector.
  • Meta released code, checkpoints, examples, and the SA-Co benchmark.

Core Technical Idea

  • Decouple concept recognition from precise mask localization.
  • Use a shared backbone for image-level detection and memory-based video tracking.
  • Add a presence head to improve decisions about whether a queried concept exists in the scene (a minimal sketch follows this list).
  • Train with hard negatives so visually similar but wrong concepts are rejected.
  • Scale supervision with a data engine that annotates millions of open-vocabulary concepts.
  • Return masks and stable identities for all instances matching the prompt.
  • Unify image segmentation and video tracking under one promptable concept interface.
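
The presence-head idea can be made concrete with a minimal PyTorch sketch. Everything here is an assumption for illustration (module names, dimensions, and the multiplicative gating); it is not Meta's implementation, only the decoupling the list describes: one score for "is the concept in the image", separate scores for "which instances match".

```python
# Illustrative sketch of a presence-gated scoring head, assuming a
# DETR-style detector with per-query scores. Not Meta's code.
import torch
import torch.nn as nn

class PresenceGatedScoring(nn.Module):
    """Decouple 'is the concept in the image at all?' (presence) from
    'which detection queries match it?' (per-instance recognition)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.presence_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.query_head = nn.Linear(dim, 1)

    def forward(self, global_token: torch.Tensor, queries: torch.Tensor):
        # global_token: (B, dim) pooled image/prompt representation
        # queries:      (B, Q, dim) per-instance detection queries
        presence = torch.sigmoid(self.presence_head(global_token))       # (B, 1)
        per_query = torch.sigmoid(self.query_head(queries)).squeeze(-1)  # (B, Q)
        # One low presence score suppresses every instance at once
        # when the queried concept is absent from the scene.
        return presence * per_query

scores = PresenceGatedScoring()(torch.randn(2, 256), torch.randn(2, 100, 256))
print(scores.shape)  # torch.Size([2, 100])
```

The design point is that rejecting an absent concept is learned once, at the image level, rather than independently by every detection query.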

Inputs and Outputs

  • Inputs can be still images or videos.
  • Prompt inputs in the released code path include text concepts, image exemplars, points, boxes, and masks (see the usage sketch after this list).
  • Text prompts are short object or concept phrases rather than full task instructions.
  • Outputs are instance masks for all matching objects.
  • Video outputs include object identities tracked over time.
  • The model can also support classic visual segmentation prompts from the SAM lineage.
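
A hypothetical call pattern for the input/output contract above. The released repository's actual entry points, argument names, and result fields may differ; treat `segment_concepts`, `ConceptResult`, and every field here as an assumption.

```python
# Hypothetical usage sketch; not the released SAM 3 API.
from dataclasses import dataclass
import numpy as np

@dataclass
class ConceptResult:
    masks: np.ndarray        # (N, H, W) boolean instance masks
    scores: np.ndarray       # (N,) confidence per instance
    ids: np.ndarray | None   # (N,) track identities, video mode only

def segment_concepts(image: np.ndarray,
                     text: str | None = None,
                     exemplars: list[np.ndarray] | None = None,
                     threshold: float = 0.5) -> ConceptResult:
    """Placeholder for a concept-prompted call: a short noun phrase and/or
    image exemplars go in, masks for every matching instance come out."""
    raise NotImplementedError("wire this to the released SAM 3 checkpoint")

# Intended call pattern (exemplar prompts would be passed the same way):
# result = segment_concepts(frame, text="baggage cart", threshold=0.6)
# for mask, score in zip(result.masks, result.scores): ...
```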

Architecture and Evaluation Protocol

  • SAM 3 combines an image-level detector with a memory-based tracker (a toy identity sketch follows this list).
  • Recognition and localization are separated to reduce interference between concept matching and mask quality.
  • The presence head first decides whether the queried concept appears in the scene at all, so instance scores do not rest on mask confidence alone.
  • The SA-Co benchmark evaluates Promptable Concept Segmentation at large vocabulary scale.
  • The official GitHub describes SA-Co as containing 270K unique concepts.
  • The data engine is reported to annotate over 4M unique concepts.
  • The paper reports about a 2x gain over existing systems in image and video PCS.
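
A toy sketch of the identity bookkeeping a memory-based tracker performs: each frame's instance embeddings are greedily matched to stored track embeddings by cosine similarity. SAM 3's transformer memory tracker is substantially more involved; this only illustrates how instances keep stable identities across frames.

```python
# Toy identity-association sketch. Illustrative only; greedy matching,
# no one-to-one constraint, no occlusion handling.
import numpy as np

def associate(memory: dict[int, np.ndarray], embeddings: np.ndarray,
              sim_threshold: float = 0.7) -> list[int]:
    """Assign a track id to each new instance embedding, creating
    new ids for unmatched instances."""
    next_id = max(memory, default=-1) + 1
    assigned = []
    for emb in embeddings:
        best_id, best_sim = None, sim_threshold
        for tid, mem in memory.items():
            sim = float(emb @ mem) / (np.linalg.norm(emb) * np.linalg.norm(mem))
            if sim > best_sim:
                best_id, best_sim = tid, sim
        if best_id is None:
            best_id, next_id = next_id, next_id + 1
        memory[best_id] = emb  # refresh memory with the latest appearance
        assigned.append(best_id)
    return assigned

# Two frames of 4-dim embeddings: new tracks, then re-identification.
mem: dict[int, np.ndarray] = {}
print(associate(mem, np.eye(4)[:2]))         # [0, 1]
print(associate(mem, np.eye(4)[:2] + 0.01))  # [0, 1]
```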

Training and Evaluation

  • SAM 3 is trained with large-scale image and video segmentation data plus concept labels.
  • Hard negatives are included to improve fine-grained text discrimination (a loss-level sketch follows this list).
  • The released repository includes inference and finetuning code.
  • The official Meta page reports improvements over previous SAM capabilities on visual segmentation tasks.
  • The GitHub page reports 75 to 80 percent of human performance on SA-Co.
  • Evaluation includes image PCS, video PCS, and prior SAM-style promptable segmentation tasks.
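
One way hard-negative supervision could look at the loss level, assuming one presence logit per (image, concept-prompt) pair with a binary label. The weighting scheme and function names are illustrative assumptions, not the paper's recipe.

```python
# Sketch of hard-negative presence supervision. Illustrative only.
import torch
import torch.nn.functional as F

def presence_loss(presence_logits: torch.Tensor, labels: torch.Tensor,
                  hard_negative_weight: float = 2.0) -> torch.Tensor:
    """presence_logits: (B,) one logit per (image, concept-prompt) pair.
    labels: (B,) 1.0 if the prompt names something in the image, else 0.0.
    Negative pairs (hard negatives such as a visually similar but wrong
    concept) are up-weighted so near-miss prompts get pushed apart."""
    weights = torch.where(labels > 0.5,
                          torch.ones_like(labels),
                          torch.full_like(labels, hard_negative_weight))
    return F.binary_cross_entropy_with_logits(presence_logits, labels,
                                              weight=weights)

loss = presence_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())
```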

Strengths

  • Text and exemplar prompts make segmentation usable without drawing precise boxes for every object.
  • Returning all matching instances is more useful than single-object click segmentation for scene parsing.
  • Video identity tracking helps with temporal review and annotation.
  • The model is broadly useful for label generation, dataset triage, and open-vocabulary mask extraction.
  • Official code and checkpoints make it a practical foundation component.
  • Hard negatives and a presence head directly address prompt confusion.

Failure Modes

  • It outputs 2D or video masks, not metric 3D boxes, velocity, or occupancy.
  • Text prompts can be under-specified; "cart" or "loader" may mean different airside objects.
  • It can still miss small, thin, distant, reflective, or heavily occluded items.
  • Video identities can drift under long occlusion or camera cuts.
  • Compute and memory requirements may be high for embedded deployment.
  • Licensing and checkpoint terms need review before commercial use.

Airside AV Fit

  • SAM 3 is highly useful for airside data annotation and open-vocabulary mask mining.
  • It can segment all instances of prompts such as "traffic cone", "baggage cart", or "tow bar" in images and video.
  • Exemplar prompts are useful when airport-specific objects lack stable public names.
  • Runtime use needs fusion with depth or LiDAR before masks become drivable-space constraints (see the back-projection sketch after this list).
  • It should be tested on night operations, glare, rain, aircraft reflections, and apron paint.
  • Best near-term fit is human-in-the-loop labeling, perception QA, and offline dataset expansion.
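
A minimal version of the mask-to-3D step, assuming a pinhole camera model and a registered metric depth map; the intrinsics (`fx`, `fy`, `cx`, `cy`) come from calibration. This is the simplest possible lift, not a production fusion pipeline.

```python
# Sketch: lift a 2D instance mask into camera-frame 3D points using a
# depth map and pinhole intrinsics. Illustrative only.
import numpy as np

def mask_to_points(mask: np.ndarray, depth_m: np.ndarray,
                   fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """mask: (H, W) bool; depth_m: (H, W) metric depth in metres.
    Returns (N, 3) points in the camera frame."""
    v, u = np.nonzero(mask & (depth_m > 0))  # valid masked pixels only
    z = depth_m[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```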

Implementation Notes

  • Build prompt templates and exemplar banks for airport equipment to reduce text ambiguity.
  • Save the exact prompt, model checkpoint, and post-processing threshold with each generated mask (a provenance sketch follows this list).
  • Use video mode for annotation consistency, but still audit identity switches.
  • Pair masks with camera calibration and depth to create 3D training labels.
  • Add negative prompts or hard-negative review sets for aircraft parts versus ground equipment.
  • Do not treat SAM 3 masks as safety-certified obstacle detections without independent validation.
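
One way to make the provenance bullet concrete: a small record stored alongside every generated mask, so any label can be traced back to the prompt, checkpoint, and threshold that produced it. Field names are illustrative assumptions.

```python
# Sketch of per-mask provenance metadata. Field names are illustrative.
import json
from dataclasses import dataclass, asdict

@dataclass
class MaskProvenance:
    source_frame: str         # image or video-frame identifier
    prompt_text: str | None   # e.g. "baggage cart"
    exemplar_ids: list[str]   # exemplar-bank entries used, if any
    checkpoint: str           # model checkpoint name or hash
    score_threshold: float    # post-processing threshold applied
    reviewer: str | None      # human-in-the-loop sign-off, if any

record = MaskProvenance("cam3/000142.png", "baggage cart", [],
                        "sam3_checkpoint_v1", 0.6, None)
print(json.dumps(asdict(record), indent=2))
```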

Sources

Notes compiled from public sources, including the SAM 3 paper, Meta's official announcement, and the official GitHub repository.