Clipomaly
What It Is
- Clipomaly is a language-guided open-world anomaly segmentation method for autonomous driving.
- The paper title is "Language-Guided Open-World Anomaly Segmentation".
- It is a CLIP-based zero-shot method that segments unknown objects and assigns human-readable names.
- It does not require anomaly-specific training data.
- It targets the gap between anomaly segmentation, which localizes unknown regions without naming them, and open-vocabulary segmentation, which requires the vocabulary to be fixed in advance.
- Clipomaly dynamically extends the inference vocabulary when it discovers anomalies.
- It is camera-image segmentation, not 3D detection.
Core Technical Idea
- Use CLIP's shared image-text embedding space to detect regions that do not match known driving classes.
- Generate candidate names for unknown regions with either a dictionary strategy or an image tagging model such as RAM (Recognize Anything Model).
- Extend the segmentation vocabulary at inference time with those candidate names.
- Run an open-vocabulary segmentation model over the known classes plus the new candidate unknown labels.
- Produce both fine-grained semantic labels and a binary known/unknown anomaly mask.
- Avoid retraining when a new anomaly name is introduced.
- Use language to make anomalies interpretable rather than only marking them as "unknown".
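The first step above, flagging regions that match no known class, can be sketched with plain numpy. This is a minimal illustration, not the paper's exact scoring: `unknown_mask`, the threshold `tau`, and the toy embeddings are all assumptions standing in for real CLIP region and text embeddings.

```python
import numpy as np

def unknown_mask(region_embs: np.ndarray, known_text_embs: np.ndarray,
                 tau: float = 0.25) -> np.ndarray:
    """Flag regions whose best CLIP similarity to any known class falls below tau.

    region_embs: (R, D) L2-normalized region image embeddings.
    known_text_embs: (K, D) L2-normalized text embeddings of the known classes.
    Returns a boolean (R,) array: True = candidate unknown region.
    """
    sims = region_embs @ known_text_embs.T   # (R, K) cosine similarities
    best_known = sims.max(axis=1)            # best match among known classes
    return best_known < tau

# Toy example: 3 regions, 2 known classes, D=4.
rng = np.random.default_rng(0)
regions = rng.normal(size=(3, 4))
regions /= np.linalg.norm(regions, axis=1, keepdims=True)
known = regions[:2].copy()                   # first two regions coincide with known classes
mask = unknown_mask(regions, known, tau=0.9)
print(mask)                                  # first two entries are False (known)
```

In the real pipeline the threshold would be calibrated on validation data; a fixed `tau` is only for illustration.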
Inputs and Outputs
- Inputs are RGB driving images and a known-class vocabulary such as Cityscapes classes.
- Optional inputs include a candidate word dictionary or RAM-generated tags.
- The pipeline uses CLIP-region matching to score candidate unknown labels.
- Outputs include semantic segmentation for known classes.
- Outputs also include anomaly masks for unknown objects.
- Unknown regions are assigned labels such as object names or descriptive candidate terms.
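The output split described above can be expressed as a small sketch: once the segmentation map is indexed over the extended vocabulary, the binary anomaly mask is just the set of pixels assigned to appended labels. The class lists and `split_outputs` helper here are illustrative, not from the paper.

```python
import numpy as np

KNOWN_CLASSES = ["road", "car", "person"]             # illustrative known vocabulary
EXTENDED = KNOWN_CLASSES + ["tire", "cardboard box"]  # candidate unknown labels appended

def split_outputs(semantic_map: np.ndarray, n_known: int) -> np.ndarray:
    """Derive the binary anomaly mask: pixels labeled with an appended (unknown) class."""
    return semantic_map >= n_known

sem = np.array([[0, 1, 3],
                [0, 4, 1]])                           # per-pixel label ids
mask = split_outputs(sem, len(KNOWN_CLASSES))
names = [EXTENDED[i] for i in np.unique(sem[mask])]
print(mask)
print(names)   # ['tire', 'cardboard box']
```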
Architecture or Evaluation Protocol
- The method has three main stages: unknown mask prediction, anomaly naming, and open-vocabulary segmentation.
- Unknown mask prediction uses dense CLIP image-text similarity relative to known classes.
- Candidate naming can use a lightweight dictionary preselection strategy.
- A RAM-based variant uses an image tagging model for richer candidate words.
- The extended vocabulary is passed to an open-vocabulary segmentation backbone.
- Post-processing separates pixels assigned to known labels from pixels assigned to newly added unknown labels.
- Evaluation covers anomaly segmentation and open-world segmentation settings.
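The dictionary preselection stage can be sketched as ranking dictionary words by CLIP similarity to the unknown region's embedding and keeping the top-k as candidate labels. The scoring below is a simplified stand-in under assumed L2-normalized embeddings; the word list and 2-D vectors are toy values.

```python
import numpy as np

def preselect_candidates(region_emb, dict_text_embs, dict_words, k=3):
    """Rank dictionary words by cosine similarity to an unknown region's
    embedding and return the top-k as candidate anomaly labels."""
    sims = dict_text_embs @ region_emb            # (W,) cosine scores
    top = np.argsort(sims)[::-1][:k]              # indices of the k best words
    return [dict_words[i] for i in top]

words = ["tire", "boar", "crate", "cone"]
# Toy 2-D unit vectors; the region embedding points closest to "tire".
embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071], [-1.0, 0.0]])
region = np.array([0.9806, 0.1961])
print(preselect_candidates(region, embs, words, k=2))  # ['tire', 'crate']
```

The selected words would then be appended to the known-class list before the open-vocabulary segmentation pass.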
Training and Evaluation
- Clipomaly is described as zero-shot with no anomaly-specific training data.
- It is evaluated on the RoadAnomaly and SegmentMeIfYouCan (SMIYC) AnomalyTrack benchmarks.
- The paper also discusses open-world settings with Cityscapes and BDD-Anomaly.
- Reported RoadAnomaly results include 57.8 mIoU and 84.74 AUPR for the RAM plus CLIP-Best variant.
- Reported SMIYC results include 75.1 mIoU and 94.74 AUPR for the same variant.
- A dictionary variant is lighter and still competitive, with reported RoadAnomaly mIoU above prior methods.
Strengths
- Adds semantic names to anomalies, which helps triage and downstream reasoning.
- Zero-shot operation makes it useful before domain-specific anomaly labels exist.
- Dynamic vocabulary extension is better aligned with open-world deployment than fixed prompt lists.
- The method can reuse existing open-vocabulary segmentation backbones.
- It is directly motivated by autonomous driving anomaly segmentation benchmarks.
- The RAM and dictionary variants provide a tradeoff between semantic richness and compute.
Failure Modes
- Candidate labels can be wrong, too generic, or operationally misleading.
- CLIP similarity can conflate visually similar objects that pose very different hazards.
- RAM-based tagging is computationally heavier and may not fit embedded runtime budgets.
- Dictionary-based naming can miss airport-specific or local terminology.
- The method segments in 2D and does not estimate object depth, velocity, or 3D envelope.
- Known-class segmentation quality can degrade if the extended vocabulary introduces confusing terms.
Airside AV Fit
- Clipomaly is relevant for camera anomaly alerts on ramps, service roads, and stand areas.
- Semantic anomaly labels could help remote operators distinguish debris, animals, equipment, and unusual vehicles.
- The dynamic vocabulary concept is attractive for airports because object taxonomies vary by operator and geography.
- It needs an airside candidate dictionary and evaluation set before use on apron video.
- Outputs should feed a 3D localization layer before vehicle planning reacts to them.
- It is best suited for perception monitoring, review, and data mining rather than sole safety perception.
Implementation Notes
- Use an airport-specific known-class list and keep candidate anomaly labels separate from approved production classes.
- Log the candidate-name source, such as RAM or dictionary, for each anomaly.
- Stress-test for false positives on markings, shadows, aircraft parts, reflections, and wet pavement.
- Pair 2D anomaly masks with LiDAR, stereo, or monocular depth before assigning hazard zones.
- Tune vocabulary extension conservatively so common known objects are not relabeled as unknowns.
- Build a review loop to promote repeated anomaly names into the training taxonomy.
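The logging and review-loop notes above can be combined into one small sketch: record each anomaly with its candidate-name source, then surface names that recur often enough for human review before promotion. The record shape, `min_count` threshold, and example names are all assumptions.

```python
from collections import Counter

# Hypothetical detection log: each anomaly keeps its candidate-name source
# ("ram" or "dictionary") so false-positive patterns can be traced per source.
detections = [
    {"name": "pallet", "source": "ram"},
    {"name": "pallet", "source": "dictionary"},
    {"name": "pallet", "source": "ram"},
    {"name": "tumbleweed", "source": "ram"},
]

def promotion_candidates(dets, min_count=3):
    """Return anomaly names seen at least min_count times, as candidates
    for human review and possible promotion into the training taxonomy."""
    counts = Counter(d["name"] for d in dets)
    return sorted(name for name, c in counts.items() if c >= min_count)

print(promotion_candidates(detections))  # ['pallet']
```

Promotion should remain a human decision; this only ranks what reviewers see first.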
Sources
- arXiv paper: https://arxiv.org/abs/2512.01427
- Paper PDF via arXiv: https://arxiv.org/pdf/2512.01427