Clipomaly

What It Is

  • Clipomaly is a language-guided open-world anomaly segmentation method for autonomous driving.
  • The paper title is "Language-Guided Open-World Anomaly Segmentation".
  • It is a CLIP-based zero-shot method that segments unknown objects and assigns human-readable names.
  • It does not require anomaly-specific training data.
  • It targets the gap between anomaly segmentation, which localizes unknown regions without naming them, and open-vocabulary segmentation, which requires its vocabulary to be specified in advance.
  • Clipomaly dynamically extends the inference vocabulary when it discovers anomalies.
  • It is camera-image segmentation, not 3D detection.

Core Technical Idea

  • Use CLIP's shared image-text embedding space to detect regions that do not match known driving classes.
  • Generate candidate names for unknown regions with either a dictionary strategy or an image tagging model such as RAM.
  • Extend the segmentation vocabulary at inference time with those candidate names.
  • Run an open-vocabulary segmentation model over the known classes plus the new candidate unknown labels.
  • Produce both fine-grained semantic labels and a binary known/unknown anomaly mask.
  • Avoid retraining when a new anomaly name is introduced.
  • Use language to make anomalies interpretable rather than only marking them as "unknown".
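The first step above can be sketched numerically: score each region embedding against the known-class text embeddings in a shared space, and flag regions whose best match is weak. This is a minimal illustration with toy 4-D vectors standing in for CLIP's real embeddings; the threshold value and function names are our own, not the paper's.

```python
import numpy as np

def max_known_similarity(region_emb, known_text_embs):
    """Cosine similarity of a region embedding to its best-matching known class."""
    r = region_emb / np.linalg.norm(region_emb)
    t = known_text_embs / np.linalg.norm(known_text_embs, axis=1, keepdims=True)
    return float(np.max(t @ r))

def flag_unknown(region_emb, known_text_embs, threshold=0.5):
    """A region is flagged as unknown when no known-class text matches it well."""
    return max_known_similarity(region_emb, known_text_embs) < threshold

# Toy 4-D embeddings standing in for CLIP's higher-dimensional space.
known = np.array([[1.0, 0.0, 0.0, 0.0],   # e.g. "road"
                  [0.0, 1.0, 0.0, 0.0]])  # e.g. "car"
road_like = np.array([0.9, 0.1, 0.0, 0.0])   # close to "road"
odd_object = np.array([0.1, 0.1, 0.9, 0.3])  # matches no known class

print(flag_unknown(road_like, known))   # False
print(flag_unknown(odd_object, known))  # True
```

In the real method this comparison runs densely over image features rather than per toy vector, but the decision rule is the same shape: low similarity to every known class marks a region as a candidate anomaly.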

Inputs and Outputs

  • Inputs are RGB driving images and a known-class vocabulary such as Cityscapes classes.
  • Optional inputs include a candidate word dictionary or RAM-generated tags.
  • The pipeline uses CLIP region-text matching to score candidate unknown labels.
  • Outputs include semantic segmentation for known classes.
  • Outputs also include anomaly masks for unknown objects.
  • Unknown regions are assigned labels such as object names or descriptive candidate terms.
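The input/output contract above can be captured in a small container type. This is purely illustrative; the field names and structure are our own convention, not an API from the paper.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ClipomalyOutput:
    """Illustrative result container; field names are ours, not the paper's."""
    semantic_map: np.ndarray   # per-pixel known-class index
    anomaly_mask: np.ndarray   # boolean known/unknown mask
    unknown_labels: dict = field(default_factory=dict)  # region id -> candidate name

out = ClipomalyOutput(
    semantic_map=np.zeros((2, 2), dtype=int),
    anomaly_mask=np.array([[False, True], [False, False]]),
    unknown_labels={1: "lost cargo"},
)
print(out.anomaly_mask.sum())  # 1
```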

Architecture or Evaluation Protocol

  • The method has three main stages: unknown mask prediction, anomaly naming, and open-vocabulary segmentation.
  • Unknown mask prediction uses dense CLIP image-text similarity relative to known classes.
  • Candidate naming can use a lightweight dictionary preselection strategy.
  • A RAM-based variant uses an image tagging model for richer candidate words.
  • The extended vocabulary is passed to an open-vocabulary segmentation backbone.
  • Post-processing separates pixels assigned to known labels from pixels assigned to newly added unknown labels.
  • Evaluation covers anomaly segmentation and open-world segmentation settings.
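The three stages can be composed as a single pipeline function. The stubs below are trivial keyword matchers standing in for the real models (CLIP-based mask prediction, RAM or dictionary naming, an open-vocabulary segmenter); only the control flow reflects the described architecture.

```python
def clipomaly_pipeline(image, known_vocab, predict_unknown_mask,
                       name_anomalies, segment_open_vocab):
    """Three-stage sketch: the callables stand in for real models."""
    unknown_mask = predict_unknown_mask(image, known_vocab)   # stage 1
    candidates = name_anomalies(image, unknown_mask)          # stage 2
    vocab = list(known_vocab) + candidates                    # dynamic extension
    semantic_map = segment_open_vocab(image, vocab)           # stage 3
    # Post-processing: pixels given an extended-vocabulary label are anomalies.
    anomaly = {p: lbl for p, lbl in semantic_map.items() if lbl in candidates}
    return semantic_map, anomaly

# Toy "image": pixel id -> textual content, so the stubs can keyword-match.
image = {"p0": "asphalt", "p1": "tyre-shaped debris"}
known = ["road", "car"]

def stub_mask(img, vocab):       # stage 1 stub: flags the odd pixel
    return {p for p, v in img.items() if "debris" in v}

def stub_namer(img, mask):       # stage 2 stub: proposes one candidate word
    return ["debris"] if mask else []

def stub_segmenter(img, vocab):  # stage 3 stub: naive keyword assignment
    return {p: ("debris" if "debris" in v else "road") for p, v in img.items()}

semantic, anomaly = clipomaly_pipeline(image, known, stub_mask,
                                       stub_namer, stub_segmenter)
print(anomaly)  # {'p1': 'debris'}
```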

Training and Evaluation

  • Clipomaly is described as zero-shot with no anomaly-specific training data.
  • It is evaluated on RoadAnomaly and Segment-Me-If-You-Can AnomalyTrack.
  • The paper also discusses open-world settings with Cityscapes and BDD-Anomaly.
  • Reported RoadAnomaly results include 57.8 mIoU and 84.74 AUPR for the RAM plus CLIP-Best variant.
  • Reported SMIYC results include 75.1 mIoU and 94.74 AUPR for the same variant.
  • A dictionary variant is lighter and still competitive, with reported RoadAnomaly mIoU above prior methods.
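AUPR, the metric reported above, summarizes the precision-recall curve over per-pixel anomaly scores. A minimal average-precision computation (one common way AUPR is estimated; benchmark implementations may interpolate differently):

```python
import numpy as np

def average_precision(scores, labels):
    """Mean precision at each true positive, ranked by descending score."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    return float(precision[labels == 1].mean())

# Toy per-pixel anomaly scores: higher should mean "more anomalous".
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]  # ground-truth anomalous pixels
print(average_precision(scores, labels))  # (1 + 2/3) / 2 = 0.8333...
```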

Strengths

  • Adds semantic names to anomalies, which helps triage and downstream reasoning.
  • Zero-shot operation makes it useful before domain-specific anomaly labels exist.
  • Dynamic vocabulary extension is better aligned with open-world deployment than fixed prompt lists.
  • The method can reuse existing open-vocabulary segmentation backbones.
  • It is directly motivated by autonomous driving anomaly segmentation benchmarks.
  • The RAM and dictionary variants provide a tradeoff between semantic richness and compute.

Failure Modes

  • Candidate labels can be wrong, too generic, or operationally misleading.
  • CLIP similarity can confuse visually similar objects with different hazards.
  • RAM-based tagging is computationally heavier and may not fit embedded runtime budgets.
  • Dictionary-based naming can miss airport-specific or local terminology.
  • The method segments in 2D and does not estimate object depth, velocity, or 3D envelope.
  • Known-class segmentation quality can degrade if the extended vocabulary introduces confusing terms.

Airside AV Fit

  • Clipomaly is relevant for camera anomaly alerts on ramps, service roads, and stand areas.
  • Semantic anomaly labels could help remote operators distinguish debris, animals, equipment, and unusual vehicles.
  • The dynamic vocabulary concept is attractive for airports because object taxonomies vary by operator and geography.
  • It needs an airside candidate dictionary and evaluation set before use on apron video.
  • Outputs should feed a 3D localization layer before vehicle planning reacts to them.
  • It is best suited for perception monitoring, review, and data mining rather than sole safety perception.
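The 3D localization step mentioned above could be as simple as back-projecting an anomaly-mask centroid through a pinhole camera model once a depth estimate exists. The intrinsics and pixel values here are hypothetical; a real deployment would use calibrated parameters and fused depth.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of a mask pixel to a camera-frame 3D point (metres)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical intrinsics; anomaly centroid at the principal point, 12 m away.
p = backproject(640, 360, 12.0, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
print(p)  # [ 0.  0. 12.] -- straight ahead of the camera
```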

Implementation Notes

  • Use an airport-specific known-class list and keep candidate anomaly labels separate from approved production classes.
  • Log the candidate-name source, such as RAM or dictionary, for each anomaly.
  • Validate false positives on markings, shadows, aircraft parts, reflections, and wet pavement.
  • Pair 2D anomaly masks with LiDAR, stereo, or monocular depth before assigning hazard zones.
  • Tune vocabulary extension conservatively so common known objects are not relabeled as unknowns.
  • Build a review loop to promote repeated anomaly names into the training taxonomy.
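The review loop in the last bullet can start as a simple frequency gate: anomaly names that recur across logged detections get queued for human review and possible promotion into the training taxonomy. The threshold and policy below are illustrative assumptions, not from the paper.

```python
from collections import Counter

def promote_candidates(observed_names, min_count=3):
    """Return anomaly names seen at least min_count times, for human review.

    min_count is an illustrative threshold; tune per site and traffic volume.
    """
    counts = Counter(observed_names)
    return sorted(name for name, n in counts.items() if n >= min_count)

# Toy log of candidate names attached to anomaly detections over a shift.
log = ["pallet", "bird", "pallet", "pallet", "cone", "bird"]
print(promote_candidates(log))  # ['pallet']
```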

Sources

Notes compiled from publicly available sources.