
Open-World OOD And Anomaly Segmentation Benchmarks

Last updated: 2026-05-09

Open-world anomaly segmentation benchmarks test whether a semantic perception model can localize objects or regions outside its closed training taxonomy. This matters for autonomy because high in-distribution segmentation accuracy does not prove that the model will detect an unknown object on the drivable surface, a dropped load, an unusual apron object, or a novel obstruction.


Scope

Benchmark | Scope | Best use
SegmentMeIfYouCan | Real-world road anomaly and road obstacle segmentation with public leaderboard and benchmark suite | Open-world image anomaly and obstacle localization.
RoadAnomaly21 | Pixel-wise anomaly segmentation where unknown categories can appear anywhere in the image | Strict semantic unknown detection.
RoadObstacle21 | Road-obstacle segmentation focused on objects on the drivable area, known or unknown | Drivable-surface hazard localization.
Fishyscapes | Anomaly detection for semantic segmentation, including real captured Lost and Found style data and changing web anomalies | Legacy/open-world comparison and uncertainty method screening.

This page is about anomaly and OOD segmentation. It is not a weather robustness page and it is not a substitute for dedicated FOD datasets.


What They Measure

Measurement question | Why it matters
Can the model localize a previously unseen object category? | Closed-set segmentation often assigns unknown objects to known classes with high confidence.
Can it detect small obstacles on the road surface? | Small, low-profile hazards can be safety-critical even if their semantic class is unknown.
Does the anomaly score form coherent components? | Pixel-wise metrics can reward noisy masks that are poor for planning.
Does performance survive dataset and scene diversity? | Overfitting to a small set of anomalies is a known risk.
Does the system abstain or flag uncertainty instead of hallucinating a known class? | Runtime assurance needs a usable unknown/uncertain signal.

SegmentMeIfYouCan explicitly separates anomalous object segmentation from road obstacle segmentation. That distinction is useful for AV and airside validation: an unknown object anywhere in the image is not the same hazard as an unknown object occupying the intended path.


Sensors And Labels

Dataset | Sensors | Labels
RoadAnomaly21 | RGB images | Pixel-wise anomaly, non-anomaly, and void labels; hidden test labels for leaderboard evaluation.
RoadObstacle21 | RGB images | Pixel-wise obstacle, non-obstacle, and void labels focused on the road region of interest.
Fishyscapes Lost and Found style data | RGB imagery captured with a Cityscapes-like setup | Dense anomaly annotations for image-based semantic segmentation evaluation.
Fishyscapes web anomalies | RGB composites or changing web-sourced anomalies, depending on benchmark mode | Anomaly masks designed to test open-world generalization.

These are image segmentation benchmarks. They are not LiDAR, radar, or multimodal fusion datasets.


Metrics And Tasks

Metric family | Use
AUROC | Measures ranking quality between anomaly and normal pixels, but can hide poor localization under class imbalance.
AUPRC / AP | More informative for rare anomaly pixels; report anomaly-pixel AP separately.
FPR at high TPR | Useful for operational thresholds where missed hazards are costly.
Pixel-wise IoU / F1 | Measure segmentation overlap once a threshold is chosen.
Component-wise metrics | Measure object-level localization quality and reduce size bias.
Leaderboard score | Useful for external comparison, but preserve raw metric tables for safety review.

For safety cases, choose an operating threshold before final test evaluation. Post-hoc threshold tuning on hidden or target-domain test data weakens the evidence.
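
As a concrete illustration, the sketch below computes AUROC, anomaly-pixel AP, and FPR at a target TPR from per-pixel scores, and records the threshold reached at that TPR so it can be fixed on a validation split before the final test run. It assumes scikit-learn is available and that void pixels have already been removed; the function name and the 0.95 TPR target are illustrative choices, not values mandated by any of these benchmarks.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve


def pixel_anomaly_metrics(scores: np.ndarray, labels: np.ndarray,
                          tpr_target: float = 0.95) -> dict:
    """Threshold-free ranking metrics plus an FPR-at-TPR operating point.

    scores: per-pixel anomaly scores (higher = more anomalous).
    labels: per-pixel ground truth, 1 = anomaly, 0 = in-distribution,
            with void pixels already removed.
    """
    scores = scores.ravel()
    labels = labels.ravel()

    auroc = roc_auc_score(labels, scores)          # ranking quality, threshold-free
    ap = average_precision_score(labels, scores)   # more informative for rare anomaly pixels

    # First ROC point whose TPR reaches the target; its FPR and threshold
    # define a candidate operating point.
    fpr, tpr, thresholds = roc_curve(labels, scores)
    idx = min(np.searchsorted(tpr, tpr_target, side="left"), len(thresholds) - 1)
    return {"auroc": float(auroc), "ap": float(ap),
            "fpr_at_tpr": float(fpr[idx]), "threshold_at_tpr": float(thresholds[idx])}


# Fix the operating threshold on a held-out validation split, then reuse it
# unchanged on the final test data so the evidence is not tuned post hoc.
# val = pixel_anomaly_metrics(val_scores, val_labels)
# operating_threshold = val["threshold_at_tpr"]
```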


Strengths

  • Directly targets the closed-set failure mode of semantic segmentation.
  • SegmentMeIfYouCan provides both anomaly and road-obstacle tasks.
  • RoadAnomaly21 and RoadObstacle21 emphasize real images and diverse scenes rather than only synthetic anomalies.
  • Public leaderboards reduce the risk of unverifiable local-only comparisons.
  • Component-wise metrics are more relevant to planning than pixel-only summaries.
  • Fishyscapes remains a useful comparison point for anomaly and uncertainty methods.

Gaps And Risks

  • RGB-only benchmarks do not test LiDAR/radar confirmation of unknown obstacles.
  • Road-domain anomalies do not cover aircraft servicing equipment, FOD, jet-bridge structures, cones, chocks, or baggage on aprons.
  • Unknown-object definitions are tied to source training taxonomies such as Cityscapes; airside unknowns need a domain-specific taxonomy.
  • Pixel-level anomaly detection can produce masks that are hard to convert into stable tracks.
  • Very small debris can be below the spatial scale represented in road anomaly leaderboards.
  • A high anomaly score is not a class label; downstream planning still needs localization, persistence, and risk classification.

AV And Airside Fit

Use case | Fit | Notes
Detecting unknown road obstacles | Strong | RoadObstacle21 is directly aligned with drivable-surface hazard detection.
Semantic OOD screening | Strong | RoadAnomaly21 and Fishyscapes expose overconfident closed-set models.
Airport apron unknown-object detection | Moderate proxy | Useful for method selection, but target-domain apron data is mandatory.
FOD detection | Weak to moderate | Use only as an OOD pre-screen; FOD requires small-object datasets and physical validation.
Runtime assurance | Moderate | Anomaly scores can feed monitors, but must be calibrated and tracked.

For airside validation, use these benchmarks to select anomaly segmentation methods, then build an airport-specific unknown-object suite with objects such as loose straps, tools, chocks, cones in unusual positions, baggage, temporary signs, and maintenance equipment.


Implementation And Evaluation Notes

  1. Evaluate the normal semantic segmentation output and the anomaly head together. A model that detects anomalies by degrading normal segmentation may not be usable.
  2. Report threshold-free metrics and thresholded operating-point metrics.
  3. Add component-wise analysis for small, medium, and large anomaly regions.
  4. Convert anomaly masks into obstacle hypotheses and test track persistence before using the score in planning (a connected-component sketch follows this list).
  5. For airside transfer, create separate classes for "known airside object," "unknown but obstacle," "unknown non-obstacle," and "void/ambiguous" (an illustrative label map also follows this list).
  6. Include negative controls such as unusual textures, shadows, markings, reflections, and wet pavement so the anomaly detector does not flag every domain shift as an obstacle.
  7. Do not use public benchmark results as final safety evidence unless the same thresholding, post-processing, and runtime monitor path is used in the product stack.
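
The following sketch illustrates notes 3 and 4 under stated assumptions: it thresholds an anomaly map, extracts connected components with SciPy, filters speckle, and emits per-component hypotheses with size buckets and bounding boxes. The bucket edges, minimum area, and dictionary fields are illustrative choices, not values defined by SegmentMeIfYouCan or Fishyscapes.

```python
import numpy as np
from scipy import ndimage


def extract_obstacle_hypotheses(anomaly_scores: np.ndarray, threshold: float,
                                min_area_px: int = 50) -> list[dict]:
    """Turn a per-pixel anomaly map into per-component obstacle hypotheses."""
    binary = anomaly_scores >= threshold
    labeled, n_components = ndimage.label(binary)   # default 4-connectivity for 2D inputs

    hypotheses = []
    for component_id in range(1, n_components + 1):
        mask = labeled == component_id
        area = int(mask.sum())
        if area < min_area_px:                      # drop speckle that would destabilize tracking
            continue
        ys, xs = np.nonzero(mask)
        hypotheses.append({
            "component_id": component_id,
            "area_px": area,
            "bbox_xyxy": (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())),
            # Illustrative size buckets for the small/medium/large analysis in note 3.
            "size_bucket": "small" if area < 500 else "medium" if area < 5000 else "large",
            "mean_score": float(anomaly_scores[mask].mean()),
        })
    return hypotheses


# Downstream, per-frame hypotheses would be associated across frames (for
# example by bounding-box IoU) to test persistence before planning consumes them.
```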
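For note 5, a minimal label-space sketch might look like the following; every class name, ID, and mapping entry is hypothetical and would need to be defined against the actual airside taxonomy.

```python
from enum import IntEnum


class AirsideEvalClass(IntEnum):
    """Hypothetical evaluation taxonomy for airside transfer (note 5)."""
    KNOWN_AIRSIDE_OBJECT = 0     # inside the deployed model's training taxonomy
    UNKNOWN_OBSTACLE = 1         # outside the taxonomy and occupying the intended path
    UNKNOWN_NON_OBSTACLE = 2     # outside the taxonomy, not a path hazard
    VOID_AMBIGUOUS = 255         # excluded from metric computation


# Example remap from an assumed source label set into the evaluation taxonomy;
# every source label should resolve to exactly one evaluation class.
SOURCE_TO_EVAL = {
    "baggage_cart": AirsideEvalClass.KNOWN_AIRSIDE_OBJECT,
    "loose_strap": AirsideEvalClass.UNKNOWN_OBSTACLE,
    "surface_marking": AirsideEvalClass.UNKNOWN_NON_OBSTACLE,
    "motion_blur_region": AirsideEvalClass.VOID_AMBIGUOUS,
}
```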

Sources

Compiled from publicly available research notes.