STU 3D LiDAR Anomaly Segmentation Dataset
Last updated: 2026-05-09
Spotting the Unexpected (STU) is a CVPR 2025 dataset for 3D road-anomaly segmentation. It matters because most open-world driving benchmarks are image-first, while operational autonomous vehicles need a 3D hazard signal that can feed freespace estimation, tracking, and planning.
Related pages: open-world OOD and anomaly segmentation benchmarks, moving/static separation and MOS datasets, lidar semantic segmentation, uncertainty calibration, production perception systems
Scope
| Item | Description |
|---|---|
| Core task | Point-level and object-level anomaly segmentation in 3D LiDAR point clouds. |
| Domain | Public-road driving with staged and naturalistic road anomalies. |
| Modalities | 128-beam LiDAR plus eight synchronized RGB cameras in a surround-view setup. |
| Label style | Dense point labels for inlier, anomaly, and unlabeled/questionable points; anomaly instance labels for object-level evaluation. |
| Temporal support | Sequential data, although the paper's reported baseline experiments focus on single scans. |
| Closest training context | SemanticKITTI and Panoptic-CUDAL-style road LiDAR semantics. |
STU is not a generic open-set semantic segmentation dataset. It is built around anomalies that should not overlap with the in-distribution training taxonomy, so it is closer to a safety monitor benchmark than to a "discover more classes" benchmark.
Dataset And Task Definition
STU asks whether a model can separate unexpected hazardous objects from normal road-scene points when the objects are not part of the training classes. The paper defines anomalies as objects that can endanger the vehicle and passengers, especially objects on the driving surface that are absent from the training data.
The benchmark includes both out-of-distribution anomaly sequences and additional inlier-only sequences. This matters because an anomaly method must preserve normal inlier segmentation while detecting rare objects; otherwise, a model can inflate anomaly recall by flagging too much of the scene.
Key task variants:
| Task | Output | Use |
|---|---|---|
| Point-level anomaly scoring | An anomaly score or binary anomaly label per LiDAR point | Threshold-free screening, calibration, and small-object analysis. |
| Object-level anomaly segmentation | Coherent anomaly instances from point predictions | Planning and operator review, where scattered point alarms are weak evidence. |
| Inlier panoptic segmentation | Semantic/panoptic labels for normal road classes | Verifies that anomaly detection does not destroy base perception. |
| Temporal/multimodal follow-on | Use sequential LiDAR and camera views | Future work for long-range confirmation and multimodal anomaly reasoning. |
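To make the point-level variant concrete, here is a minimal scoring sketch. Nothing in it is mandated by STU: negative max-logit and softmax entropy are common anomaly-scoring baselines, and the function names and the assumption of a model that emits per-point class logits are illustrative.

```python
import numpy as np

def maxlogit_anomaly_scores(logits: np.ndarray) -> np.ndarray:
    """Per-point anomaly score from class logits.

    logits: (N, C) per-point logits over the C inlier classes.
    Returns (N,) scores where higher means more anomalous. Negative
    max-logit is one common baseline, not an STU requirement.
    """
    return -np.max(logits, axis=1)

def entropy_anomaly_scores(logits: np.ndarray) -> np.ndarray:
    """Alternative score: entropy of the per-point softmax distribution."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)
```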
Sensors And Labels
| Field | Details |
|---|---|
| Vehicle rig | Rigid vehicle-mounted frame. |
| Cameras | Eight hardware-triggered synchronized cameras. |
| LiDAR | High-resolution 128-beam LiDAR. |
| Calibration | Camera-LiDAR calibration repeated per camera; setup follows the Panoptic-CUDAL configuration. |
| Collection modes | Naturalistic road collection plus controlled low-speed staged object placement. |
| Post-processing | LiDAR odometry with KISS-ICP, KITTI-format pose export, image anonymization for faces and license plates. |
| Annotation tool | SemanticKITTI labeler workflow with pseudo-label initialization. |
| Labels | Inlier, anomaly/outlier, and unlabeled/questionable points, plus anomaly instance masks. |
The CVPR paper's comparison table reports 51 test sequences across six streets, eight RGB cameras, 128-beam LiDAR, sequential (temporal) data, and 8,022 test / 1,960 validation samples with anomaly labels. The dataset is designed so that anomaly objects do not overlap with common in-distribution classes.
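Because poses are exported in KITTI format (one pose per line as 12 floats, the row-major 3x4 [R | t] matrix mapping the scan frame into the first-scan frame), a minimal loader is straightforward. The helper names and the scan-accumulation example below are illustrative, not part of the STU tooling.

```python
import numpy as np

def load_kitti_poses(path: str) -> np.ndarray:
    """Load KITTI-odometry-format poses as (T, 4, 4) homogeneous matrices."""
    rows = np.loadtxt(path).reshape(-1, 3, 4)       # T lines x 12 floats
    poses = np.tile(np.eye(4), (rows.shape[0], 1, 1))
    poses[:, :3, :] = rows
    return poses

def to_world(points: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Move one (N, 3) scan into the world frame for sequence accumulation."""
    return points @ pose[:3, :3].T + pose[:3, 3]
```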
Metrics
| Metric | Level | Interpretation |
|---|---|---|
| AP | Point | Threshold-free anomaly ranking; sensitive to rare anomaly points. |
| AUROC | Point | Ranking separation between anomaly and normal points; can look high under class imbalance. |
| FPR95 | Point | False-positive rate at 95 percent anomaly true-positive rate; useful for stress-testing a fixed operating threshold. |
| PQ | Object/panoptic | Penalizes false positives and poor mask quality; closer to usable object hypotheses. |
| UQ | Object/open-set | Emphasizes recall of unknown objects without penalizing false positives as strongly as PQ. |
| Inlier PQ | Panoptic | Checks whether normal semantic/panoptic perception remains intact. |
The benchmark evaluates points within 50 m and objects with at least five LiDAR points. That is a practical choice for road anomaly work, but it also means sub-five-point objects at long range remain an explicit residual risk for AV and airside transfer.
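A minimal sketch of the point-level metrics and of fixing an operating threshold on validation scores before a test run. It assumes binary labels with unlabeled/questionable points already removed and uses scikit-learn; the function names and the range-mask helper are illustrative, while the 95 percent TPR convention follows the FPR95 definition above.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def eval_range_mask(points: np.ndarray, max_range: float = 50.0) -> np.ndarray:
    """Benchmark-style filter: keep points within 50 m of the sensor."""
    return np.linalg.norm(points[:, :3], axis=1) <= max_range

def point_level_metrics(labels: np.ndarray, scores: np.ndarray) -> dict:
    """labels: (N,) with 1 = anomaly, 0 = inlier; higher score = more anomalous."""
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = min(int(np.searchsorted(tpr, 0.95)), len(tpr) - 1)
    return {
        "AP": float(average_precision_score(labels, scores)),
        "AUROC": float(roc_auc_score(labels, scores)),
        "FPR95": float(fpr[idx]),  # FPR at the first threshold reaching 95% TPR
    }

def pick_operating_threshold(val_labels, val_scores, target_tpr: float = 0.95) -> float:
    """Fix the operating point on validation data, before the test run."""
    fpr, tpr, thr = roc_curve(val_labels, val_scores)
    idx = min(int(np.searchsorted(tpr, target_tpr)), len(thr) - 1)
    return float(thr[idx])
```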
Failure Modes Exposed
- 3D anomaly points are extremely sparse relative to a full scan, so point-level AP can be low even when the scene looks easy to a human.
- Small or distant anomaly objects may contain only a few points and can be swallowed by ground, curb, or vehicle classes.
- Large unusual objects in familiar contexts can be confidently predicted as known classes such as pedestrian or other-vehicle.
- Ground-plane removal and clustering heuristics can miss low-profile or sparse hazards before the anomaly model sees them.
- A model can have good SemanticKITTI inlier performance but still be overconfident on unknown objects.
- Single-scan baselines underuse the temporal and camera data that STU makes available.
- Point labels do not by themselves prove that the resulting obstacle hypothesis is stable enough for tracking or planning.
AV, Indoor, Outdoor, And Airside Relevance
| Environment | Fit | Notes |
|---|---|---|
| Public-road AV | Strong | Directly targets road debris and unexpected road objects in 3D. |
| Outdoor campus / industrial autonomy | Moderate | Useful for dropped objects and temporary hazards, but taxonomy and surfaces differ. |
| Indoor robots | Weak to moderate | The anomaly task transfers conceptually, but the sensor range, clutter, and object scale are different. |
| Airport apron / taxiway | Moderate proxy | The task resembles FOD, chocks, loose straps, and dropped tools, but airport surfaces and object types require target data. |
| Runtime assurance | Strong research proxy | Anomaly scores can feed monitors when calibrated with false-positive and persistence rules. |
For airside autonomy, STU is a better proxy than RGB-only anomaly benchmarks because LiDAR points can be converted into a ground-plane hazard estimate (sketched below). It is still not an airside acceptance dataset: it lacks aircraft, GSE, pavement markings, jet bridges, apron lighting, FOD taxonomies, and airport operating procedures.
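As a sketch of the "calibrated monitor" idea, the toy class below projects already-flagged anomaly points onto a bird's-eye-view grid and alarms only on cells that persist across scans. The grid extent, resolution, and k-of-n persistence parameters are illustrative assumptions, not STU or airside requirements.

```python
import numpy as np
from collections import deque

class GroundPlaneHazardMonitor:
    """Toy runtime monitor: BEV occupancy of anomaly points plus a
    k-of-n persistence rule before raising an alarm."""

    def __init__(self, extent: float = 50.0, res: float = 0.5, k: int = 3, n: int = 5):
        self.extent, self.res, self.k = extent, res, k
        self.size = int(2 * extent / res)       # ego-centered square grid
        self.history = deque(maxlen=n)

    def update(self, anomaly_points: np.ndarray) -> np.ndarray:
        """anomaly_points: (M, 3) points already flagged anomalous.
        Returns a boolean (size, size) grid of persistent hazard cells."""
        grid = np.zeros((self.size, self.size), dtype=bool)
        ij = ((anomaly_points[:, :2] + self.extent) / self.res).astype(int)
        ok = ((ij >= 0) & (ij < self.size)).all(axis=1)
        grid[ij[ok, 0], ij[ok, 1]] = True
        self.history.append(grid)
        counts = np.stack(list(self.history)).sum(axis=0)
        return counts >= min(self.k, len(self.history))
```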
Validation And Data-Engine Use
- Use STU to benchmark whether the LiDAR stack has any credible unknown-object signal before collecting airport data.
- Report both threshold-free and operating-point metrics; fix the operating threshold on validation data before the target-domain test run (as in the metrics sketch above).
- Convert anomaly points into object hypotheses and score track persistence, ground contact, and planner handoff (see the clustering sketch after this list).
- Slice results by range, anomaly point count, object size, surface type, and whether the object lies in the intended path.
- Preserve removed/ignored points so a denoising or ground-removal stage does not silently erase the anomaly.
- Mine false positives on curbs, signs, vegetation, shadows projected into LiDAR-camera fusion, and normal road hardware.
- For airside transfer, recreate the protocol with chocks, cones, tie-down straps, bolts, tools, luggage fragments, plastic wrap, and low dollies.
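A minimal sketch of the point-to-object and range-slicing steps from the list above, assuming per-point anomaly scores and a threshold fixed on validation data. The DBSCAN settings and range-bucket edges are illustrative; the five-point minimum mirrors the benchmark's object filter.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def anomaly_objects(points, scores, threshold, eps=0.5, min_points=5):
    """Cluster thresholded anomaly points into object hypotheses.

    points: (N, 3) LiDAR points; scores: (N,) anomaly scores;
    threshold: operating point fixed on validation data.
    Returns a list of (M_i, 3) arrays, one per hypothesis.
    """
    flagged = points[scores >= threshold]
    if len(flagged) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_points).fit_predict(flagged)
    return [flagged[labels == c] for c in set(labels) if c != -1]

def slice_by_range(objects, edges=(0.0, 10.0, 25.0, 50.0)):
    """Bucket object hypotheses by centroid range for per-slice metrics."""
    buckets = {f"{lo:g}-{hi:g}m": [] for lo, hi in zip(edges, edges[1:])}
    for obj in objects:
        r = float(np.linalg.norm(obj.mean(axis=0)[:2]))
        for lo, hi in zip(edges, edges[1:]):
            if lo <= r < hi:
                buckets[f"{lo:g}-{hi:g}m"].append(obj)
    return buckets
```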