Sensor Corruption Robustness Benchmarks
Last updated: 2026-05-09
Sensor corruption benchmarks evaluate how perception models behave when inputs are degraded by sensor faults, synchronization errors, calibration drift, point loss, motion blur, cross-talk, incomplete echoes, and adverse conditions. This page focuses on robustness benchmark design rather than weather dataset coverage, which is handled separately in the weather robustness dataset index.
Scope
| Benchmark family | Core question | Best use |
|---|---|---|
| Robo3D | How robust are LiDAR semantic segmentation, panoptic segmentation, and 3D detection models to common point-cloud corruptions? | LiDAR-only corruption screening and cross-dataset stress tests. |
| MultiCorrupt | How robust are LiDAR-camera 3D object detectors to camera, LiDAR, weather, and alignment corruptions on nuScenes-style data? | Fusion robustness, missing-modality tests, and calibration/synchronization sensitivity. |
| MSC-Bench | How robust are multimodal 3D detection and HD map construction models under multi-sensor corruptions? | End-to-end BEV/fusion perception stress tests, including map-construction degradation. |
These benchmarks are useful because clean validation accuracy hides brittleness. A model with high clean NDS, mAP, or mIoU may fail sharply when a camera frame is missing, a LiDAR beam group drops out, or camera-LiDAR calibration shifts.
What They Measure
| Failure mode | Example corruptions | Validation signal |
|---|---|---|
| Environmental corruption | Fog, snow, wet ground | Robustness to image and point-cloud degradation. |
| Camera degradation | Missing camera, brightness, darkness, motion blur, frame loss | Sensitivity to camera availability and image quality. |
| LiDAR degradation | Beam reducing, point reducing, beam missing, cross-talk, incomplete echo | Robustness to sparse or physically corrupted point clouds. |
| Fusion misalignment | Spatial misalignment, temporal misalignment, cross-sensor perturbation | Dependence on calibration and synchronization assumptions. |
| Multi-corruption interaction | Sensor crash with echo/talk/cross-sensor combinations | Whether models fail under combined faults, not only isolated corruptions. |
These benchmarks do not target "weather robustness" alone; their main value is controlled perturbation across sensor and fusion failure modes.
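A controlled perturbation can be as simple as a parameterized transform applied to the clean input. The sketch below is an illustrative (not benchmark-exact) analogue of the "beam missing" LiDAR corruption family: whole beam groups are dropped, with a hypothetical severity-to-fraction mapping chosen here for demonstration.

```python
import numpy as np

def drop_beams(points: np.ndarray, beam_ids: np.ndarray,
               severity: int, n_beams: int = 32,
               seed: int = 0) -> np.ndarray:
    """Rough analogue of a 'beam missing' corruption: remove whole
    LiDAR beam groups. The severity-to-fraction mapping below is
    illustrative, not taken from any specific benchmark."""
    frac = {1: 0.25, 2: 0.4, 3: 0.5, 4: 0.65, 5: 0.8}[severity]
    rng = np.random.default_rng(seed)
    # Pick which beams to drop, then keep only points on surviving beams.
    dropped = rng.choice(n_beams, size=int(frac * n_beams), replace=False)
    keep = ~np.isin(beam_ids, dropped)
    return points[keep]
```

Fixing the seed and severity mapping is what makes the perturbation repeatable enough to serve as a regression test.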
Sensors And Labels
| Benchmark | Modalities | Label/task basis |
|---|---|---|
| Robo3D | LiDAR point clouds from established datasets such as SemanticKITTI, nuScenes, KITTI, and Waymo Open Dataset variants | Semantic segmentation mIoU, panoptic segmentation, and 3D detection metrics depending on task. |
| MultiCorrupt | LiDAR-camera fusion data derived from nuScenes-style 3D detection workflows | nuScenes detection score and mAP for multimodal 3D object detection. |
| MSC-Bench | Camera and LiDAR inputs for 3D object detection and HD map construction | NDS for 3D detection and mAP/AP-style map construction metrics. |
Labels generally come from the underlying clean datasets. The corruption benchmark changes the input stream while preserving the evaluation target, which allows clean-to-corrupt degradation to be measured directly.
Metrics And Tasks
| Metric | Meaning |
|---|---|
| Clean task metric | mIoU, PQ, NDS, mAP, or AP on the uncorrupted validation set. |
| Corrupt task metric | Same task metric under corruption type and severity. |
| mCE | Mean corruption error relative to a baseline; lower is better. |
| mRR | Mean resilience rate relative to clean performance; higher is better. |
| RA / mRA | Resistance ability across corruption types and severities. |
| RRA / mRRA | Relative resistance ability compared with a baseline model. |
| RS / RRS | Resilience score and relative resilience score in MSC-Bench-style reporting. |
For safety validation, always retain the raw per-corruption table. Aggregate robustness metrics can hide a single unacceptable failure mode such as missing pedestrians under temporal misalignment or losing map dividers under camera frame loss.
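The two most common aggregates can be sketched in a few lines. This is a simplified reading of the definitions in the table above (mCE normalizes per-corruption error by a baseline model; mRR averages corrupt-over-clean score ratios); exact benchmark formulas may differ in weighting, so treat this as illustrative.

```python
def mce(model_err: dict, baseline_err: dict) -> float:
    """Mean corruption error: per-corruption error summed over severities,
    normalized by a baseline model's error on the same corruption, then
    averaged across corruptions. Lower is better; 1.0 matches the baseline.
    Both args: {corruption_name: [error per severity]}."""
    ces = [sum(model_err[c]) / sum(baseline_err[c]) for c in model_err]
    return sum(ces) / len(ces)

def mrr(clean_score: float, corrupt_scores: dict) -> float:
    """Mean resilience rate: corrupt score as a fraction of clean score,
    averaged over every corruption/severity cell. Higher is better."""
    flat = [s for scores in corrupt_scores.values() for s in scores]
    return sum(flat) / (len(flat) * clean_score)
```

Note that both aggregates average over the per-corruption table, which is exactly why that raw table must be retained alongside them.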
Benchmark Strengths
- Provides repeatable perturbations with severity levels, enabling regression tests.
- Separates clean accuracy from robustness under foreseeable sensor degradation.
- Covers calibration and synchronization errors that many weather datasets do not test.
- MultiCorrupt and MSC-Bench expose multimodal fusion assumptions rather than LiDAR-only behavior.
- Robo3D supports LiDAR semantic segmentation and detection-style screening across multiple base datasets.
- Metrics such as mCE, mRR, RA, and RRA make robustness comparable across models.
Gaps And Risks
- Corruptions are approximations; they are not a substitute for target-domain sensor fault injection and recorded failures.
- Weather corruptions do not capture all airport phenomena such as de-icing mist, jet blast dust, heated exhaust shimmer, glass reflections, and apron lighting glare.
- Most benchmarks preserve labels from clean datasets, so they may not expose label uncertainty introduced by actual degradation.
- Public-road categories do not cover aircraft, GSE, jet bridges, stand equipment, cones, chocks, or FOD.
- A model can improve aggregate robustness by being conservative while still failing safety-critical rare classes.
AV And Airside Fit
| Airside validation need | Fit | How to use |
|---|---|---|
| Missing camera or sensor outage | Strong | Use missing-camera, camera crash, frame loss, and cross-sensor slices before airport-specific fault injection. |
| LiDAR point loss and beam faults | Strong | Use beam/point reducing, beam missing, cross-talk, and incomplete echo as pre-airside stress tests. |
| Calibration drift | Strong | Use spatial and temporal misalignment slices to set monitor thresholds. |
| Apron semantic taxonomy | Weak | Requires target-domain labels. |
| FOD perception | Weak proxy | Small debris needs dedicated FOD datasets and physical validation. |
| HD map construction degradation | Moderate to strong | MSC-Bench is useful where BEV map construction is part of the perception stack. |
For airport autonomy, treat these as bench tests that must precede apron scenario tests. They are good at exposing model brittleness early, but they cannot close the safety claim without airside recordings and sensor-specific fault-injection evidence.
Implementation And Evaluation Notes
- Evaluate clean, single-corruption, and selected multi-corruption cases. Do not skip clean results: resilience metrics are normalized by clean performance.
- Pin corruption severity definitions and benchmark commit IDs in each experiment record.
- Report class-level metrics for safety-critical classes rather than only aggregate NDS, mAP, or mIoU.
- Keep sensor health monitor outputs in the log. A robust perception score is not enough if the system fails to recognize it is degraded.
- Use calibration and timestamp perturbations to set operational thresholds for camera-LiDAR alignment monitors.
- Add an "airside transfer" row to every benchmark report: which corruption maps to which airport hazard, and which hazards remain untested.
- Run the same model checkpoint across all corruption suites where possible; otherwise separate architecture gains from robustness claims.
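The class-level reporting and regression-test notes above can be combined into a simple gate: fail the run if any safety-critical class drops below its per-corruption floor, independent of the aggregate score. A minimal sketch with hypothetical corruption, class, and threshold names:

```python
def robustness_gate(per_class: dict, floors: dict) -> list:
    """Return every (key, score, floor) triple where a safety-critical
    class falls below its per-corruption floor. An empty list means the
    gate passes. per_class and floors are keyed by
    (corruption_name, class_name); both keys here are illustrative."""
    return [
        (key, score, floors[key])
        for key, score in per_class.items()
        if key in floors and score < floors[key]
    ]
```

Wiring this into CI alongside pinned severity definitions and benchmark commit IDs turns the corruption suite into a true regression test rather than a one-off report.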