S2R-Bench
What It Is
- S2R-Bench is a sim-to-real evaluation benchmark for autonomous driving perception.
- The full title is "S2R-Bench: A Sim-to-Real Evaluation Benchmark for Autonomous Driving."
- It focuses on real-world sensor anomaly data and compares it with simulated anomaly data.
- The benchmark is published through Scientific Data and has an official GitHub repository.
- It is framed as a corruption robustness benchmark grounded in real scenarios rather than simulation only.
- The dataset targets camera, LiDAR, and 4D radar perception robustness.
Core Technical Idea
- Existing robustness benchmarks often rely on simulated corruptions.
- S2R-Bench collects real sensor anomaly data under varied roads, weather, lighting, and time periods.
- It then compares real anomalies with simulated data to judge how well simulation transfers.
- The benchmark makes robustness evaluation more realistic by anchoring corruptions to captured sensor behavior.
- It supports evaluation on both real anomaly data and simulated (generated) anomaly data.
- It emphasizes the gap between clean benchmarks and safety-critical adverse conditions.
- The "sim-to-real" framing asks whether simulated corruption is a trustworthy proxy for real sensor failure.
Inputs and Outputs
- Inputs include high-resolution camera images, an 80-line LiDAR, and two 4D radars mounted on the collection vehicle.
- The repository documents image data in PNG format and point cloud data in BIN format.
- Calibration files include camera intrinsics and sensor-to-camera transformations.
- Label files use a KITTI-like structure with class, truncation, occlusion, 2D box, 3D dimensions, location, rotation, score, and track ID.
- Outputs are benchmark datasets, model checkpoints, and evaluation results for object detection.
- The dataset includes car, pedestrian, cyclist, and other traffic classes.
- It is not a fusion architecture; it is an evaluation dataset and toolkit.
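The formats above can be read with a few lines of NumPy. The sketch below is a minimal loader, assuming a KITTI-style float32 `.bin` layout with four features per point and a label field order matching the README's list (class, truncation, occlusion, 2D box, dimensions, location, rotation, score, track ID); verify both assumptions against the released files and repository docs, since the radar `.bin` layout in particular may differ.

```python
import numpy as np

def load_bin_pointcloud(path, num_features=4):
    """Load a point cloud from a KITTI-style .bin file.

    The per-point feature count (x, y, z, intensity assumed here) should
    be checked against the S2R-Bench docs; 4D radar files may carry
    additional channels such as Doppler velocity.
    """
    return np.fromfile(path, dtype=np.float32).reshape(-1, num_features)

def parse_label_line(line):
    """Parse one KITTI-like label line into a dict.

    Field indices follow the order listed in the S2R-Bench README;
    they are an assumption until verified against the released labels.
    """
    f = line.split()
    return {
        "class": f[0],
        "truncation": float(f[1]),
        "occlusion": int(f[2]),
        "bbox_2d": [float(v) for v in f[3:7]],     # left, top, right, bottom (px)
        "dimensions": [float(v) for v in f[7:10]], # height, width, length (m)
        "location": [float(v) for v in f[10:13]],  # x, y, z in camera frame (m)
        "rotation_y": float(f[13]),
        "score": float(f[14]),
        "track_id": f[15],
    }
```

A loader like this keeps the data pipeline reusable across the clean, real-anomaly, and simulated-anomaly splits, since all three share the same file formats.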
Architecture or Benchmark Protocol
- Data collection was conducted in Beijing in December 2023.
- The README reports roughly 700 km of covered roads.
- Scenarios include city roads, suburban roads, motorways, tunnels, towns, villages, communities, campuses, parking, and roundabouts.
- The repository describes real and simulated anomaly data for light snow, moderate snow, fog, and brightness changes.
- Simulation tools can add adverse weather or noise to LiDAR, 4D radar, and image data.
- The model zoo references separate evaluation tracks (training data types) for the benchmarked detectors.
- Visualization and benchmarking scripts are available in the official repository.
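To make the simulation step concrete, here is a generic sketch of two corruption injectors in the spirit of the benchmark's brightness and weather anomalies. This is not the repository's actual simulation code: real weather simulators model range-dependent attenuation and backscatter, while this sketch uses simple brightness scaling and uniform point dropout for illustration.

```python
import numpy as np

def corrupt_brightness(img, factor=1.5):
    """Scale image brightness; factor > 1 brightens, < 1 darkens.

    A generic stand-in for the brightness anomaly, applied to a
    uint8 HxWxC image array.
    """
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def corrupt_lidar_dropout(points, drop_prob=0.3, seed=None):
    """Randomly drop LiDAR returns to mimic snow/fog attenuation.

    Illustrative only: uniform dropout ignores the range dependence
    that physically grounded weather simulators capture.
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(len(points)) >= drop_prob
    return points[keep]
```

Even toy injectors like these are useful for wiring up an evaluation harness before swapping in the repository's physically grounded simulation tools.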
Training and Evaluation
- The repository reports 9981 labeled frames in its data statistics.
- Car, pedestrian, and cyclist dominate the labels, with approximate shares of 53%, 20%, and 19%, respectively, per the README.
- Most annotated objects are within the near-to-mid range of the ego vehicle.
- The model zoo includes PointPillars, SMOKE, and Focals-Conv examples.
- Evaluations compare performance on real anomaly data and simulated anomaly data.
- The Scientific Data article states that data are publicly released through Figshare repositories.
- The GitHub repository provides visualization, setup, and benchmark code.
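The real-versus-simulated comparison reduces to a per-condition gap metric. The helper below is a minimal sketch, assuming AP results keyed by condition name; the condition names and values in the usage example are hypothetical, not numbers from the benchmark.

```python
def sim_to_real_gap(ap_real, ap_sim):
    """Per-condition gap between AP on real and simulated anomalies.

    A positive gap means the model scores higher on real anomalies than
    on their simulated counterparts; a large magnitude in either
    direction suggests the simulated corruption is a poor proxy for the
    real sensor failure.
    """
    return {cond: round(ap_real[cond] - ap_sim[cond], 4)
            for cond in ap_real if cond in ap_sim}
```

Reporting this gap per condition, alongside the raw AP values, is what lets the benchmark answer its central sim-to-real question rather than a single aggregate number.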
Strengths
- Real anomaly data makes it more informative than synthetic-only corruption benchmarks.
- 4D radar inclusion is valuable for adverse weather perception research.
- Multi-road and day-night collection increases diversity.
- KITTI-like labels and calibration files make it familiar to many perception teams.
- The sim-to-real comparison is directly useful for deciding whether synthetic weather tests are enough.
- It can expose methods that look robust in simulation but fail on real sensor artifacts.
Failure Modes
- The collection geography is Beijing roads, not airports or industrial aprons.
- The dataset is still finite and may not include rare airport-specific anomalies.
- Public access and repository instructions should be checked because data links and agreements can change.
- Class distribution is road-centric and omits aircraft, ground support equipment, cones, chocks, and tow bars.
- Radar hardware and mounting may differ from an airside vehicle.
- Evaluation results may depend on the exact split between real, clean, and simulated data.
- Real anomalies are harder to parameterize into clean severity levels than synthetic corruptions.
Airside AV Fit
- S2R-Bench is valuable as evidence that real sensor anomalies can diverge from simulated corruptions.
- Its multi-modal camera, LiDAR, and 4D radar setup resembles robust airside sensor-suite goals.
- Its benchmark logic is worth adapting for airport data collection: real fog, rain, spray, darkness, glare, and snow.
- Airside transfer requires new labels for aircraft, GSE, workers, safety cones, dollies, tow bars, and stand markings.
- It is especially relevant for validating whether synthetic de-icing spray or fog actually predicts field performance.
- Use it as a benchmark-design reference rather than a complete airside validation dataset.
Implementation Notes
- Verify dataset access through the Nature article and official repository before planning experiments.
- Keep real and simulated anomaly splits separate in reports.
- Do not tune exclusively on simulated anomalies if the goal is field robustness.
- For airside work, collect paired real and simulated anomaly sets so the sim-to-real gap can be measured.
- Preserve calibration files and sensor timestamps because failure analysis often depends on alignment.
- Report per-condition metrics, not only aggregate AP, so ODD policies can use the results.
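The point about preserving calibration comes down to cross-sensor alignment. The sketch below projects LiDAR points into the image with a 4x4 sensor-to-camera transform and a 3x3 intrinsic matrix, the two quantities the calibration files are said to provide; the exact file layout in S2R-Bench should still be checked against the repository docs.

```python
import numpy as np

def project_lidar_to_image(points_xyz, lidar_to_cam, intrinsics):
    """Project N x 3 LiDAR points into pixel coordinates.

    lidar_to_cam: 4x4 homogeneous sensor-to-camera transform.
    intrinsics:   3x3 camera matrix K.
    Returns pixel coords for points in front of the camera, plus the
    boolean mask of which input points were kept.
    """
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])  # N x 4 homogeneous
    cam = (lidar_to_cam @ homo.T).T[:, :3]           # N x 3 in camera frame
    in_front = cam[:, 2] > 0                         # drop points behind camera
    uv = (intrinsics @ cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]                      # perspective divide
    return uv, in_front
```

If timestamps or extrinsics are lost, a failure case can no longer be traced to a specific sensor (for example, LiDAR returns that project onto a camera region blinded by glare), which is why the notes above insist on preserving them.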
Sources
- Scientific Data article: https://www.nature.com/articles/s41597-025-06255-3
- arXiv paper: https://arxiv.org/abs/2505.18631
- Official repository: https://github.com/liwang-thu/S2R-Bench