Event-Camera VIO and SLAM

Related docs: OpenVINS, VINS-Mono / VINS-Fusion, SVO, LSD-SLAM / DSO, factor graphs and iSAM2, event and thermal camera models, visible cameras, and robust multi-sensor localization.

Executive Summary

Event-camera VIO/SLAM uses asynchronous brightness-change events rather than ordinary frame images as the main visual signal. The attraction is clear: event cameras have high temporal resolution, high dynamic range, low latency, and resistance to motion blur. Those properties directly target failure cases of frame-based VIO: aggressive motion, glare, night lighting transitions, and high-speed robotics.

Representative systems include early EVIO, ESVIO, PL-EVIO, EVI-SAM, and ESVO2. The field has two broad families:

  • Feature/factor graph systems: track event corners, line features, and optional frame-camera features, then optimize IMU preintegration plus visual residuals.
  • Direct event systems: align event representations such as time surfaces, adaptive accumulations, and event images, often with stereo depth and IMU priors.

For AVs, event-camera SLAM is promising as a high-dynamic-range, low-latency supplement. It is not yet a drop-in replacement for camera/LiDAR/radar localization because event data is motion-dependent, sparse when the scene is static, harder to calibrate, and less supported by production perception tooling.

Problem Fit

Event-camera VIO/SLAM fits:

  • High-speed motion where frame cameras blur.
  • High dynamic range scenes with headlights, sun glare, shadows, tunnel exits, or floodlights.
  • Low-latency stabilization for drones, small robots, and aggressive maneuvers.
  • Scenarios where edge structure is strong and static.
  • Stereo event rigs where depth can be recovered without ordinary frames.

It is weaker for:

  • Slow or stopped vehicles, because events vanish without brightness change.
  • Textureless scenes with weak edges.
  • Scenes dominated by independently moving objects.
  • Standard automotive localization stacks that require mature tooling, maps, and certification evidence.

Sensor Model

An event camera emits events:

text
e_i = (u_i, v_i, t_i, p_i)

where (u_i, v_i) is the pixel location, t_i the timestamp, and p_i ∈ {+1, -1} the polarity. An event fires when the log intensity at a pixel changes by the contrast threshold C since the previous event at that pixel (Δt_i earlier):

text
L(u, v, t_i) - L(u, v, t_i - Δt_i) = p_i C
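
A minimal sketch of this trigger rule, assuming an idealized sensor with a single symmetric contrast threshold C, at most one event per threshold crossing, and a per-pixel memory of the last log intensity (function and variable names are illustrative, not a driver API):

python
import numpy as np

def generate_events(log_frames, timestamps, C=0.2):
    """Idealized event generation from a stack of log-intensity frames.

    log_frames: (N, H, W) log intensities L(u, v, t_k)
    timestamps: (N,) frame times t_k
    C:          contrast threshold
    Returns a list of events (u, v, t, p) with p in {+1, -1}.
    """
    events = []
    last_L = log_frames[0].copy()              # per-pixel reference log intensity
    for L, t in zip(log_frames[1:], timestamps[1:]):
        diff = L - last_L
        fired = np.abs(diff) >= C              # pixels whose change crossed the threshold
        vs, us = np.nonzero(fired)
        for u, v in zip(us, vs):
            events.append((u, v, t, 1 if diff[v, u] > 0 else -1))
            last_L[v, u] = L[v, u]             # reset the reference only at fired pixels
    return events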

VIO systems may use:

  • Monocular event camera + IMU.
  • Stereo event cameras + IMU.
  • Event camera + standard frame camera + IMU.
  • Optional depth maps, TSDF fusion, or dense event mapping.

The estimator state usually resembles visual-inertial SLAM:

text
x = [R, p, v, b_g, b_a, T_event_imu, T_frame_event, camera_intrinsics]

with landmarks, line features, depth maps, or event-map states depending on the method.
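
As an illustration only, such a state might be held in a simple container like the one below; field names and defaults are hypothetical, and real systems keep additional clone states, feature states, and covariance or information blocks:

python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class EventVioState:
    R_wb: np.ndarray = field(default_factory=lambda: np.eye(3))            # body orientation in the world frame
    p_wb: np.ndarray = field(default_factory=lambda: np.zeros(3))          # position
    v_wb: np.ndarray = field(default_factory=lambda: np.zeros(3))          # velocity
    b_g: np.ndarray = field(default_factory=lambda: np.zeros(3))           # gyroscope bias
    b_a: np.ndarray = field(default_factory=lambda: np.zeros(3))           # accelerometer bias
    T_event_imu: np.ndarray = field(default_factory=lambda: np.eye(4))     # event-camera-to-IMU extrinsics
    T_frame_event: np.ndarray = field(default_factory=lambda: np.eye(4))   # frame-to-event-camera extrinsics
    intrinsics: np.ndarray = field(default_factory=lambda: np.zeros(4))    # fx, fy, cx, cy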

Pipeline

  1. Event preprocessing

    • Remove hot pixels and filter background-activity/refractory noise.
    • Build time surfaces, event frames, adaptive accumulation maps, or feature tracks (a minimal time-surface sketch follows this list).
    • Optionally synchronize event packets with frame images.
  2. IMU propagation

    • Preintegrate IMU between keyframes or event windows.
    • Provide motion priors for fast tracking and weakly observable axes.
  3. Front-end tracking

    • Feature methods detect event corners/lines and track point/line measurements.
    • Direct methods align event representations or time surfaces.
    • Stereo methods estimate depth through temporal/static stereo matching.
  4. Back-end optimization/filtering

    • Fuse IMU residuals with event reprojection, line, direct alignment, or photometric-like residuals.
    • Maintain a sliding window, keyframes, local map, or dense TSDF.
  5. Mapping

    • Sparse systems keep landmarks or line features.
    • ESVO-style systems build semi-dense depth maps.
    • EVI-SAM reconstructs dense colored maps using image-guided event mapping and TSDF fusion.
  6. Diagnostics

    • Track event rate, feature distribution, line support, IMU residuals, bias stability, and event/frame timing.
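
To make step 1 concrete, here is a minimal sketch of an exponentially decaying time surface built from one event packet; the decay constant, packet format, and resolution are illustrative assumptions rather than any particular system's defaults:

python
import numpy as np

def time_surface(events, t_now, shape, tau=0.03):
    """Exponentially decaying time surface from a packet of events.

    events: iterable of (u, v, t, p) with t in seconds
    t_now:  reference time at which the surface is evaluated
    shape:  (H, W) sensor resolution
    tau:    decay constant in seconds (tuning assumption)
    """
    last_t = np.full(shape, -np.inf)           # most recent event time per pixel
    for u, v, t, p in events:
        if t <= t_now:
            last_t[v, u] = max(last_t[v, u], t)
    S = np.exp((last_t - t_now) / tau)         # recent events -> values near 1, old -> near 0
    S[np.isinf(last_t)] = 0.0                  # pixels that never fired
    return S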

Mathematical Mechanics

Feature-based event VIO uses point and line residuals:

text
X* = arg min_X
    sum_imu    || r_imu(x_i, x_j) ||^2
  + sum_point  rho(|| z_uv - pi(T_CW p_W) ||^2)
  + sum_line   rho(|| r_line(T_CW, L_W, z_line) ||^2)
  + sum_prior  || r_prior ||^2

PL-EVIO is representative: event point and line residuals, frame-image point residuals, and IMU preintegration are tightly coupled in a keyframe graph.
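
As a hedged illustration of the point term above, one reprojection residual and its Huber weight (as used in iteratively reweighted least squares) could look like the following; the pinhole model and flat parameterization are simplifications for the sketch, not the PL-EVIO implementation:

python
import numpy as np

def point_residual(z_uv, p_W, T_CW, fx, fy, cx, cy):
    """r = z_uv - pi(T_CW * p_W) for one tracked event-corner landmark."""
    p_C = T_CW[:3, :3] @ p_W + T_CW[:3, 3]     # landmark expressed in the camera frame
    u = fx * p_C[0] / p_C[2] + cx              # pinhole projection
    v = fy * p_C[1] / p_C[2] + cy
    return z_uv - np.array([u, v])

def huber_weight(r, delta=1.0):
    """IRLS weight realizing the robust kernel rho on the residual norm."""
    n = np.linalg.norm(r)
    return 1.0 if n <= delta else delta / n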

Direct event methods avoid explicit feature matching, which becomes brittle under large appearance or baseline changes. They instead align event representations:

text
r_direct = E_ref(u) - E_cur( warp(u, T, depth) )

where E may be a time surface or adaptive accumulation image. ESVO2 improves earlier ESVO-style direct stereo event odometry by using adaptive accumulation, contour-point sampling, static/temporal stereo fusion, IMU preintegration, and a tightly coupled back end for velocity and IMU bias.
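
An illustrative per-pixel version of r_direct, assuming a pinhole model, a known depth for the reference pixel, and nearest-neighbour sampling for brevity (a real implementation would typically interpolate); names are placeholders, not the ESVO2 code:

python
import numpy as np

def direct_residual(E_ref, E_cur, u, v, depth, T, K):
    """r = E_ref(u) - E_cur(warp(u, T, depth)) for one reference pixel (u, v)."""
    K_inv = np.linalg.inv(K)
    p_ref = depth * (K_inv @ np.array([u, v, 1.0]))   # back-project to 3D in the reference frame
    p_cur = T[:3, :3] @ p_ref + T[:3, 3]              # transform into the current frame
    uv_cur = K @ (p_cur / p_cur[2])                   # project into the current representation
    ui, vi = int(round(uv_cur[0])), int(round(uv_cur[1]))
    if not (0 <= ui < E_cur.shape[1] and 0 <= vi < E_cur.shape[0]):
        return None                                   # warped point leaves the image
    return E_ref[v, u] - E_cur[vi, ui]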

EVI-SAM combines feature matching and direct alignment:

text
X* = arg min_X
    sum_imu       || r_imu ||^2
  + sum_reproj    rho(|| r_event_reproj ||^2)
  + sum_align     rho(|| r_event_2d2d_align ||^2)
  + sum_mapping   rho(|| r_depth_tsdf ||^2)

The observability challenge is distinct: event data depends on motion and scene contrast. IMU priors are not just helpful for scale; they stabilize rotation and velocity when event alignment is underconstrained.
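
For the r_imu terms, the preintegrated motion deltas between keyframes i and j can be accumulated once from raw IMU samples and reused as the state changes. A simplified Euler-integration sketch that ignores bias Jacobians, noise propagation, and gravity handling (all essential in a real estimator) is shown below:

python
import numpy as np

def so3_exp(w):
    """Rodrigues formula: rotation matrix from a rotation vector."""
    theta = np.linalg.norm(w)
    if theta < 1e-9:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def preintegrate(gyro, accel, dt, b_g, b_a):
    """Accumulate delta rotation/velocity/position between two keyframes.

    gyro, accel: (N, 3) IMU samples; dt: (N,) sample intervals
    b_g, b_a:    current bias estimates
    """
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a, h in zip(gyro, accel, dt):
        a_corr = a - b_a
        dp = dp + dv * h + 0.5 * (dR @ a_corr) * h**2
        dv = dv + (dR @ a_corr) * h
        dR = dR @ so3_exp((w - b_g) * h)
    return dR, dv, dp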

Assumptions

  • Event camera timestamps are precise and synchronized with IMU.
  • Contrast thresholds and biases are stable or calibrated.
  • There is sufficient relative motion to generate useful events.
  • Static edges dominate the event stream.
  • IMU noise and bias models are realistic.
  • Stereo event rigs have stable calibration and enough event correspondence.
  • Frame/event fusion methods correctly model the time and modality difference.

Strengths

  • High dynamic range for glare, headlights, tunnel exits, and night floodlights.
  • Low latency and high temporal resolution.
  • Strong resistance to motion blur.
  • Efficient sparse data under motion.
  • Line features are useful in human-made scenes.
  • Stereo event systems can recover metric depth without ordinary frames.
  • Good complement to conventional cameras and IMUs for aggressive motion.

Limitations

  • Static scenes produce little data.
  • Motion-dependent measurements complicate data association and benchmarking.
  • Event streams include sensor noise, hot pixels, and flicker artifacts.
  • Textureless or visually uniform scenes with weak edges remain hard.
  • Calibration is more complex than ordinary frame-camera VIO.
  • Learned or hand-tuned event representations may not transfer across sensors.
  • Automotive-grade event SLAM tooling and datasets are less mature than LiDAR/VIO.
  • Rain, snow, rotating beacons, LED flicker, and dynamic objects can create high event clutter.

Datasets and Benchmarks

Common datasets:

  • MVSEC: stereo event, frame, LiDAR, IMU, and ground truth in driving and indoor scenes.
  • DSEC: stereo event driving dataset with high-resolution event cameras.
  • TUM-VIE: visual-inertial event dataset with motion-capture/ground-truth segments.
  • EVIMO2: event-camera dataset with object motion and ground truth.
  • RPG event datasets: classic event-camera VO/VIO sequences.
  • ESVIO / PL-EVIO / EVI-SAM author data: useful for reproducing method claims.

Metrics:

  • ATE/RPE with consistent trajectory alignment (see the alignment sketch after this list).
  • Tracking success rate under fast motion and HDR.
  • Event-rate sensitivity and latency.
  • IMU bias drift.
  • Depth-map completeness for stereo/direct systems.
  • Runtime on CPU and embedded targets.
  • Failure rate during low-motion intervals.
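
For ATE, a minimal RMSE computation after a closed-form SE(3) alignment (Kabsch/Umeyama style, without scale) might look like this; the time association of poses and the choice of SE(3) versus Sim(3) alignment are assumptions that should match the benchmark protocol being reproduced:

python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Absolute trajectory error (RMSE) after closed-form SE(3) alignment.

    est_xyz, gt_xyz: (N, 3) arrays of time-associated positions.
    """
    mu_e, mu_g = est_xyz.mean(axis=0), gt_xyz.mean(axis=0)
    E, G = est_xyz - mu_e, gt_xyz - mu_g
    U, _, Vt = np.linalg.svd(E.T @ G)                   # 3x3 cross-covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T             # rotation taking est onto gt
    t = mu_g - R @ mu_e
    aligned = est_xyz @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt_xyz) ** 2, axis=1))))

Monocular pipelines are often evaluated with an additional scale factor (Sim(3) alignment); the sketch above assumes metric scale is already observable from the IMU or stereo rig.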

AV Relevance

Event cameras can help AV localization where conventional cameras struggle:

  • Headlight glare and night ramp lighting.
  • Rapid illumination transitions under terminals or tunnels.
  • Motion blur during vibration or fast maneuvers.
  • Low-latency wheel-slip or yaw-rate correction in close-quarters movement.

They are not a primary AV localization sensor today. A practical architecture would use event VIO as an aiding factor:

  • Frame camera/LiDAR/radar map localization remains the main pose source.
  • Event-camera residuals provide high-rate relative motion and HDR edge constraints.
  • A supervisor gates event input using event rate, spatial distribution, and dynamic-object indicators.
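
A minimal sketch of such a supervisor gate; the thresholds and the three statistics are illustrative tuning assumptions, not values from any deployed system:

python
def accept_event_update(event_rate_hz, coverage_ratio, dynamic_ratio,
                        rate_min=5e4, rate_max=5e7,
                        min_coverage=0.3, max_dynamic=0.5):
    """Decide whether the current event packet should contribute residuals.

    event_rate_hz:  events per second in the packet
    coverage_ratio: fraction of image grid cells containing tracked features
    dynamic_ratio:  fraction of events attributed to independently moving objects
    """
    if not (rate_min <= event_rate_hz <= rate_max):
        return False        # too few events (near-static) or a saturating burst
    if coverage_ratio < min_coverage:
        return False        # features clustered in one region: weak geometry
    if dynamic_ratio > max_dynamic:
        return False        # scene dominated by moving objects
    return True

In practice such a gate might down-weight or inflate the covariance of event factors rather than reject them outright.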

Indoor/Outdoor Relevance

Indoor: Strong in high-contrast corridors, warehouses, labs, and drone spaces. Weak when lighting flicker or low motion dominates.

Outdoor: Promising for HDR and fast motion. Needs robust dynamic-object rejection and weather validation.

Mixed indoor/outdoor: Strong use case because dynamic range changes are exactly where frame cameras can degrade.

Airside Deployment Notes

Airside has several event-camera opportunities:

  • Floodlight glare at night.
  • Sudden shadow/sun transitions near terminal structures.
  • High-contrast aircraft, poles, markings, and service-road edges.
  • Low-latency motion feedback for small autonomous tow tractors or inspection robots.

But risks are significant:

  • Flashing beacons and LED signage can create event clutter.
  • Rain, spray, and rotating lights can dominate events.
  • Slow pushback and creeping maneuvers may not generate enough events.
  • Dynamic GSE and crew can violate static-scene assumptions.

Use event VIO as a supplemental relative-motion channel, not a standalone global localization method.

Validation Checklist

  • Calibrate event-camera intrinsics, event-IMU extrinsics, and time offset (a coarse offset check is sketched after this list).
  • Measure event rate and feature coverage by scenario.
  • Test stopped/slow motion, fast vibration, and aggressive turns.
  • Include HDR, night, glare, LED flicker, rain, and spray.
  • Compare against frame VIO under matched motion.
  • Verify estimator covariance during low-event intervals.
  • Confirm dynamic-object gating under moving vehicles and people.
  • Test runtime on target compute with realistic event rates.
  • Validate recovery after event burst saturation.
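
For the time-offset item, a coarse sanity check is to cross-correlate the event rate with the IMU angular-speed magnitude; this is a rough heuristic that assumes rotation-dominated excitation and is no substitute for proper calibration tooling (all names and defaults below are illustrative):

python
import numpy as np

def estimate_time_offset(event_times, imu_times, gyro, bin_dt=0.005, max_offset=0.1):
    """Coarse event/IMU time offset from cross-correlating event rate with |omega|.

    event_times: (N,) event timestamps [s]
    imu_times:   (M,) IMU timestamps [s]
    gyro:        (M, 3) angular rates [rad/s]
    Returns the shift [s] that maximizes the correlation; the sign convention
    should be verified against the target logging setup.
    """
    t0 = min(event_times.min(), imu_times.min())
    t1 = max(event_times.max(), imu_times.max())
    bins = np.arange(t0, t1 + bin_dt, bin_dt)
    rate, _ = np.histogram(event_times, bins=bins)                   # events per bin
    speed = np.interp(bins[:-1], imu_times, np.linalg.norm(gyro, axis=1))
    rate = (rate - rate.mean()) / (rate.std() + 1e-9)                # normalize both signals
    speed = (speed - speed.mean()) / (speed.std() + 1e-9)
    max_shift = int(max_offset / bin_dt)
    shifts = np.arange(-max_shift, max_shift + 1)
    scores = [np.dot(np.roll(rate, s), speed) for s in shifts]       # roll wraps around; acceptable for a coarse check
    return float(shifts[int(np.argmax(scores))] * bin_dt)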

Sources

Notes compiled from publicly available research publications and project documentation.