ForeSight

What It Is

  • ForeSight is an ICCV 2025 multi-view streaming framework for joint 3D object detection and trajectory forecasting.
  • It uses surround-camera streams and keeps detection and forecasting inside one query-memory system.
  • The method is tracking-free: it does not require explicit object association before forecasting.
  • It is designed for online streaming inference rather than offline full-sequence processing.
  • It is most relevant to stacks that want object detection and short-horizon motion priors from a shared temporal representation.
  • It complements sparse query detection methods such as Sparse4D and temporal perception methods such as StreamingFlow.

Core Technical Idea

  • Detection and forecasting should exchange information instead of running as a strict detect-then-track-then-predict chain.
  • Use a shared bidirectional memory that stores detections, forecasts, and query states over time.
  • A forecast-aware detection transformer uses multiple-hypothesis forecast memory to improve current spatial reasoning.
  • A streaming forecast transformer uses refined detections and past forecasts to improve temporal consistency.
  • Avoid explicit tracking to reduce error propagation from wrong associations.
  • Propagate motion hypotheses forward so current perception benefits from likely future actor states.
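The bidirectional exchange above can be sketched as a streaming loop over a shared memory. This is a minimal illustration of the control flow only; the names (`SharedMemory`, `detect_step`, `forecast_step`, `streaming_frame`) and the dictionary payloads are hypothetical, not the authors' API.

```python
# Hypothetical sketch of the bidirectional query-memory loop:
# forecasts condition detection, and refined detections feed forecasting,
# with no explicit tracker in between.
from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Stores refined detections and multi-hypothesis forecasts over time."""
    detections: list = field(default_factory=list)
    forecasts: list = field(default_factory=list)


def detect_step(features, memory):
    # Forecast-aware detection: condition current-frame detections on
    # motion hypotheses propagated from earlier frames (may be None).
    context = memory.forecasts[-1] if memory.forecasts else None
    return [{"box": f, "forecast_prior": context} for f in features]


def forecast_step(detections, memory):
    # Streaming forecasting: refine trajectories from current detections
    # plus past forecast states, keeping multiple hypotheses per object.
    past = memory.forecasts[-1] if memory.forecasts else None
    return [{"det": d, "past": past, "num_hypotheses": 3} for d in detections]


def streaming_frame(features, memory):
    dets = forecasts_to_detection = detect_step(features, memory)
    fcsts = forecast_step(dets, memory)
    memory.detections.append(dets)   # update streaming state frame by frame
    memory.forecasts.append(fcsts)
    return dets, fcsts, memory
```

The point of the sketch is the ordering: the forecast memory is read before detection, and the refined detections are read before forecasting, so each head improves the other within a single frame.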

Inputs and Outputs

  • Input: multi-view camera images over time.
  • Input metadata: camera intrinsics, extrinsics, ego pose, timestamps, and temporal ordering.
  • Optional input: previous query memory from the streaming state.
  • Training input: 3D boxes and trajectory annotations from nuScenes-style data.
  • Output: current-frame 3D object detections.
  • Output: multi-modal or multiple-hypothesis trajectory forecasts.
  • Output: streaming memory state for the next frame.
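The per-frame interface above can be written down as two records. The field names below are illustrative stand-ins, not taken from the paper's code or the nuScenes devkit.

```python
# Hedged sketch of a per-frame input/output contract for a streaming
# detector-forecaster. All names are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrameInput:
    images: list                      # one image per surround camera
    intrinsics: list                  # per-camera 3x3 K matrices
    extrinsics: list                  # per-camera camera-to-ego transforms
    ego_pose: list                    # ego-to-global transform at timestamp
    timestamp: float                  # preserves temporal ordering
    prev_memory: Optional[object] = None  # streaming query memory, if any


@dataclass
class FrameOutput:
    detections: list                  # current-frame 3D boxes
    forecasts: list                   # multi-hypothesis future trajectories
    memory: object                    # state handed to the next frame
```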

Architecture or Pipeline

  • Multi-view image encoders extract camera features.
  • Detection queries and forecast queries interact with current image features and historical memory.
  • A multiple-hypothesis forecast memory queue stores future motion candidates from prior frames.
  • The forecast-aware detection transformer feeds forecast context back into detection.
  • The streaming forecast transformer refines trajectories using current detections and past forecast states.
  • Memory is updated frame by frame and reused without an external tracker.
  • The output can feed both prediction and planning, but it does not replace dense freespace or occupancy.
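The multiple-hypothesis forecast memory queue in the pipeline can be approximated with a fixed-length deque. The window size and hypothesis count below are placeholders, not values from the paper, and the class is a hypothetical illustration.

```python
# Sketch of a bounded multi-hypothesis forecast memory queue.
# max_frames and num_hypotheses are illustrative placeholders.
from collections import deque


class ForecastMemoryQueue:
    def __init__(self, max_frames=4, num_hypotheses=6):
        self.queue = deque(maxlen=max_frames)  # oldest frame state drops out
        self.num_hypotheses = num_hypotheses

    def push(self, frame_forecasts):
        # frame_forecasts: per-object lists of candidate future trajectories;
        # keep only the top hypotheses per object to bound memory.
        clipped = [f[: self.num_hypotheses] for f in frame_forecasts]
        self.queue.append(clipped)

    def context(self):
        # Flattened history of motion candidates, the kind of context the
        # forecast-aware detection transformer would attend over.
        return [obj for frame in self.queue for obj in frame]
```

A bounded queue is what makes the method streaming-friendly: memory cost stays constant per frame instead of growing with sequence length, unlike offline full-sequence aggregation.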

Training and Evaluation

  • The main benchmark is nuScenes.
  • ForeSight reports an end-to-end prediction accuracy (EPA) of 54.9%.
  • The paper reports a +9.3 percentage point EPA gain over previous methods.
  • The project page reports a +2.1 percentage point mAP gain over StreamPETR while remaining efficient for streaming inference.
  • The paper also reports best mAP and minADE among compared multi-view detection and forecasting models.
  • Evaluation should include detection metrics, forecast metrics, latency, memory reset behavior, and performance through occlusions.
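Of the metrics above, minADE has a simple standard definition: the mean L2 displacement of the best of K hypotheses against the ground-truth trajectory. The sketch below shows that definition only; it does not reproduce the benchmark's matching protocol, and EPA additionally folds in detection true/false positives, which is not shown here.

```python
# Standard minADE over K trajectory hypotheses (best-hypothesis mean L2 error).
import math


def min_ade(hypotheses, gt):
    """hypotheses: K trajectories, each a list of (x, y) waypoints;
    gt: ground-truth list of (x, y) waypoints of the same length."""
    def ade(traj):
        # mean Euclidean error over the forecast horizon
        return sum(math.dist(p, q) for p, q in zip(traj, gt)) / len(gt)
    return min(ade(h) for h in hypotheses)
```

Because only the best hypothesis is scored, minADE rewards diverse hypothesis sets, which is one reason it pairs naturally with multi-modal forecasting but can hide how bad the worst hypothesis is.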

Strengths

  • Tightly couples perception and prediction without a brittle tracking handoff.
  • Forecast memory can help detect partially occluded or temporarily weak objects.
  • Streaming state is closer to deployment than offline sequence aggregation.
  • Multiple forecast hypotheses are more useful for planning than one deterministic future.
  • Camera-only input can be attractive when LiDAR is unavailable or used as an independent safety layer.
  • Query memory is lighter than dense BEV history for many edge deployments.

Failure Modes

  • Camera-only depth and motion remain fragile under glare, darkness, rain, spray, and heavy occlusion.
  • Being tracking-free does not make identity risk disappear: planners still need stable actor IDs or some other form of state continuity.
  • Forecasts can reinforce false detections if the memory is not reset after scene cuts, localization jumps, or sensor faults.
  • Road-driving motion priors may not match apron choreography, pushback operations, baggage trains, or personnel near aircraft.
  • It does not output dense freespace, aircraft clearance envelopes, or FOD occupancy.
  • EPA and minADE can hide rare but safety-critical false negatives near the ego path.

Airside AV Fit

  • Useful for joint detection and short-horizon motion forecasting of tugs, buses, baggage tractors, carts, service trucks, and personnel.
  • The memory architecture is relevant to objects that vanish briefly behind aircraft, GSE, jet bridges, or baggage trains.
  • Airside adaptation needs trajectory classes for coupled motion, such as tug-aircraft pushback, baggage train articulation, and belt-loader positioning.
  • Forecast outputs should be treated as planning priors, not as the only collision-avoidance layer.
  • Pair with LiDAR/radar occupancy for near-field clearance around wings, engines, cones, chocks, and FOD.
  • Validate by turnaround phase, occlusion duration, gate geometry, and nighttime/floodlight conditions.

Implementation Notes

  • Define explicit memory reset conditions for dropped frames, localization jumps, route changes, and camera faults.
  • Measure current-frame latency and forecast horizon accuracy together; delayed forecasts can be worse than simpler low-latency models.
  • Add class-specific forecast modes for slow GSE, pedestrians, aircraft, and articulated baggage carts.
  • Expose forecast uncertainty and multi-hypothesis weights to planning.
  • Keep a separate multi-object tracker if downstream modules require persistent IDs.
  • Validate against constant-velocity, static-object, and track-then-predict baselines before adopting the full model.
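The first note, explicit memory-reset gating, can be made concrete as a single predicate checked before reusing streaming state. The function, its arguments, and the thresholds below are all placeholders to tune per deployment, not recommended values.

```python
# Hypothetical memory-reset gate for streaming inference.
# Thresholds are illustrative defaults, not validated values.
def should_reset_memory(dt_s, ego_jump_m, camera_ok, on_planned_route,
                        max_gap_s=0.5, max_jump_m=2.0):
    """Return True when streaming memory must be cleared before reuse."""
    if dt_s > max_gap_s:            # dropped or delayed frames
        return True
    if ego_jump_m > max_jump_m:     # localization discontinuity
        return True
    if not camera_ok:               # camera fault or blackout
        return True
    if not on_planned_route:        # unexpected route change
        return True
    return False
```

Gating resets on explicit conditions, rather than trusting the memory to recover, directly addresses the failure mode where stale forecasts reinforce false detections after scene cuts or localization jumps.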

Sources

Research notes collected from public sources.