ForeSight

What It Is

  • ForeSight is an ICCV 2025 multi-view streaming framework for joint 3D object detection and trajectory forecasting.
  • It uses surround-camera streams and keeps detection and forecasting inside one query-memory system.
  • The method is tracking-free: it does not require explicit object association before forecasting.
  • It is designed for online streaming inference rather than offline full-sequence processing.
  • It is most relevant to stacks that want object detection and short-horizon motion priors from a shared temporal representation.
  • It complements sparse query detection methods such as Sparse4D and temporal perception methods such as StreamingFlow.

Core Technical Idea

  • Detection and forecasting should exchange information instead of running as a strict detect-then-track-then-predict chain.
  • Use a shared bidirectional memory that stores detections, forecasts, and query states over time.
  • A forecast-aware detection transformer uses multiple-hypothesis forecast memory to improve current spatial reasoning.
  • A streaming forecast transformer uses refined detections and past forecasts to improve temporal consistency.
  • Avoid explicit tracking to reduce error propagation from wrong associations.
  • Propagate motion hypotheses forward so current perception benefits from likely future actor states.
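The bidirectional exchange above can be sketched as a streaming loop over a shared memory. This is a minimal illustration of the control flow only; the names (`SharedMemory`, `detect_step`, `forecast_step`, `streaming_frame`) and the dictionary payloads are hypothetical, not the authors' API.

```python
# Hypothetical sketch of the bidirectional query-memory loop:
# forecasts condition detection, and refined detections feed forecasting,
# with no explicit tracker in between.
from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Stores refined detections and multi-hypothesis forecasts over time."""
    detections: list = field(default_factory=list)
    forecasts: list = field(default_factory=list)


def detect_step(features, memory):
    # Forecast-aware detection: condition current-frame detections on
    # motion hypotheses propagated from earlier frames (may be None).
    context = memory.forecasts[-1] if memory.forecasts else None
    return [{"box": f, "forecast_prior": context} for f in features]


def forecast_step(detections, memory):
    # Streaming forecasting: refine trajectories from current detections
    # plus past forecast states, keeping multiple hypotheses per object.
    past = memory.forecasts[-1] if memory.forecasts else None
    return [{"det": d, "past": past, "num_hypotheses": 3} for d in detections]


def streaming_frame(features, memory):
    dets = forecasts_to_detection = detect_step(features, memory)
    fcsts = forecast_step(dets, memory)
    memory.detections.append(dets)   # update streaming state frame by frame
    memory.forecasts.append(fcsts)
    return dets, fcsts, memory
```

The point of the sketch is the ordering: the forecast memory is read before detection, and the refined detections are read before forecasting, so each head improves the other within a single frame.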

Inputs and Outputs

  • Input: multi-view camera images over time.
  • Input metadata: camera intrinsics, extrinsics, ego pose, timestamps, and temporal ordering.
  • Optional input: previous query memory from the streaming state.
  • Training input: 3D boxes and trajectory annotations from nuScenes-style data.
  • Output: current-frame 3D object detections.
  • Output: multi-modal or multiple-hypothesis trajectory forecasts.
  • Output: streaming memory state for the next frame.
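The per-frame interface above can be written down as two records. The field names below are illustrative stand-ins, not taken from the paper's code or the nuScenes devkit.

```python
# Hedged sketch of a per-frame input/output contract for a streaming
# detector-forecaster. All names are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrameInput:
    images: list                      # one image per surround camera
    intrinsics: list                  # per-camera 3x3 K matrices
    extrinsics: list                  # per-camera camera-to-ego transforms
    ego_pose: list                    # ego-to-global transform at timestamp
    timestamp: float                  # preserves temporal ordering
    prev_memory: Optional[object] = None  # streaming query memory, if any


@dataclass
class FrameOutput:
    detections: list                  # current-frame 3D boxes
    forecasts: list                   # multi-hypothesis future trajectories
    memory: object                    # state handed to the next frame
```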

Architecture or Pipeline

  • Multi-view image encoders extract camera features.
  • Detection queries and forecast queries interact with current image features and historical memory.
  • A multiple-hypothesis forecast memory queue stores future motion candidates from prior frames.
  • The forecast-aware detection transformer feeds forecast context back into detection.
  • The streaming forecast transformer refines trajectories using current detections and past forecast states.
  • Memory is updated frame by frame and reused without an external tracker.
  • The output can feed both prediction and planning, but it does not replace dense freespace or occupancy.
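The multiple-hypothesis forecast memory queue in the pipeline can be approximated with a fixed-length deque. The window size and hypothesis count below are placeholders, not values from the paper, and the class is a hypothetical illustration.

```python
# Sketch of a bounded multi-hypothesis forecast memory queue.
# max_frames and num_hypotheses are illustrative placeholders.
from collections import deque


class ForecastMemoryQueue:
    def __init__(self, max_frames=4, num_hypotheses=6):
        self.queue = deque(maxlen=max_frames)  # oldest frame state drops out
        self.num_hypotheses = num_hypotheses

    def push(self, frame_forecasts):
        # frame_forecasts: per-object lists of candidate future trajectories;
        # keep only the top hypotheses per object to bound memory.
        clipped = [f[: self.num_hypotheses] for f in frame_forecasts]
        self.queue.append(clipped)

    def context(self):
        # Flattened history of motion candidates, the kind of context the
        # forecast-aware detection transformer would attend over.
        return [obj for frame in self.queue for obj in frame]
```

A bounded queue is what makes the method streaming-friendly: memory cost stays constant per frame instead of growing with sequence length, unlike offline full-sequence aggregation.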

Training and Evaluation

  • The main benchmark is nuScenes.
  • ForeSight reports an end-to-end prediction accuracy (EPA) of 54.9%.
  • The paper reports a +9.3 percentage point EPA gain over previous methods.
  • The project page reports a +2.1 percentage point mAP gain over StreamPETR while remaining efficient for streaming inference.
  • The paper also reports best mAP and minADE among compared multi-view detection and forecasting models.
  • Evaluation should include detection metrics, forecast metrics, latency, memory reset behavior, and performance through occlusions.
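Of the metrics above, minADE has a simple standard definition: the mean L2 displacement of the best of K hypotheses against the ground-truth trajectory. The sketch below shows that definition only; it does not reproduce the benchmark's matching protocol, and EPA additionally folds in detection true/false positives, which is not shown here.

```python
# Standard minADE over K trajectory hypotheses (best-hypothesis mean L2 error).
import math


def min_ade(hypotheses, gt):
    """hypotheses: K trajectories, each a list of (x, y) waypoints;
    gt: ground-truth list of (x, y) waypoints of the same length."""
    def ade(traj):
        # mean Euclidean error over the forecast horizon
        return sum(math.dist(p, q) for p, q in zip(traj, gt)) / len(gt)
    return min(ade(h) for h in hypotheses)
```

Because only the best hypothesis is scored, minADE rewards diverse hypothesis sets, which is one reason it pairs naturally with multi-modal forecasting but can hide how bad the worst hypothesis is.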

Strengths

  • Tightly couples perception and prediction without a brittle tracking handoff.
  • Forecast memory can help detect partially occluded or temporarily weak objects.
  • Streaming state is closer to deployment than offline sequence aggregation.
  • Multiple forecast hypotheses are more useful for planning than one deterministic future.
  • Camera-only input can be attractive when LiDAR is unavailable or used as an independent safety layer.
  • Query memory is lighter than dense BEV history for many edge deployments.

Failure Modes

  • Camera-only depth and motion remain fragile under glare, darkness, rain, spray, and heavy occlusion.
  • Being tracking-free does not make identity risk disappear: planners still need stable actor IDs or some other form of state continuity.
  • Forecasts can reinforce false detections if the memory is not reset after scene cuts, localization jumps, or sensor faults.
  • Road-driving motion priors may not match apron choreography, pushback operations, baggage trains, or personnel near aircraft.
  • It does not output dense freespace, aircraft clearance envelopes, or FOD occupancy.
  • EPA and minADE can hide rare but safety-critical false negatives near the ego path.

Airside AV Fit

  • Useful for joint detection and short-horizon motion forecasting of tugs, buses, baggage tractors, carts, service trucks, and personnel.
  • The memory architecture is relevant to objects that vanish briefly behind aircraft, GSE, jet bridges, or baggage trains.
  • Airside adaptation needs trajectory classes for coupled motion, such as tug-aircraft pushback, baggage train articulation, and belt-loader positioning.
  • Forecast outputs should be treated as planning priors, not as the only collision-avoidance layer.
  • Pair with LiDAR/radar occupancy for near-field clearance around wings, engines, cones, chocks, and FOD.
  • Validate by turnaround phase, occlusion duration, gate geometry, and nighttime/floodlight conditions.

Implementation Notes

  • Define explicit memory reset conditions for dropped frames, localization jumps, route changes, and camera faults.
  • Measure current-frame latency and forecast horizon accuracy together; delayed forecasts can be worse than simpler low-latency models.
  • Add class-specific forecast modes for slow GSE, pedestrians, aircraft, and articulated baggage carts.
  • Expose forecast uncertainty and multi-hypothesis weights to planning.
  • Keep a separate multi-object tracker if downstream modules require persistent IDs.
  • Validate against constant-velocity, static-object, and track-then-predict baselines before adopting the full model.
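The first note, explicit memory-reset gating, can be made concrete as a single predicate checked before reusing streaming state. The function, its arguments, and the thresholds below are all placeholders to tune per deployment, not recommended values.

```python
# Hypothetical memory-reset gate for streaming inference.
# Thresholds are illustrative defaults, not validated values.
def should_reset_memory(dt_s, ego_jump_m, camera_ok, on_planned_route,
                        max_gap_s=0.5, max_jump_m=2.0):
    """Return True when streaming memory must be cleared before reuse."""
    if dt_s > max_gap_s:            # dropped or delayed frames
        return True
    if ego_jump_m > max_jump_m:     # localization discontinuity
        return True
    if not camera_ok:               # camera fault or blackout
        return True
    if not on_planned_route:        # unexpected route change
        return True
    return False
```

Gating resets on explicit conditions, rather than trusting the memory to recover, directly addresses the failure mode where stale forecasts reinforce false detections after scene cuts or localization jumps.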

Sources

Research notes collected from public sources.