Streaming & Temporal Perception for Real-Time Driving
From Single-Frame Detection to Temporally Consistent Scene Understanding
Last updated: 2026-04-11
Summary: Single-frame 3D detection ignores temporal information that is freely available from sequential sensor data. Streaming perception methods propagate object queries, BEV features, or latent states across frames to improve detection accuracy (+3-8% NDS), enable implicit tracking (AMOTA 65%+ without explicit association), and compensate for inference latency — all critical for airside operations where slow-moving objects (1-5 km/h carts) are easily missed in single frames. This document covers every major streaming perception architecture, from dense BEV temporal fusion (BEVFormer, SOLOFusion) through sparse query propagation (StreamPETR, Sparse4D v3) to LiDAR temporal accumulation (CenterPoint-Temporal, TransFusion-L), along with latency-aware methods (LASP, and the ASAP streaming benchmark) that predict object positions during inference delay. The key finding: StreamPETR's object-centric temporal mechanism adds only 2-3ms overhead while boosting NDS by 6-8% and providing free multi-object tracking — the highest value-per-FLOP temporal method for Orin deployment. For the reference airside AV stack's LiDAR-primary pipeline, simple multi-scan accumulation (3-5 frames, 0.5s window) combined with CenterPoint-Temporal provides most of the benefit at minimal engineering cost.
Table of Contents
- Why Temporal Perception Matters
- Temporal Information Sources
- Dense BEV Temporal Fusion
- Sparse Query Propagation
- LiDAR Temporal Accumulation
- Streaming Perception and Latency Compensation
- Temporal Occupancy Prediction
- Video Backbone Architectures
- Temporal Fusion for Multi-Modal Stacks
- Orin Deployment Strategies
- Airside-Specific Temporal Challenges
- Implementation Guide
- Key Takeaways
- References
1. Why Temporal Perception Matters
1.1 The Single-Frame Problem
Single-frame 3D detection treats each LiDAR sweep or camera capture independently, discarding:
- Velocity information: A stationary object and a moving object at the same position look identical in a single frame (see the sketch after this list)
- Occluded object memory: Objects hidden behind aircraft fuselage for 1-2 seconds vanish from detections
- Confidence accumulation: Multiple observations of the same object should yield higher confidence
- Trajectory history: Critical for motion prediction and collision avoidance
- Sparse region densification: Multiple frames fill in gaps from LiDAR scan pattern
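To make the first point concrete, a minimal sketch of recovering velocity from two ego-motion-compensated detections (the helper is illustrative, not part of the stack):

import numpy as np

def velocity_from_displacement(center_t: np.ndarray,
                               center_prev: np.ndarray,
                               dt: float = 0.1) -> np.ndarray:
    """Object velocity from centers observed in the same (current ego)
    frame one LiDAR period apart (dt = 0.1s at 10 Hz)."""
    return (center_t - center_prev) / dt

# A cart at 2 km/h (~0.56 m/s) moves ~5.6 cm between 10 Hz frames:
# invisible to a single-frame detector, directly measurable across frames.
v = velocity_from_displacement(np.array([10.0, 5.056, 0.0]),
                               np.array([10.0, 5.0, 0.0]))
print(np.linalg.norm(v))  # ≈ 0.56 m/s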
1.2 Why This Matters More on Airside
Airport ramps create uniquely challenging temporal scenarios:
| Challenge | Why Single-Frame Fails | Temporal Solution |
|---|---|---|
| Slow-moving GSE (1-5 km/h) | Below LiDAR motion threshold, looks static | Velocity from multi-frame displacement |
| Aircraft fuselage occlusion | 30-65m opaque body hides entire lanes | Memory retains objects during occlusion |
| Personnel crouching/bending | Height changes cause missed detections | Temporal smoothing maintains track |
| FOD on ground | Few LiDAR returns per object (2-5 points) | Multi-scan accumulation increases point count |
| De-icing spray clouds | Transient false positives per-frame | Temporal filtering (real obstacles persist) |
| Jet blast shimmer | LiDAR distortion varies frame-to-frame | Temporal consistency check rejects jitter |
1.3 Taxonomy of Temporal Methods
Temporal Perception Methods
├── Dense representation temporal fusion
│ ├── BEV concatenation (BEVDet4D, SOLOFusion)
│ ├── BEV warping + attention (BEVFormer v2)
│ └── BEV recurrence (BEVerse)
├── Sparse query propagation
│ ├── Object queries across frames (StreamPETR)
│ ├── Anchor-based temporal (Sparse4D v3)
│ └── Track queries (MUTR3D, PF-Track)
├── Point cloud accumulation
│ ├── Multi-sweep ego-motion compensated (CenterPoint)
│ ├── 4D radar accumulation (100+ sweeps)
│ └── Learned temporal point aggregation
├── Latency-aware streaming
│ ├── Future prediction (Streamer, LASP)
│ ├── Dual-computation (ASAP)
│ └── Async temporal fusion (StreamingFlow)
└── Temporal occupancy
├── 4D forecasting (UnO, OccSora)
    └── Scene flow integration (DeFlow)

2. Temporal Information Sources
2.1 Ego-Motion Compensation
The foundation of all temporal methods: transform historical data to the current ego-vehicle coordinate frame.
import numpy as np
from scipy.spatial.transform import Rotation
class EgoMotionCompensator:
"""Transform past sensor data to current vehicle frame."""
def __init__(self, max_history: int = 10):
self.pose_history = [] # List of (timestamp, T_world_ego) tuples
self.max_history = max_history
def update_pose(self, timestamp: float, T_world_ego: np.ndarray):
"""Add current pose from GTSAM localization."""
self.pose_history.append((timestamp, T_world_ego.copy()))
if len(self.pose_history) > self.max_history:
self.pose_history.pop(0)
def transform_to_current(
self, points: np.ndarray, source_timestamp: float
) -> np.ndarray:
"""Transform points from historical frame to current frame.
Args:
points: (N, 3+) point cloud in source frame
source_timestamp: timestamp of source frame
Returns:
points_current: (N, 3+) points in current ego frame
"""
T_world_source = self._get_pose(source_timestamp)
T_world_current = self._get_pose(self.pose_history[-1][0])
# T_current_source = T_current_world @ T_world_source
T_current_source = np.linalg.inv(T_world_current) @ T_world_source
# Transform xyz, keep other features (intensity, etc.)
xyz = points[:, :3]
xyz_h = np.hstack([xyz, np.ones((len(xyz), 1))])
xyz_transformed = (T_current_source @ xyz_h.T).T[:, :3]
result = points.copy()
result[:, :3] = xyz_transformed
return result
def _get_pose(self, timestamp: float) -> np.ndarray:
"""Interpolate pose at given timestamp."""
# Find bracketing poses
for i in range(len(self.pose_history) - 1):
t0, T0 = self.pose_history[i]
t1, T1 = self.pose_history[i + 1]
if t0 <= timestamp <= t1:
alpha = (timestamp - t0) / (t1 - t0 + 1e-9)
                # Rotation-vector interpolation (approximates SLERP), LERP for translation
R0 = Rotation.from_matrix(T0[:3, :3])
R1 = Rotation.from_matrix(T1[:3, :3])
R_interp = Rotation.from_rotvec(
(1 - alpha) * R0.as_rotvec() + alpha * R1.as_rotvec()
)
t_interp = (1 - alpha) * T0[:3, 3] + alpha * T1[:3, 3]
T = np.eye(4)
T[:3, :3] = R_interp.as_matrix()
T[:3, 3] = t_interp
return T
        # Timestamp outside the buffer: fall back to the most recent pose
        return self.pose_history[-1][1]

2.2 Temporal Data Sources for the Reference Airside AV Stack
| Source | Rate | History Depth | Compensation | Value |
|---|---|---|---|---|
| RoboSense LiDAR | 10 Hz | 5-10 frames (0.5-1.0s) | Ego-motion (GTSAM) | Point densification, velocity |
| IMU (SBG Ellipse) | 500 Hz | Continuous | N/A (used for compensation) | Sub-frame motion correction |
| GTSAM pose | 10 Hz | Full history | N/A (reference frame) | Ego-motion for warping |
| Wheel odometry | 100 Hz | 10s buffer | N/A (input to GTSAM) | Dead reckoning during GPS dropout |
| Thermal camera | 30 Hz | 3-5 frames | Image-level warp | Personnel trajectory |
| 4D radar | 20 Hz | 50-100 frames (2.5-5s) | Doppler-aware | Long-range velocity field |
2.3 Intra-Scan Motion Correction
LiDAR sweeps are not instantaneous — a 10 Hz RoboSense RSHELIOS has a 100ms rotation period. At a vehicle speed of 25 km/h, points at the start vs. end of a sweep are separated by ~0.7m.
def skew_symmetric(v: np.ndarray) -> np.ndarray:
    """Skew-symmetric matrix: skew_symmetric(v) @ x == np.cross(v, x)."""
    return np.array([
        [0.0, -v[2], v[1]],
        [v[2], 0.0, -v[0]],
        [-v[1], v[0], 0.0]
    ])

def motion_correct_lidar_sweep(
    points: np.ndarray,
    point_timestamps: np.ndarray,
    imu_data: list,
    scan_start_time: float,
    scan_end_time: float
) -> np.ndarray:
    """Correct intra-scan motion distortion using IMU data.

    Each point is transformed from its capture time to scan_end_time.
    This is critical for multi-LiDAR fusion where sweeps are not
    perfectly synchronized. interpolate_imu_angular / interpolate_imu_linear
    are assumed helpers returning angular velocity (rad/s) and linear
    velocity (m/s) in the ego frame at time t.
    """
    corrected = points.copy()
    # For each point, compute relative motion from capture time to scan end
    for i, (pt, t) in enumerate(zip(points, point_timestamps)):
        dt = scan_end_time - t
        omega = interpolate_imu_angular(imu_data, t)
        velocity = interpolate_imu_linear(imu_data, t)
        # Small-angle (first-order) rotation correction over dt
        dR = np.eye(3) + dt * skew_symmetric(omega)
        corrected[i, :3] = dR @ pt[:3] + velocity * dt
    return corrected

3. Dense BEV Temporal Fusion
3.1 BEVFormer Temporal Self-Attention
BEVFormer (ECCV 2022) introduced temporal BEV fusion via deformable attention between current and historical BEV features.
Architecture:
Frame t-1 BEV features ─→ Ego-motion warp ─→ Deformable cross-attention
↓
Frame t   Image features ─→ Spatial cross-attention ─→ BEV features t

Key mechanism: Temporal self-attention (TSA) samples reference points from the warped historical BEV grid, allowing the model to attend to where objects were in the previous frame.
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """BEVFormer-style temporal self-attention layer.

    DeformableAttention and warp_reference_points are assumed helpers;
    this sketch shows the temporal attention wiring.
    """
    def __init__(self, embed_dim=256, num_heads=8, num_levels=1, num_points=4):
        super().__init__()
        self.deformable_attn = DeformableAttention(
            embed_dim, num_heads, num_levels, num_points
        )
def forward(
self,
query: torch.Tensor, # (B, H*W, C) current BEV queries
prev_bev: torch.Tensor, # (B, H*W, C) previous BEV features
ego_motion: torch.Tensor, # (B, 4, 4) T_current_prev
reference_points: torch.Tensor # (B, H*W, 2) BEV grid coordinates
) -> torch.Tensor:
"""Attend from current BEV queries to warped previous BEV."""
# Warp previous BEV reference points using ego-motion
# (transform 2D BEV coordinates by rigid motion)
warped_ref = self.warp_reference_points(reference_points, ego_motion)
# Stack current (self) + previous (cross-temporal) as key/value
key_value = torch.stack([query, prev_bev], dim=1) # (B, 2, H*W, C)
# Deformable attention across temporal dimension
output = self.deformable_attn(query, key_value, warped_ref)
        return output

Performance impact (nuScenes):
- BEVFormer without temporal: 41.6% NDS
- BEVFormer with 1-frame temporal: 51.7% NDS (+10.1%)
- Diminishing returns beyond 2-3 frames for camera-based BEV
3.2 SOLOFusion: Long-Term Stereo for Multi-View 3D Detection
SOLOFusion (ICLR 2023) demonstrates that long temporal history dramatically improves camera 3D detection by providing implicit stereo cues.
Key insight: Multi-frame ego-motion-compensated image features create wide-baseline stereo pairs. A camera moving at 7 m/s over 16 frames (1.6s) creates an 11.2m baseline — far wider than any physical stereo rig.
| Temporal Frames | mAP | NDS | Notes |
|---|---|---|---|
| 1 frame | 35.4% | 44.9% | Single-frame baseline |
| 4 frames | 42.1% | 50.8% | Noticeable improvement |
| 9 frames | 48.2% | 55.3% | Significant gain |
| 16 frames | 53.4% | 58.2% | Best performance |
| 17+ frames | ~53.4% | ~58.2% | Saturates |
Limitation for airside: SOLOFusion assumes sufficient ego-motion for stereo parallax. reference airside vehicles often move at 1-5 km/h — 16 frames at 10 Hz with 2 km/h motion gives only a 0.9m baseline. This limits camera-based temporal benefits for slow-speed operations.
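To gauge whether temporal stereo helps at a given operating point, a quick baseline calculation (plain arithmetic, no library assumptions):

def temporal_stereo_baseline(speed_kmh: float, num_frames: int,
                             frame_rate_hz: float = 10.0) -> float:
    """Effective stereo baseline (m) accumulated over a temporal window."""
    speed_ms = speed_kmh / 3.6
    return speed_ms * (num_frames / frame_rate_hz)

# Highway-like motion (7 m/s ≈ 25.2 km/h): ~11.2 m over 16 frames
print(temporal_stereo_baseline(25.2, 16))  # ≈ 11.2
# Airside crawl: ~0.9 m — too little parallax for useful depth
print(temporal_stereo_baseline(2.0, 16))   # ≈ 0.89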
3.3 BEV Temporal Memory Management
For Orin deployment, BEV temporal features consume significant memory:
BEV grid: 200 x 200 x 256 channels = ~20 MB per frame (FP16)
3-frame history: ~60 MB
10-frame history: ~200 MB
Orin AGX 64GB: ~60 MB is manageable
Orin NX 16GB: ~60 MB is 0.4% of RAM — negligible

KV-cache for temporal BEV:
import numpy as np
import torch

class BEVTemporalCache:
    """FIFO cache for BEV temporal features with ego-motion warping."""
    def __init__(self, max_frames: int = 3, bev_shape=(200, 200, 256),
                 voxel_size: float = 0.5):
        self.max_frames = max_frames
        self.voxel_size = voxel_size  # metres per BEV cell (needed for warping)
        self.cache = []  # List of (timestamp, bev_features, T_world_ego)
def update(self, timestamp, bev_features, T_world_ego):
"""Push new frame, pop oldest if full."""
self.cache.append((timestamp, bev_features.clone(), T_world_ego.copy()))
if len(self.cache) > self.max_frames:
self.cache.pop(0)
def get_warped_history(self, T_world_current):
"""Get all historical BEV features warped to current frame."""
warped = []
for ts, bev, T_world_past in self.cache[:-1]: # Exclude current
T_current_past = np.linalg.inv(T_world_current) @ T_world_past
warped_bev = self.warp_bev(bev, T_current_past)
warped.append(warped_bev)
return warped
    def warp_bev(self, bev, T_relative):
        """Warp BEV features using a 2D affine transform derived from SE(3)."""
        # Extract 2D rotation and translation from the 3D transform
        cos_theta = T_relative[0, 0]
        sin_theta = T_relative[1, 0]
        tx, ty = T_relative[0, 3], T_relative[1, 3]
        H, W, C = bev.shape
        # Build the 2x3 affine matrix for grid_sample; translation is
        # converted from metres to normalized grid units ([-1, 1] span).
        # Sign/axis conventions depend on the BEV layout — verify on data.
        theta = torch.tensor(
            [[cos_theta, -sin_theta, 2.0 * tx / (W * self.voxel_size)],
             [sin_theta, cos_theta, 2.0 * ty / (H * self.voxel_size)]],
            dtype=bev.dtype
        ).unsqueeze(0)
        grid = torch.nn.functional.affine_grid(
            theta, (1, C, H, W), align_corners=True
        )
        return torch.nn.functional.grid_sample(
            bev.permute(2, 0, 1).unsqueeze(0),  # (1, C, H, W)
            grid, align_corners=True
        ).squeeze(0).permute(1, 2, 0)

4. Sparse Query Propagation
4.1 StreamPETR: Object-Centric Temporal Modeling
StreamPETR (ICCV 2023) is the most efficient temporal method — propagating object-level queries rather than dense BEV features.
Core mechanism:
- Initialize N=900 learnable object queries
- Per frame: queries attend to image features via spatial cross-attention
- After detection: top-K=256 foreground queries become "memory" for next frame
- Memory queries carry object identity, position, velocity across frames
- Motion-aware layer normalization encodes inter-frame displacement
import torch
import torch.nn as nn

class StreamPETRTemporalPropagation(nn.Module):
    """StreamPETR's query propagation across frames (schematic sketch).

    MotionAwareLayerNorm, get_warped_memory, and the detection head are
    assumed components; the sketch focuses on propagation mechanics.
    """
    def __init__(self, embed_dim=256, num_queries=900, top_k=256, num_heads=8):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_queries = num_queries
        self.top_k = top_k
        # Learnable object queries
        self.query_embedding = nn.Embedding(num_queries, embed_dim)
        self.query_pos = nn.Embedding(num_queries, embed_dim)
        # Temporal propagation transformer (batch-first to match (B, N, C))
        self.temporal_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Motion-aware layer norm
        self.motion_ln = MotionAwareLayerNorm(embed_dim)
        # Memory queue
        self.memory_queue = []  # (queries, positions, ego_motion)
def forward(self, image_features, ego_motion):
"""Process one frame with temporal propagation.
Args:
image_features: (B, N_cam, H*W, C) multi-view features
ego_motion: (B, 4, 4) ego-motion from last frame
"""
# Current frame queries
queries = self.query_embedding.weight.unsqueeze(0) # (1, 900, C)
# Temporal attention: current queries attend to memory
if self.memory_queue:
memory = self.get_warped_memory(ego_motion)
queries = self.temporal_attn(
query=queries,
key=memory,
value=memory
)[0]
# Apply motion-aware normalization
queries = self.motion_ln(queries, ego_motion)
# Spatial attention: queries attend to image features
queries = self.spatial_attn(
query=queries,
key=image_features.flatten(1, 2),
value=image_features.flatten(1, 2)
)[0]
# Generate detections
detections = self.detection_head(queries)
# Select top-K foreground queries for memory
scores = detections['scores'] # (B, 900)
top_k_idx = torch.topk(scores, self.top_k, dim=-1).indices
memory_queries = torch.gather(
queries, 1, top_k_idx.unsqueeze(-1).expand(-1, -1, self.embed_dim)
)
# Update memory queue
self.memory_queue.append(memory_queries.detach())
if len(self.memory_queue) > 3:
self.memory_queue.pop(0)
        return detections

StreamPETR performance (nuScenes):
| Configuration | mAP | NDS | AMOTA | FPS (A100) | FPS (Orin est.) |
|---|---|---|---|---|---|
| PETR (no temporal) | 39.1% | 45.5% | — | 18.5 | ~6 |
| StreamPETR-S (R50) | 45.0% | 55.0% | 44.7% | 31.7 | ~10 |
| StreamPETR-L (V2-99) | 55.0% | 63.6% | 57.1% | 12.1 | ~4 |
| StreamPETR-L (V2-99, TTA) | 57.1% | 65.5% | 65.3% | — | — |
Why StreamPETR is ideal for Orin:
- Query propagation adds only 256 queries × 256 channels = 128 KB memory overhead per frame (vs 20 MB for BEV; see the quick check after this list)
- Temporal attention over 256 memory queries takes <1ms
- Built-in tracking capability — object identity persists through queries
- No BEV grid needed — works directly in 3D query space
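A quick check of the memory arithmetic above (FP16, 2 bytes per value; plain Python, no assumptions beyond the figures already quoted):

queries, channels, bytes_fp16 = 256, 256, 2
print(queries * channels * bytes_fp16 / 1024)        # 128.0 KB query memory per frame

bev_h, bev_w, bev_c = 200, 200, 256
print(bev_h * bev_w * bev_c * bytes_fp16 / 1024**2)  # ~19.5 MB dense BEV cache per frame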
4.2 Sparse4D v3: Anchor-Based Temporal with Tracking
Sparse4D v3 (2024) extends sparse temporal detection with:
- Temporal instance denoising: Augments training by adding noise to ground-truth boxes across 3D space and time (see the sketch after this list)
- Quality estimation: Predicts detection quality for NMS-free inference
- ID-free tracking: Assigns track IDs by query identity — same query across frames = same object
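A minimal sketch of the denoising idea under stated assumptions (box format [x, y, z, w, l, h, yaw]; noise scales are illustrative, not the paper's values):

import torch

def make_denoising_queries(gt_boxes: torch.Tensor,
                           pos_noise: float = 0.3,
                           yaw_noise: float = 0.1) -> torch.Tensor:
    """Jitter ground-truth boxes into noisy 'denoising' queries.

    During training, the decoder must recover the clean box from the
    jittered one, which regularizes temporal matching.
    gt_boxes: (N, 7) as [x, y, z, w, l, h, yaw].
    """
    noisy = gt_boxes.clone()
    noisy[:, :3] += torch.randn_like(noisy[:, :3]) * pos_noise
    noisy[:, 6] += torch.randn_like(noisy[:, 6]) * yaw_noise
    return noisy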
Performance on nuScenes test:
- Sparse4D v3 (ResNet101): 63.0% mAP, 71.9% NDS, 67.7% AMOTA
- vs StreamPETR-L (with TTA): +5.9% mAP, +6.4% NDS — currently the sparse-query SOTA

4.3 MUTR3D and Track Queries
MUTR3D introduced the concept of persistent track queries — queries that live for the lifetime of an object, initialized at first detection and terminated when the object leaves the scene.
import torch
from scipy.optimize import linear_sum_assignment

class TrackQueryManager:
    """Manages long-lived track queries for persistent object tracking."""
    def __init__(self, max_tracks=100, max_age=10, embed_dim=256):
        self.active_tracks = {}  # track_id -> (query, age, state)
        self.next_id = 0
        self.max_age = max_age
        self.embed_dim = embed_dim
    def update(self, detection_queries, detection_scores, threshold=0.3):
        """Match detections to existing tracks, create/destroy tracks."""
        matched_det = set()
        matched_track = set()
        # Hungarian matching between detection queries and track queries
        if self.active_tracks:
            track_queries = torch.stack([t[0] for t in self.active_tracks.values()])
            cost_matrix = 1 - torch.cosine_similarity(
                detection_queries.unsqueeze(1),
                track_queries.unsqueeze(0),
                dim=-1
            )
            row_ind, col_ind = linear_sum_assignment(cost_matrix.cpu().numpy())
            for r, c in zip(row_ind, col_ind):
                if cost_matrix[r, c] < 0.5:  # Match threshold
                    track_id = list(self.active_tracks.keys())[c]
                    self.active_tracks[track_id] = (
                        detection_queries[r], 0, 'active'
                    )
                    matched_det.add(r)
                    matched_track.add(track_id)
                # Assignments above the cost threshold are simply left unmatched
        # Age unmatched tracks
        for tid in list(self.active_tracks.keys()):
            if tid not in matched_track:
                q, age, state = self.active_tracks[tid]
                if age >= self.max_age:
                    del self.active_tracks[tid]
                else:
                    self.active_tracks[tid] = (q, age + 1, 'lost')
        # Create new tracks for unmatched high-confidence detections
        for i in range(len(detection_queries)):
            if i not in matched_det and detection_scores[i] > threshold:
                self.active_tracks[self.next_id] = (
                    detection_queries[i], 0, 'active'
                )
                self.next_id += 1

5. LiDAR Temporal Accumulation
5.1 Multi-Sweep Point Cloud Aggregation
The simplest and most effective temporal method for LiDAR: concatenate ego-motion-compensated sweeps.
class MultiSweepAccumulator:
"""Accumulate ego-motion-compensated LiDAR sweeps.
Standard approach: 10 sweeps (1s at 10Hz) for nuScenes.
For airside at 5-25 km/h, 3-5 sweeps (0.3-0.5s) is optimal.
"""
def __init__(self, num_sweeps: int = 5, max_distance: float = 100.0):
self.num_sweeps = num_sweeps
self.max_distance = max_distance
self.sweep_buffer = [] # (points, timestamp, T_world_ego)
def add_sweep(self, points, timestamp, T_world_ego):
"""Add new sweep, maintaining FIFO buffer."""
self.sweep_buffer.append((points, timestamp, T_world_ego))
if len(self.sweep_buffer) > self.num_sweeps:
self.sweep_buffer.pop(0)
def get_accumulated(self):
"""Get all sweeps transformed to latest ego frame.
Returns:
accumulated: (N_total, 5) — x, y, z, intensity, time_offset
"""
if not self.sweep_buffer:
return np.empty((0, 5))
        _, latest_ts, T_world_latest = self.sweep_buffer[-1]
T_latest_world = np.linalg.inv(T_world_latest)
all_points = []
for points, ts, T_world_ego in self.sweep_buffer:
# Transform to latest ego frame
T_latest_past = T_latest_world @ T_world_ego
xyz = points[:, :3]
xyz_h = np.hstack([xyz, np.ones((len(xyz), 1))])
xyz_transformed = (T_latest_past @ xyz_h.T).T[:, :3]
# Add time offset as feature (seconds relative to current)
time_offset = np.full((len(points), 1), ts - latest_ts)
# Filter by distance
dist = np.linalg.norm(xyz_transformed, axis=1)
mask = dist < self.max_distance
augmented = np.hstack([
xyz_transformed[mask],
points[mask, 3:4], # intensity
time_offset[mask]
])
all_points.append(augmented)
        return np.vstack(all_points) if all_points else np.empty((0, 5))

5.2 CenterPoint Temporal Extension
CenterPoint's temporal extension concatenates multi-sweep pillar features before the backbone:
Frame t: Points → Pillarize → Pillar Features (C=64)
Frame t-1: Points → Ego-compensate → Pillarize → Pillar Features (C=64)
Frame t-2: Points → Ego-compensate → Pillarize → Pillar Features (C=64)
Concatenate → (C=192) → Conv reduce to C=64 → Standard backbone

Performance impact on nuScenes:
| Method | Sweeps | mAP | NDS | Latency (Orin INT8) |
|---|---|---|---|---|
| CenterPoint | 1 | 56.4% | 64.8% | 6.84ms |
| CenterPoint | 3 | 58.9% | 66.7% | 8.2ms |
| CenterPoint | 5 | 59.6% | 67.2% | 9.8ms |
| CenterPoint | 10 | 60.3% | 67.5% | 14.1ms |
Recommendation for the reference airside AV stack: 3-sweep accumulation provides 80% of the temporal benefit at only 1.4ms additional cost. Beyond 5 sweeps, motion compensation errors from GTSAM pose uncertainty dominate.
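A minimal sketch of the concat-and-reduce step from the diagram above, assuming per-sweep pillar feature maps are already ego-compensated (class name and shapes are illustrative):

import torch
import torch.nn as nn

class MultiSweepPillarFusion(nn.Module):
    """Concatenate per-sweep pillar features, then reduce back to C channels."""
    def __init__(self, channels: int = 64, num_sweeps: int = 3):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(channels * num_sweeps, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
    def forward(self, sweep_features):
        # sweep_features: list of (B, C, H, W) maps, one per ego-compensated sweep
        return self.reduce(torch.cat(sweep_features, dim=1))

fusion = MultiSweepPillarFusion(channels=64, num_sweeps=3)
out = fusion([torch.randn(1, 64, 256, 256) for _ in range(3)])  # (1, 64, 256, 256)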
5.3 Temporal Point Cloud Features
Beyond concatenation, learned temporal features capture motion patterns:
import torch
import torch.nn as nn

class TemporalPointFeatureEncoder(nn.Module):
    """Encode temporal information per point.

    Each point gets features from its time offset and displacement
    relative to previous frames. SinusoidalEncoding is an assumed
    sinusoidal positional-encoding module.
    """
def __init__(self, time_channels=16, motion_channels=32):
super().__init__()
# Sinusoidal time encoding (like positional encoding)
self.time_encoder = SinusoidalEncoding(time_channels)
# Motion feature from displacement
self.motion_mlp = nn.Sequential(
nn.Linear(3, motion_channels),
nn.ReLU(),
nn.Linear(motion_channels, motion_channels)
)
def forward(self, points, time_offsets, displacements=None):
"""
Args:
points: (N, 4) — x, y, z, intensity
time_offsets: (N,) — seconds relative to current
displacements: (N, 3) — motion vectors (if available from scene flow)
"""
time_features = self.time_encoder(time_offsets)
if displacements is not None:
motion_features = self.motion_mlp(displacements)
return torch.cat([points, time_features, motion_features], dim=-1)
else:
            return torch.cat([points, time_features], dim=-1)

5.4 4D Radar Temporal Accumulation
4D radar operates at 20 Hz with much sparser returns (~300 points/scan vs ~120K for LiDAR). Accumulation over 50-100 frames (2.5-5s) is standard:
Benefits for airside:
- 100-frame 4D radar accumulation: ~30K points (comparable to single LiDAR sweep)
- Doppler velocity is per-point: no need for multi-frame displacement estimation
- Immune to rain, fog, de-icing spray — temporal accumulation only improves already-clean data
- 5s history at 5 km/h = 7m baseline — excellent for slow-moving GSE tracking

6. Streaming Perception and Latency Compensation
6.1 The Streaming Perception Problem
Traditional perception evaluation assumes zero inference latency — detections correspond exactly to the input frame. In reality, by the time inference completes, the world has moved:
Time t0 t0 + Δ t0 + 2Δ
| | |
Input: [Frame 0] ──inference──→ [Results 0] arrives
[Frame 1] ──inference──→ [Results 1] arrives
At time t0 + Δ:
- Results 0 is available but refers to state at t0
- Frame 1 is captured but results not yet available
- Real objects have moved Δ × velocity since frame 0

At 25 km/h with 50ms inference latency, objects have moved 35 cm. At 5 km/h: 7 cm. For airside slow-speed operations, this matters less than highway driving, but still matters for close-proximity personnel detection.
6.2 ASAP Benchmark
The Autonomous-driving Streaming Perception (ASAP) benchmark evaluates perception under realistic latency constraints:
- Streaming AP (sAP): Evaluates detections against ground truth at the time results become available, not at input time (see the pairing sketch after this list)
- Standard mAP drops 5-15% when evaluated as sAP
- Methods must either be fast (reduce Δ) or predict ahead (compensate for Δ)
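A minimal sketch of the pairing rule under stated assumptions (timestamps in seconds; a result becomes usable at its emit time):

from typing import Optional

def result_for_gt_frame(gt_time: float,
                        result_emit_times: list) -> Optional[float]:
    """sAP pairing: each ground-truth frame is scored against the most
    recent result already emitted at that instant (None = missed frame)."""
    available = [t for t in result_emit_times if t <= gt_time]
    return max(available) if available else None

# Results for frames captured at t=0.0 and t=0.1 finish 50 ms later:
emits = [0.05, 0.15]
print(result_for_gt_frame(0.1, emits))  # 0.05 — GT at t=0.1 is scored against
# the result computed from the t=0.0 frame; the 0.1s of world motion in
# between is exactly the penalty sAP captures.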
6.3 Latency-Aware 3D Streaming Perception (LASP)
LASP (2025) addresses latency through:
- Latency-aware history integration: Extends query propagation into a continuous process using neural ODE, handling variable latency
- Latency-aware predictive detection: Predicts where objects will be at output time using trajectory estimation
class ExponentialMovingAverage:
    """Minimal EMA helper (assumed utility, shown inline for completeness)."""
    def __init__(self, alpha: float = 0.1, initial: float = 0.05):
        self.alpha = alpha
        self.value = initial
    def update(self, x: float):
        self.value = self.alpha * x + (1 - self.alpha) * self.value

class LatencyAwareDetector:
    """Compensate detection results for inference latency.

    Simple but effective: shift detected objects by
    velocity × latency_estimate.
    """
    def __init__(self, default_latency_ms: float = 50.0):
        self.default_latency = default_latency_ms / 1000.0
        self.latency_estimator = ExponentialMovingAverage(alpha=0.1)
def compensate(self, detections, measured_latency_ms=None):
"""Shift detections forward in time by latency estimate.
Args:
detections: list of {center_xyz, velocity_xyz, ...}
measured_latency_ms: actual inference time (for EMA update)
"""
if measured_latency_ms is not None:
self.latency_estimator.update(measured_latency_ms / 1000.0)
latency = self.latency_estimator.value
else:
latency = self.default_latency
compensated = []
for det in detections:
det_comp = det.copy()
# Constant velocity prediction
det_comp['center_xyz'] = (
det['center_xyz'] + det['velocity_xyz'] * latency
)
# Also shift heading by yaw_rate if available
if 'yaw_rate' in det:
det_comp['heading'] = det['heading'] + det['yaw_rate'] * latency
compensated.append(det_comp)
        return compensated

6.4 Dual-Path Streaming (for Camera Stacks)
For camera-based systems with higher latency (100-200ms), dual-path approaches run a lightweight detector on latest frames while the full model processes slightly older frames:
Path 1 (heavy, 150ms): Full BEVFormer on frame t-1 → accurate detections (delayed)
Path 2 (light, 30ms): YOLO on frame t → fast detections (less accurate)
Fuse: Use heavy detections as prior, light detections for update

Not relevant for the reference airside AV stack's LiDAR stack where CenterPoint runs in 6.84ms, but useful if camera fallback is activated.
7. Temporal Occupancy Prediction
7.1 Connection to World Models
Temporal occupancy methods bridge perception and prediction — they estimate what the 3D occupancy grid will look like in the future.
Already covered extensively in:
- 30-autonomy-stack/world-models/occupancy-flow-4d-scenes.md (UnO, OccSora, scene flow)
- 30-autonomy-stack/world-models/lidar-native-world-models.md (Copilot4D, LidarDM)
Key temporal aspects not covered elsewhere:
7.2 Streaming Occupancy Updates
Rather than recomputing full occupancy each frame, streaming updates modify the existing grid:
import numpy as np

class StreamingOccupancyGrid:
    """Incrementally update 3D occupancy using temporal consistency.

    warp_grid (ego-motion resampling of the voxel grid) is elided here.
    """
    def __init__(self, grid_size=(200, 200, 16), voxel_size=0.4):
        # Occupancy probability per voxel, initialized to an uninformative prior
        self.confidence = np.full(grid_size, 0.5, dtype=np.float32)
        self.update_count = np.zeros(grid_size, dtype=np.int32)
        self.voxel_size = voxel_size
    def update(self, new_observation, ego_motion):
        """Fuse new observation with existing grid.

        Uses Bayesian update: P(occupied | obs1, obs2) =
        P(obs2|occupied) * P(occupied|obs1) / P(obs2)
        """
        # Warp existing grid by ego-motion
        self.confidence = self.warp_grid(self.confidence, ego_motion)
        # Bayesian fusion (log-odds for numerical stability)
        log_odds_prior = np.log(self.confidence / (1 - self.confidence + 1e-7) + 1e-7)
        log_odds_obs = np.log(new_observation / (1 - new_observation + 1e-7) + 1e-7)
        log_odds_posterior = log_odds_prior + log_odds_obs
        self.confidence = 1 / (1 + np.exp(-log_odds_posterior))
        self.update_count += (new_observation > 0.1).astype(np.int32)
    def get_confident_occupancy(self, min_observations=3, min_confidence=0.7):
        """Return occupancy that meets both observation count and confidence."""
        mask = (self.update_count >= min_observations) & \
               (self.confidence >= min_confidence)
        return self.confidence * mask

7.3 Temporal Filtering for Transient Noise
Airside-specific transient phenomena that temporal filtering eliminates:
| Phenomenon | Duration | LiDAR Effect | Temporal Filter |
|---|---|---|---|
| De-icing spray | 5-15s | Dense false point cloud | Require 3+ consistent frames |
| Jet exhaust shimmer | Continuous | Range noise ±0.5m | Median filter across 5 frames |
| Rain/puddle splash | <1s | Random false points | Require persistence > 2 frames |
| Dust plume (from taxiing) | 2-5s | Gradual density reduction | Track and classify as transient |
| Snow/hail | Continuous | Sparse random points | Statistical outlier removal |
import numpy as np

class TemporalConsistencyFilter:
    """Filter detections requiring temporal consistency."""
    def __init__(self, persistence_threshold=3, window_size=5):
        self.persistence_threshold = persistence_threshold
        self.window_size = window_size
        self.detection_history = []  # List of detection sets per frame
    def filter(self, current_detections, ego_compensator):
        """Keep only detections seen in >= persistence_threshold of last window_size frames.

        ego_compensator is available for transforming historical detections
        into the current frame before association (omitted in this sketch).
        """
        self.detection_history.append(current_detections)
        if len(self.detection_history) > self.window_size:
            self.detection_history.pop(0)
        if len(self.detection_history) < self.persistence_threshold:
            return current_detections  # Not enough history
        # Count how many past frames each current detection appears in
        filtered = []
        for det in current_detections:
            count = 0
            for past_dets in self.detection_history[:-1]:
                for past_det in past_dets:
                    if self._is_same_object(det, past_det, distance_thresh=1.0):
                        count += 1
                        break
            if count >= self.persistence_threshold - 1:  # -1 because current frame counts
                filtered.append(det)
            elif det['class'] in ['personnel', 'aircraft']:
                # Safety-critical: lower threshold for people and aircraft
                if count >= 1:
                    filtered.append(det)
        return filtered
    def _is_same_object(self, det_a, det_b, distance_thresh=1.0):
        """Simple center-distance association (class-agnostic sketch)."""
        return np.linalg.norm(
            np.asarray(det_a['center_xyz']) - np.asarray(det_b['center_xyz'])
        ) < distance_thresh

8. Video Backbone Architectures
8.1 From Image to Video Backbones
Standard AV perception uses per-frame image backbones (ResNet, Swin, ViT). Video backbones process multiple frames jointly:
| Architecture | Type | Temporal Mechanism | FPS (A100) | Params | Key Paper |
|---|---|---|---|---|---|
| ResNet-50 | Image | None (single frame) | 200+ | 25M | He 2016 |
| Swin-T | Image | None | 120+ | 29M | Liu 2021 |
| TimeSformer | Video | Divided space-time attention | 45 | 121M | Bertasius 2021 |
| VideoMAE v2 | Video | Masked autoencoder pre-training | 30 | 633M | Wang 2023 |
| MViTv2-S | Video | Pooling attention + multi-scale | 60 | 35M | Li 2022 |
| InternVideo2 | Video | Unified 2D+3D with learnable gating | 20 | 6B | Wang 2024 |
| Hiera | Video | Hierarchical masked autoencoder | 80 | 51M | Ryali 2023 |
8.2 Practical Choice for Orin
Video backbones are NOT recommended for Orin-class hardware in the perception loop. The overhead of joint spatiotemporal attention is too high:
ResNet-50 on Orin INT8: ~4ms per frame
TimeSformer-S on Orin INT8: ~35ms per 4 frames (8.75ms/frame effective)
MViTv2-S on Orin FP16: ~25ms per 4 frames (6.25ms/frame effective)

The latency increase eliminates the benefit of temporal information. Instead:
- Use per-frame backbone + query-level temporal fusion (StreamPETR approach)
- Or per-frame backbone + feature-level temporal concatenation (SOLOFusion/BEVDet4D)
8.3 Video Backbones for Offline Analysis
Video backbones are valuable for:
- Data engine: Auto-labeling using VideoMAE features for temporal consistency
- Anomaly detection: InternVideo2 for fleet-scale incident analysis
- VLM co-pilot: StreamingVLM for real-time video narration at 1-2 Hz (non-safety-critical path)
9. Temporal Fusion for Multi-Modal Stacks
9.1 Cross-Modal Temporal Alignment
When fusing LiDAR and camera (a future reference airside AV stack capability), temporal alignment is critical:
LiDAR sweep: 100ms capture window, single timestamp at scan end
Camera frame: ~1ms exposure, effectively instantaneous
4D Radar: 50ms cycle
Misalignment at 25 km/h:
- LiDAR intra-scan: 0.7m max displacement
- LiDAR-camera sync error (1ms): 0.7cm (negligible)
- LiDAR-radar sync error (5ms): 3.5cm

9.2 Temporal Feature Alignment via Deformable Attention
import torch
import torch.nn as nn

class TemporalCrossModalFusion(nn.Module):
    """Fuse temporal features from different modalities.

    Each modality has its own temporal buffer; fusion happens in BEV
    space after individual temporal processing. The per-modality
    encoders and SinusoidalEncoding are assumed modules.
    """
    def __init__(self, embed_dim=256, num_modalities=3):
        super().__init__()
        # Per-modality temporal encoders
        self.lidar_temporal = MultiSweepEncoder(embed_dim)
        self.camera_temporal = BEVTemporalEncoder(embed_dim)
        self.radar_temporal = AccumulationEncoder(embed_dim)
# Cross-modal fusion with temporal awareness
self.fusion = nn.TransformerDecoderLayer(
d_model=embed_dim, nhead=8, batch_first=True
)
# Modality timestamp encoding
self.time_embed = SinusoidalEncoding(embed_dim)
def forward(self, lidar_sweeps, camera_frames, radar_sweeps, timestamps):
"""Fuse temporally-encoded multi-modal features."""
# Each modality processes its temporal history independently
lidar_bev = self.lidar_temporal(lidar_sweeps) # (B, H*W, C)
camera_bev = self.camera_temporal(camera_frames) # (B, H*W, C)
radar_bev = self.radar_temporal(radar_sweeps) # (B, H*W, C)
# Add timestamp encoding to handle async sensors
lidar_bev += self.time_embed(timestamps['lidar'])
camera_bev += self.time_embed(timestamps['camera'])
radar_bev += self.time_embed(timestamps['radar'])
# Concatenate and fuse
multi_modal = torch.cat([lidar_bev, camera_bev, radar_bev], dim=1)
fused = self.fusion(lidar_bev, multi_modal) # LiDAR queries, MM keys
        return fused

9.3 Asynchronous Temporal Fusion
CoBEVFlow (from collaborative perception) handles asynchronous data from multiple vehicles — the same technique applies to asynchronous sensors on a single vehicle:
Key idea: Learn a BEV flow field to warp features from
different timestamps to a common reference time.
BEV(t) + BEV(t-δ) → FlowNet → Flow(δ) → Warp(BEV(t-δ), Flow(δ)) → BEV_aligned(t)

10. Orin Deployment Strategies
10.1 Compute Budget for Temporal Methods
Starting from the reference airside AV stack's current pipeline and adding temporal capability:
| Component | Current (ms) | + Temporal (ms) | Method |
|---|---|---|---|
| Multi-sweep accumulation | — | +1.4 | 3-sweep ego-compensated concat |
| PointPillars/CenterPoint | 6.84 | 8.2 | 3-sweep input (wider pillars) |
| Temporal filtering | — | +0.5 | 5-frame persistence filter |
| Latency compensation | — | +0.1 | Constant velocity prediction |
| Total | 6.84 | 10.2 | +3.4ms overhead |
For future camera-augmented stack:
| Component | Latency (ms) | Notes |
|---|---|---|
| ResNet-50 backbone | 4.0 | INT8, per-frame |
| StreamPETR temporal | 2.5 | Query propagation + attention |
| BEV construction | 3.0 | Deformable spatial attention |
| Detection head | 1.0 | FP16 |
| Camera temporal total | 10.5 | Fits in 50ms cycle |
10.2 Memory Management
from collections import deque

class OrinTemporalMemoryManager:
    """Manage temporal feature memory within Orin constraints.

    Orin AGX 64GB: generous for temporal features
    Orin NX 16GB: requires careful management
    """
    MEMORY_BUDGET = {
        'orin_agx_64gb': {
            'lidar_sweeps': 5,    # 5 × 120K × 5 × 4B = ~12 MB
            'bev_cache': 3,       # 3 × 200×200×256 × 2B = ~60 MB
            'query_memory': 512,  # 512 × 256 × 2B = ~256 KB
            'occupancy_grid': 1,  # 200×200×16 × 4B = ~2.6 MB
        },
        'orin_nx_16gb': {
            'lidar_sweeps': 3,    # 3 × 120K × 5 × 4B = ~7.2 MB
            'bev_cache': 1,       # 1 × 200×200×256 × 2B = ~20 MB
            'query_memory': 256,  # 256 × 256 × 2B = ~128 KB
            'occupancy_grid': 1,  # 200×200×16 × 4B = ~2.6 MB (reduce resolution if needed)
        }
    }
    def __init__(self, platform='orin_agx_64gb'):
        self.budget = self.MEMORY_BUDGET[platform]
        self.lidar_buffer = deque(maxlen=self.budget['lidar_sweeps'])
        self.bev_cache = deque(maxlen=self.budget['bev_cache'])
        self.query_memory = None
Temporal models have state — they are not pure functions. TensorRT engines are stateless. Solutions:
- External state management: Keep temporal buffers in CPU/GPU memory outside TensorRT engine
- Unrolled model: Export model with fixed temporal depth (e.g., always 3 frames)
- Plugin approach: Custom TensorRT plugin for temporal attention with state
# Recommended approach: external state + TensorRT core
import torch
from collections import deque

class TemporalTRTRunner:
    """Run temporal perception with TensorRT engine + external state.

    load_trt_engine and warp_bev are assumed utilities.
    """
    def __init__(self, engine_path):
        self.engine = load_trt_engine(engine_path)
        self.temporal_state = {
            'prev_bev': None,
            'prev_queries': None,
            'sweep_buffer': deque(maxlen=3)
        }
def infer(self, current_input, ego_motion):
"""Single-frame inference with temporal state."""
# Prepare input: current + warped historical
if self.temporal_state['prev_bev'] is not None:
warped = warp_bev(self.temporal_state['prev_bev'], ego_motion)
trt_input = {
'current_features': current_input,
'history_bev': warped,
'history_queries': self.temporal_state['prev_queries']
}
else:
trt_input = {
'current_features': current_input,
'history_bev': torch.zeros_like(current_input),
'history_queries': torch.zeros(256, 256)
}
# Run TensorRT engine
outputs = self.engine.run(trt_input)
# Update state for next frame
self.temporal_state['prev_bev'] = outputs['bev_features']
self.temporal_state['prev_queries'] = outputs['top_k_queries']
        return outputs['detections']

11. Airside-Specific Temporal Challenges
11.1 Very Slow Object Speeds
Most temporal methods are tuned for highway/urban driving (30-130 km/h). Airside GSE operates at 1-25 km/h.
Impact on temporal perception:
- Motion estimation: 1 km/h = 0.028 m/frame at 10 Hz — below PointPillars voxel resolution (0.2m). Need >7 frames to detect 1-voxel displacement
- SOLOFusion stereo: 16 frames at 2 km/h = 0.9m baseline — insufficient for depth triangulation (need >3m for reasonable accuracy)
- Doppler: 4D radar Doppler at 1 km/h = 0.28 m/s — near noise floor. Need accumulation
Solutions:
- Increase temporal window: 5→10 frames for LiDAR accumulation at slow speeds
- Use GTSAM velocity directly (fused IMU/wheel/GNSS) rather than frame-to-frame displacement
- Weight Doppler observations by vehicle speed — ignore at <2 km/h (see the sketch after this list)
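A minimal sketch of the first and third solutions (thresholds are illustrative; exact values would be tuned on airside data):

def accumulation_window(ego_speed_kmh: float) -> int:
    """Longer LiDAR accumulation at crawl speeds: 10 frames (1s) below
    5 km/h, the standard 5 frames (0.5s) otherwise."""
    return 10 if ego_speed_kmh < 5.0 else 5

def doppler_weight(target_speed_kmh: float, cutoff_kmh: float = 2.0) -> float:
    """Gate Doppler observations: below ~2 km/h the radial velocity
    (~0.28 m/s at 1 km/h) sits near the radar noise floor, so ignore it."""
    return 0.0 if target_speed_kmh < cutoff_kmh else 1.0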
11.2 Extended Occlusion Durations
A baggage cart behind a B737 fuselage may be occluded for 30+ seconds while both move at 2 km/h. Standard track persistence (10 frames = 1s) loses the object.
class AirsideTrackPersistence:
"""Extended track persistence for airside occlusion patterns."""
# Airside-specific occlusion durations
PERSISTENCE_BY_CLASS = {
'aircraft': 300, # 30s — aircraft don't disappear
'baggage_tractor': 100, # 10s — may be behind fuselage
'belt_loader': 100, # 10s
'cargo_dolly': 150, # 15s — often in train behind tractor
'gpu': 50, # 5s — large, rarely fully occluded
'personnel': 30, # 3s — safety critical, shorter window
'fod': 200, # 20s — stationary, should persist
'catering_truck': 50, # 5s — large vehicle
'fuel_truck': 50, # 5s — large vehicle
'pushback_tug': 100, # 10s — often attached to aircraft
}
    def get_max_age(self, object_class: str, confidence: float) -> int:
        """Get maximum track age (frames) before deletion.

        Higher-confidence tracks persist longer.
        """
        base = self.PERSISTENCE_BY_CLASS.get(object_class, 30)
        # Scale by confidence: 0.5× base at confidence 0, up to 1.5× base at confidence 1
        return int(base * (0.5 + confidence))

11.3 Dynamic vs. Quasi-Static Classification
Airside objects have a bimodal speed distribution — either moving or parked. Temporal perception must classify this:
Moving: aircraft taxiing (5-30 km/h), baggage tractors (2-15 km/h), personnel walking (3-6 km/h)
Quasi-static: parked GSE, aircraft at stand, barriers, cones
Transition events (critical to detect):
- Tractor starts moving from parked (0→5 km/h in 2-3s)
- Personnel steps out from behind parked vehicle
- Pushback tug engages aircraft nose gear (coupled motion begins)

11.4 Turnaround Scene Evolution
Airport turnarounds follow predictable temporal patterns (~30-60 min sequences):
Phase 1 (0-2 min): Aircraft arrives, stops, chocks placed
Phase 2 (2-5 min): Bridge connects, belt loader positions
Phase 3 (5-25 min): Baggage/cargo unload, catering, cleaning
Phase 4 (25-40 min): Baggage/cargo load, fuel truck
Phase 5 (40-55 min): Equipment withdrawal, pushback tug connects
Phase 6 (55-60 min): Pushback, taxi out
Temporal perception value: Predict next phase from current scene state
→ Pre-position autonomous GSE for next task
→ Anticipate equipment movement patterns
→ Detect turnaround anomalies (equipment in wrong position)

12. Implementation Guide
12.1 ROS Integration Architecture
#!/usr/bin/env python
"""Temporal perception ROS node for the reference airside AV stack.

DetectionArray/TrackArray are assumed custom message types;
pointcloud2_to_array and odometry_to_matrix are assumed utilities.
"""
import rospy
from sensor_msgs.msg import PointCloud2
from nav_msgs.msg import Odometry
from collections import deque
class TemporalPerceptionNode:
"""Multi-sweep temporal perception with streaming output."""
def __init__(self):
rospy.init_node('temporal_perception')
# Parameters
self.num_sweeps = rospy.get_param('~num_sweeps', 3)
self.persistence_frames = rospy.get_param('~persistence_frames', 5)
self.latency_compensation = rospy.get_param('~latency_compensation', True)
# Components
self.ego_compensator = EgoMotionCompensator(max_history=20)
self.sweep_accumulator = MultiSweepAccumulator(self.num_sweeps)
self.consistency_filter = TemporalConsistencyFilter(
persistence_threshold=3, window_size=self.persistence_frames
)
self.latency_compensator = LatencyAwareDetector(default_latency_ms=10.0)
# Track manager for implicit tracking
self.track_manager = TrackQueryManager(max_tracks=100, max_age=50)
# Subscribers
self.pc_sub = rospy.Subscriber(
'/merged_points', PointCloud2, self.pointcloud_callback, queue_size=1
)
self.odom_sub = rospy.Subscriber(
'/gtsam/odometry', Odometry, self.odometry_callback, queue_size=10
)
# Publishers
self.det_pub = rospy.Publisher(
'/temporal_detections', DetectionArray, queue_size=1
)
self.track_pub = rospy.Publisher(
'/tracks', TrackArray, queue_size=1
)
# Timing
self.inference_times = deque(maxlen=100)
def odometry_callback(self, msg):
"""Update ego-motion from GTSAM localization."""
T = odometry_to_matrix(msg)
self.ego_compensator.update_pose(msg.header.stamp.to_sec(), T)
def pointcloud_callback(self, msg):
"""Process incoming LiDAR sweep with temporal context."""
t_start = rospy.Time.now()
# Convert ROS msg to numpy
points = pointcloud2_to_array(msg)
timestamp = msg.header.stamp.to_sec()
T_world_ego = self.ego_compensator.pose_history[-1][1]
# Accumulate sweeps
self.sweep_accumulator.add_sweep(points, timestamp, T_world_ego)
accumulated = self.sweep_accumulator.get_accumulated()
        # Run detector on accumulated point cloud
        # (self.detector — e.g., a CenterPoint TensorRT wrapper — is
        # assumed to be initialized elsewhere in the node)
        detections = self.detector.infer(accumulated)
# Temporal consistency filter
detections = self.consistency_filter.filter(
detections, self.ego_compensator
)
# Latency compensation
inference_time_ms = (rospy.Time.now() - t_start).to_sec() * 1000
self.inference_times.append(inference_time_ms)
if self.latency_compensation:
detections = self.latency_compensator.compensate(
detections, inference_time_ms
)
# Update tracks
self.track_manager.update(
[d['query'] for d in detections],
[d['score'] for d in detections]
)
# Publish
self.det_pub.publish(self.to_detection_msg(detections, msg.header))
        self.track_pub.publish(self.to_track_msg(self.track_manager.active_tracks, msg.header))

12.2 Phased Implementation Roadmap
| Phase | Duration | Deliverable | Cost | Impact |
|---|---|---|---|---|
| 1. Multi-sweep accumulation | 2 weeks | 3-sweep ego-compensated LiDAR | $5K | +2-3% NDS, velocity estimation |
| 2. Temporal consistency filter | 1 week | Persistence-based filtering | $2K | Eliminates transient false positives |
| 3. Latency compensation | 1 week | Constant-velocity prediction | $1K | Removes ~0.35m prediction error at 25 km/h |
| 4. Track persistence (airside) | 2 weeks | Extended occlusion handling | $5K | Object memory during 10-30s occlusion |
| 5. StreamPETR integration | 4 weeks | Full sparse query temporal | $15K | +6-8% NDS, implicit tracking |
| 6. Temporal occupancy | 3 weeks | Streaming occupancy grid | $10K | FOD persistence, area monitoring |
| Total | 13 weeks | Full temporal stack | $38K | +8-12% NDS, tracking, filtering |
12.3 Testing and Validation
# Temporal perception test scenarios for airside
TEMPORAL_TEST_SCENARIOS = {
'slow_approach': {
'description': 'Baggage tractor approaching at 2 km/h',
'metric': 'velocity_error < 0.5 km/h after 5 frames',
'frames': 20
},
'fuselage_occlusion': {
'description': 'Cart passes behind B737 fuselage',
'metric': 'track maintained during 10s occlusion, re-associated after',
'frames': 150
},
'personnel_crouching': {
'description': 'Ground crew crouches to inspect wheel',
'metric': 'detection maintained with confidence > 0.5',
'frames': 50
},
'deicing_spray': {
'description': 'De-icing spray creates point cloud artifact',
'metric': 'no false positive objects from spray',
'frames': 100
},
'turnaround_sequence': {
'description': 'Full 45-minute turnaround',
'metric': 'all GSE tracked, correct phase classification',
'frames': 27000
},
'start_from_parked': {
'description': 'Stationary GSE begins moving',
'metric': 'motion detected within 3 frames (0.3s)',
'frames': 30
}
}

13. Key Takeaways
Multi-sweep LiDAR accumulation is the highest-value temporal method: 3-sweep ego-compensated concatenation adds only 1.4ms but provides velocity estimation, point densification, and +2.5% mAP. Implement this first.
StreamPETR provides the best temporal perception per FLOP: Object query propagation adds <3ms and 128 KB memory while providing +6-8% NDS and implicit multi-object tracking without explicit data association.
Sparse4D v3 is the sparse query SOTA: 71.9% NDS and 67.7% AMOTA on nuScenes test — transforms detection into a unified detection+tracking system with temporal denoising.
Dense BEV temporal fusion is too expensive for Orin: BEVFormer temporal self-attention requires 20 MB per cached frame. For LiDAR-primary stacks, multi-sweep accumulation achieves comparable benefit at 1/100th the memory.
Slow airside speeds require longer temporal windows: Standard 5-frame (0.5s) windows are insufficient for 1-5 km/h objects. Use 10+ frames (1s+) for displacement-based velocity, or leverage GTSAM velocity estimates directly.
Latency compensation is nearly free and always beneficial: Constant-velocity forward prediction by inference_time × velocity costs <0.1ms. At 25 km/h with 10ms inference, this recovers 7cm of prediction error.
Temporal filtering eliminates airside transient noise: De-icing spray, jet blast shimmer, puddle splashes, and dust plumes cause per-frame false positives that 3-frame persistence filtering removes at zero accuracy cost.
Extended track persistence is essential for airside: Standard 1s track timeout loses objects during 10-30s aircraft fuselage occlusions. Class-dependent persistence (30s for GSE, 3s for personnel) maintains tracks through typical airside occlusion patterns.
Video backbones are not appropriate for Orin real-time: TimeSformer/MViTv2 add 20-30ms overhead vs per-frame ResNet. Use per-frame backbone + query-level temporal fusion instead. Video backbones are valuable for offline data engine analysis.
4D radar temporal accumulation provides unique value: 100-frame (5s) radar accumulation creates dense point clouds (~30K points) immune to weather, de-icing, and jet blast — complementary to LiDAR temporal methods.
Turnaround phase detection is a temporal perception application: 30-60 minute turnaround sequences follow predictable patterns. Scene-level temporal understanding enables predictive GSE positioning and anomaly detection.
Intra-scan motion correction matters for multi-LiDAR fusion: 100ms RoboSense rotation at 25 km/h creates 0.7m intra-scan distortion. IMU-based per-point motion correction is essential for temporal consistency.
TensorRT requires external state management for temporal models: TRT engines are stateless. Keep temporal buffers (BEV cache, query memory) in external GPU memory and pass as additional inputs.
Streaming occupancy grids provide persistent area monitoring: Bayesian occupancy updates across frames enable FOD persistence monitoring and area-based anomaly detection that single-frame detection cannot provide.
Asynchronous multi-modal temporal fusion uses flow-based warping: When LiDAR (10 Hz), camera (30 Hz), and radar (20 Hz) arrive at different times, learned BEV flow compensates for temporal misalignment.
Total implementation: $38K over 13 weeks: Phased from simple multi-sweep (2 weeks, $5K) through full StreamPETR integration (4 weeks, $15K). Each phase delivers independent value.
14. References
- Wang et al., "StreamPETR: Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection," ICCV 2023
- Li et al., "BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers," ECCV 2022
- Park et al., "SOLOFusion: Time Will Tell — New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection," ICLR 2023
- Lin et al., "Sparse4D v3: Advancing End-to-End 3D Detection and Tracking," arXiv 2024
- Yin et al., "Center-based 3D Object Detection and Tracking," CVPR 2021
- Chen et al., "ASAP: Are We Ready for Vision-Centric Driving Streaming Perception?" CVPR 2023
- Li et al., "LASP: Towards Latency-aware 3D Streaming Perception for Autonomous Driving," arXiv 2025
- Zhang et al., "MUTR3D: A Multi-camera Tracking Framework via 3D-to-2D Queries," CVPR 2022
- Wang et al., "StreamingFlow: Streaming Occupancy Forecasting with Asynchronous Multi-modal Data Streams," NeurIPS 2023
- Xu et al., "CoBEVFlow: Robust Asynchronous Collaborative Perception via BEV Flow," NeurIPS 2023
- Tang et al., "BEVDet4D: 3D Object Detection with Temporal Cues in Multi-Camera BEV Space," arXiv 2022
- Li et al., "MViTv2: Improved Multiscale Vision Transformers," CVPR 2022
- Bertasius et al., "Is Space-Time Attention All You Need for Video Understanding?" ICML 2021