Open3DTrack

What It Is

  • Open3DTrack is a 2024-2025 open-vocabulary 3D multi-object tracking method.
  • The full title is "Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking."
  • It formulates the open-vocabulary 3D tracking task: track known and novel object categories in 3D space.
  • It introduces dataset splits for open-vocabulary tracking scenarios.
  • It adapts a 3D tracking framework with open-vocabulary 2D detections and tracking-specific scoring.
  • It fills the gap between open-vocabulary 3D detection and persistent 3D tracks.

Core Technical Idea

  • Use 2D open-vocabulary detections to provide category information for object classes not covered by a closed-set 3D detector.
  • Link those categories to 3D object proposals from existing 3D detectors.
  • Train the tracker to operate more class-agnostically so it can preserve trajectories for unseen classes.
  • Add confidence score prediction because 2D open-vocabulary confidence does not directly represent 3D proposal objectness.
  • Add track consistency scoring to stabilize labels and identities over time.
  • Evaluate base and novel classes separately so average tracking metrics do not hide novel-class collapse.
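As a rough illustration, the track consistency idea above can be sketched as a majority vote over a track's recent frame labels. This is a hypothetical simplification: the paper learns a consistency score rather than hard voting, and the function name and threshold below are assumptions.

```python
from collections import Counter

def consistent_label(frame_labels, min_support=0.5):
    """Pick the majority label across a track's recent frames.

    Return the majority label only if it appears in at least
    `min_support` of the frames; otherwise keep the most recent
    label. Illustrative sketch, not the paper's learned scorer.
    """
    label, count = Counter(frame_labels).most_common(1)[0]
    if count / len(frame_labels) >= min_support:
        return label
    return frame_labels[-1]
```

A vote like this suppresses single-frame label flicker (e.g. one "cart" frame inside a run of "forklift" frames) without freezing the label forever.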

Inputs and Outputs

  • Input: 3D object proposals from a detector such as CenterPoint, MEGVII, or BEVFusion.
  • Input: 2D open-vocabulary detections or class prompts, for example from a YOLO-World-style detector.
  • Input metadata: camera-LiDAR calibration, timestamps, ego pose, and frame sequence.
  • Training input: 3D tracking labels for base classes and pseudo labels from 2D open-vocabulary detections.
  • Output: 3D object tracks with positions, velocities, class labels, and confidence scores.
  • Output: tracks for both known and novel categories under the evaluation split.
  • It does not produce dense occupancy or freespace.
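A minimal sketch of one output track record, with illustrative field names that are not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class Track3D:
    """Hypothetical container for one tracked object per frame."""
    track_id: int
    label: str            # may originate from an open-vocabulary prompt
    position: tuple       # (x, y, z) in the ego or global frame
    velocity: tuple       # (vx, vy) in the BEV plane
    objectness: float     # learned 3D proposal confidence
    semantic_score: float # 2D open-vocabulary classification score
    is_novel: bool = False  # True if outside the base-class taxonomy
```

Keeping `objectness` and `semantic_score` as separate fields mirrors the point above that 2D open-vocabulary confidence does not stand in for 3D objectness.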

Architecture or Pipeline

  • Generate 3D proposals from a standard 3D detector.
  • Run 2D open-vocabulary detection over camera frames for base and novel categories.
  • Associate 2D detections with 3D proposals through projection and matching.
  • Use a 3DMOTFormer-style tracking framework as the base tracker.
  • Remove class-specific assumptions where possible and apply class-agnostic ground-truth assignment.
  • Predict proposal confidence scores for 3D tracking rather than inheriting unreliable 2D scores.
  • Use track consistency scoring so unknown detections receive stable labels across frames.
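The 2D-3D association step can be sketched as projecting 3D proposal centers into the image and testing them against open-vocabulary 2D boxes. Function and variable names below are assumptions; a real pipeline would use the dataset's calibration API and IoU-based Hungarian matching rather than first-containment.

```python
import numpy as np

def project_points(K, T_cam_lidar, pts_lidar):
    """Project Nx3 LiDAR points into pixel coordinates.

    K: 3x3 camera intrinsics; T_cam_lidar: 4x4 LiDAR-to-camera
    extrinsics (hypothetical argument layout). Returns Nx2 pixel
    coordinates and a mask of points in front of the camera.
    """
    pts_h = np.hstack([pts_lidar, np.ones((len(pts_lidar), 1))])
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    in_front = cam[:, 2] > 0.1
    px = (K @ cam.T).T
    return px[:, :2] / px[:, 2:3], in_front

def match_center_to_boxes(center_px, boxes_2d):
    """Return the index of the first 2D box (x1, y1, x2, y2)
    containing the projected center, or -1 if none does."""
    x, y = center_px
    for i, (x1, y1, x2, y2) in enumerate(boxes_2d):
        if x1 <= x <= x2 and y1 <= y <= y2:
            return i
    return -1
```

Even this toy version shows why the step is calibration-sensitive: a small error in `T_cam_lidar` shifts every projected center and breaks containment tests for distant objects.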

Training and Evaluation

  • Open3DTrack evaluates on nuScenes with open-vocabulary tracking splits.
  • The paper reports overall AMOTA values of 0.567, 0.590, and 0.536 across its three splits after adaptation.
  • It evaluates generalization across different 3D proposal sources, including CenterPoint, MEGVII, and BEVFusion.
  • Ablations identify confidence score prediction and track consistency scoring as important for novel-class tracking.
  • Novel-class AMOTA, AMOTP, identity switches, and class stability should be reported separately from base classes.
  • Performance can change depending on proposal quality and how 2D open-vocabulary detections are lifted to 3D.
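The separate base/novel reporting above can be sketched as splitting per-class AMOTA before averaging, so a strong base mean cannot mask novel-class collapse. The class names and values in the example are illustrative, not results from the paper.

```python
def split_metrics(per_class_amota, base_classes):
    """Average a per-class metric separately over base and novel
    classes. Returns (base_mean, novel_mean); NaN if a side is empty."""
    base = [v for k, v in per_class_amota.items() if k in base_classes]
    novel = [v for k, v in per_class_amota.items() if k not in base_classes]
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return mean(base), mean(novel)
```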

Strengths

  • Makes open-vocabulary 3D perception persistent over time instead of frame-local.
  • Compatible with mature 3D proposal detectors.
  • Separates the objectness/proposal problem from the open-vocabulary semantic labeling problem.
  • Track consistency helps reduce flicker for novel categories.
  • Useful for active learning because novel-category tracks are easier to review than isolated detections.
  • Provides evaluation splits that make closed-set overfitting visible.

Failure Modes

  • Novel objects still depend on the 3D proposal detector generating a usable box.
  • 2D-to-3D association is sensitive to calibration, occlusion, and sparse LiDAR returns.
  • Open-vocabulary 2D labels can be unstable across views and frames.
  • Class-agnostic tracking can improve continuity while increasing localization error for some categories.
  • The method tracks boxes, so irregular objects such as tow bars, hoses, chocks, and aircraft parts may be poorly represented.
  • It does not prove freespace or occupancy absence.

Airside AV Fit

  • Strong fit for long-tail GSE and temporary objects that are not in road-driving taxonomies.
  • Useful for tracking rare equipment once detected: lavatory trucks, GPUs, tow bars, belt loaders, dollies, cones, chocks, and maintenance stands.
  • Persistent open-vocabulary tracks can feed operator review and data labeling workflows.
  • The method should be paired with dense LiDAR/radar occupancy near aircraft and personnel.
  • Airside prompts and class names need a controlled synonym list so tracks do not change labels every frame.
  • Novel tracks should trigger conservative behavior only when their geometry intersects the path or no-go buffer.
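The controlled synonym list above can be sketched as a canonicalization table applied before labels reach the tracker. The airside label variants below are hypothetical examples, not a vetted vocabulary.

```python
# Hypothetical mapping from open-vocabulary label variants to one
# canonical airside class name.
SYNONYMS = {
    "gpu": "ground power unit",
    "ground power cart": "ground power unit",
    "tug": "tow tractor",
    "pushback tug": "tow tractor",
}

def canonicalize(label: str) -> str:
    """Map a raw open-vocabulary label to its canonical name so a
    track does not change labels from frame to frame."""
    key = label.strip().lower()
    return SYNONYMS.get(key, key)
```

Canonicalizing before track consistency scoring means the vote is over a small stable vocabulary rather than free-form detector strings.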

Implementation Notes

  • Maintain separate confidence fields for 3D proposal objectness, open-vocabulary semantic score, and track consistency.
  • Store the text prompt or vocabulary item that created each novel-class label.
  • Validate camera-LiDAR projection under vibration, thermal drift, and wide-FOV camera distortion.
  • Use airside-specific 3D size priors carefully; do not force unknown objects into road-vehicle dimensions.
  • Review false novel tracks around aircraft liveries, signage, reflections, and painted ramp markings.
  • Feed high-value novel tracks into the data flywheel for class promotion and retraining.
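The size-prior caution above can be sketched as a check that is simply skipped for unknown classes, so novel objects are flagged for human review instead of being forced into road-vehicle dimensions. The prior values are illustrative.

```python
def plausible_size(dims, priors, cls):
    """Check box dimensions (l, w, h) against a per-class size prior.

    priors maps class name -> ((l_min, w_min, h_min), (l_max, w_max,
    h_max)). Unknown classes pass the check by design: route them to
    review rather than rejecting or resizing their boxes.
    """
    if cls not in priors:
        return True
    lo, hi = priors[cls]
    return all(lo[i] <= dims[i] <= hi[i] for i in range(3))
```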

Sources

Research notes compiled from publicly available sources.