
WildDet3D

What It Is

  • WildDet3D is a promptable monocular 3D object detection model for in-the-wild images.
  • It predicts metric 3D boxes from a single RGB image and can use optional depth or camera intrinsics.
  • The method supports text prompts, point prompts, and 2D box prompts.
  • It is open-vocabulary in the sense that prompts can specify categories outside narrow 3D detection taxonomies.
  • The paper also introduces WildDet3D-Data and WildDet3D-Bench.
  • WildDet3D-Data is described as containing over 1M images across 13.5K categories.
  • The model targets spatial intelligence use cases such as robotics, AR, and open-world perception.

Core Technical Idea

  • Unify promptable 2D object localization with monocular 3D box regression.
  • Use strong pretrained visual backbones for semantic features and depth/geometric features.
  • Allow optional depth signals at inference time to resolve monocular scale ambiguity.
  • Convert different prompt types into a shared prompt representation for the detector.
  • Predict 2D localization, 3D box geometry, depth-related quantities, and confidence jointly.
  • Scale open-world training by generating 3D boxes from existing 2D annotations and human verification.
  • Evaluate both prompt flexibility and zero-shot geometry transfer.
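
The "shared prompt representation" above can be illustrated with a minimal sketch. All names (`embed_text`, `unify_prompts`) and the toy encoders are assumptions for illustration, not the paper's API; the point is only that text, point, and box prompts all collapse into one token sequence.

```python
# Hypothetical sketch: mapping heterogeneous prompts (text, point, 2D box)
# into a single token sequence for a promptable detector.
import numpy as np

D = 8  # toy embedding width

def embed_text(prompt: str) -> np.ndarray:
    # Stand-in for a real text encoder: hash the string into a vector.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal((1, D))

def embed_point(x: float, y: float) -> np.ndarray:
    # Encode a click as normalized coordinates zero-padded to width D.
    v = np.zeros((1, D))
    v[0, :2] = [x, y]
    return v

def embed_box(x0, y0, x1, y1) -> np.ndarray:
    # A 2D box becomes its two corner points.
    return np.concatenate([embed_point(x0, y0), embed_point(x1, y1)])

def unify_prompts(prompts) -> np.ndarray:
    """Concatenate per-prompt tokens into one sequence for the detector."""
    tokens = []
    for kind, value in prompts:
        if kind == "text":
            tokens.append(embed_text(value))
        elif kind == "point":
            tokens.append(embed_point(*value))
        elif kind == "box":
            tokens.append(embed_box(*value))
    return np.concatenate(tokens, axis=0)

seq = unify_prompts([("text", "baggage cart"), ("box", (0.1, 0.2, 0.5, 0.6))])
print(seq.shape)  # (3, 8): one text token plus two corner tokens
```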

Inputs and Outputs

  • Inputs include an RGB image plus one or more prompts: text, point, or box.
  • Camera intrinsics can be supplied; if missing, the model can estimate intrinsics internally.
  • Optional sparse or dense depth can be fused to improve metric 3D localization.
  • Outputs include 2D boxes, 3D bounding boxes, depth maps, and predicted intrinsics.
  • The 3D box output contains position, dimensions, and orientation in camera-centric coordinates.
  • Scores combine 2D objectness and 3D confidence.
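
The output contract above can be sketched as a small data structure. Field names are assumptions, and the geometric mean is one plausible way to "combine 2D objectness and 3D confidence"; the paper's exact combination rule is not given here.

```python
# Minimal sketch of the stated per-detection output; schema is hypothetical.
import math
from dataclasses import dataclass

@dataclass
class Detection3D:
    box2d: tuple          # (x0, y0, x1, y1) in pixels
    center: tuple         # (x, y, z) in camera-centric metres
    dimensions: tuple     # (w, h, l) in metres
    yaw: float            # orientation about the camera up axis, radians
    objectness2d: float   # 2D localization confidence in [0, 1]
    confidence3d: float   # 3D geometry confidence in [0, 1]

    @property
    def score(self) -> float:
        # Assumed combination rule: geometric mean of the two confidences.
        return math.sqrt(self.objectness2d * self.confidence3d)

det = Detection3D((100, 80, 220, 190), (1.2, 0.1, 8.5),
                  (1.8, 1.5, 4.2), 0.3, 0.9, 0.64)
print(round(det.score, 3))  # 0.759
```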

Architecture and Evaluation Protocol

  • The public model card describes a SAM 3 ViT backbone for visual features.
  • A depth backend based on LingBot-Depth and DINOv2-style features supplies geometric latents.
  • A depth fusion module injects depth information into visual features without destabilizing pretrained features.
  • The promptable detector encodes text prompts and geometric prompts into a unified sequence.
  • The 3D head uses camera geometry, depth latents, and decoder features to regress 3D boxes.
  • The paper normalizes box orientation to reduce ambiguity for symmetric boxes.
  • Evaluation uses WildDet3D-Bench, Omni3D, Argoverse 2, ScanNet, and depth-augmented settings.
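
One common way to normalize orientation for boxes with 180-degree symmetry is to fold yaw into [-pi/2, pi/2). The paper's exact normalization is not specified above, so treat this as an illustrative assumption:

```python
# Hedged sketch: fold yaw into [-pi/2, pi/2) so that a box and its
# 180-degree rotation map to the same target angle.
import math

def normalize_yaw(yaw: float) -> float:
    """Fold yaw into [-pi/2, pi/2), exploiting front/back box symmetry."""
    yaw = (yaw + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    if yaw >= math.pi / 2:
        yaw -= math.pi
    elif yaw < -math.pi / 2:
        yaw += math.pi
    return yaw

print(round(normalize_yaw(3 * math.pi / 4), 4))  # -0.7854, i.e. -pi/4
```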

Training and Evaluation

  • WildDet3D-Data is built by generating candidate 3D boxes from 2D annotations and retaining human-verified boxes.
  • The paper reports 22.6 AP3D with text prompts and 24.8 AP3D with box prompts on WildDet3D-Bench.
  • On Omni3D, it reports 34.2 AP3D with text prompts and 36.4 AP3D with box prompts.
  • In zero-shot evaluation, it reports 40.3 ODS on Argoverse 2 and 48.9 ODS on ScanNet.
  • Adding depth cues at inference gives a reported average gain of 20.7 AP across settings.
  • The model card reports about 1.2B parameters, so training and inference are compute-heavy.
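
The data pipeline described above (candidate 3D boxes from 2D annotations, retaining only human-verified ones) can be sketched as follows. The lifting rule here, back-projecting the 2D box centre at an associated depth through a pinhole model, is an assumption for illustration, not the paper's method.

```python
# Hypothetical sketch of the 2D-to-3D lifting step in the data pipeline.
def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth z to camera coords."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

def lift_candidates(annotations, fx, fy, cx, cy):
    """Lift each human-verified 2D annotation to a candidate 3D box centre."""
    out = []
    for ann in annotations:
        if not ann["verified"]:  # retain only human-verified boxes
            continue
        u = (ann["box2d"][0] + ann["box2d"][2]) / 2
        v = (ann["box2d"][1] + ann["box2d"][3]) / 2
        out.append(backproject(u, v, ann["depth"], fx, fy, cx, cy))
    return out

anns = [{"box2d": (100, 100, 200, 200), "depth": 10.0, "verified": True},
        {"box2d": (0, 0, 50, 50), "depth": 5.0, "verified": False}]
centres = lift_candidates(anns, fx=500, fy=500, cx=320, cy=240)
print(centres)  # one candidate survives the verification filter
```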

Strengths

  • Handles multiple interaction modes in one 3D detector.
  • Can exploit real depth when available, but does not require it for every image.
  • The large, diverse data pipeline directly attacks the long-tail 3D annotation bottleneck.
  • Text and box prompts make it compatible with open-vocabulary 2D detectors and human review tools.
  • Intrinsics prediction helps with web or ad hoc imagery where camera metadata is missing.
  • Strong zero-shot results indicate useful transfer beyond the training domain.

Failure Modes

  • Monocular 3D remains sensitive to scale, camera pitch, and object truncation when depth is absent.
  • Because optional depth improves accuracy substantially, camera-only performance may fall short of safety margins.
  • Text prompts can match visually similar but operationally distinct objects.
  • The system is heavy for edge deployment without distillation or batching.
  • Single-image inference does not enforce temporal consistency.
  • Licensing and dependency constraints around foundation-model weights must be checked before production use.

Airside AV Fit

  • WildDet3D is promising for operator-assisted labeling of rare apron objects from camera footage.
  • Text and box prompts can bootstrap 3D annotations for equipment not present in closed-set driving datasets.
  • The optional depth path maps well to camera-plus-LiDAR airside vehicles.
  • It should be validated on aircraft-scale scenes, low-profile equipment, night glare, rain, and reflective surfaces.
  • It is less directly suited to hard real-time safety perception until latency and calibration behavior are measured.
  • A practical airside stack would use WildDet3D as an offline mining or assisted-perception module first.

Implementation Notes

  • Provide calibrated intrinsics and sparse LiDAR depth whenever possible; do not rely on intrinsics prediction in safety runs.
  • Normalize prompts with airport-specific synonyms such as "baggage cart", "dolly", and "ULD cart".
  • Store the prompt text, prompt type, depth availability, and camera metadata with each prediction for auditability.
  • Add temporal smoothing or tracker association if outputs feed runtime obstacle reasoning.
  • Benchmark each prompt mode separately; box-prompt success does not imply text-prompt success.
  • Use it to generate candidate labels, then require human or rule-based verification before training production detectors.
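
The auditability note above suggests a concrete record per prediction. A minimal schema sketch (field names are assumptions, not a project standard):

```python
# Hypothetical per-prediction audit record for traceability.
import json
import datetime

def audit_record(prompt_text, prompt_type, depth_available, intrinsics, score):
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "prompt_text": prompt_text,
        "prompt_type": prompt_type,        # "text" | "point" | "box"
        "depth_available": depth_available,
        "camera_intrinsics": intrinsics,   # None if intrinsics were predicted
        "score": score,
    }

rec = audit_record("baggage cart", "text", True,
                   {"fx": 1200.0, "fy": 1200.0, "cx": 960.0, "cy": 540.0},
                   0.82)
print(json.dumps(rec, indent=2))  # JSON-serializable for log storage
```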

Sources

Research notes collected from public sources.