WildDet3D
What It Is
- WildDet3D is a promptable monocular 3D object detection model for in-the-wild images.
- It predicts metric 3D boxes from a single RGB image and can use optional depth or camera intrinsics.
- The method supports text prompts, point prompts, and 2D box prompts.
- It is open-vocabulary in the sense that prompts can specify categories outside narrow 3D detection taxonomies.
- The paper also introduces WildDet3D-Data and WildDet3D-Bench.
- WildDet3D-Data is described as containing over 1M images across 13.5K categories.
- The model targets spatial intelligence use cases such as robotics, AR, and open-world perception.
Core Technical Idea
- Unify promptable 2D object localization with monocular 3D box regression.
- Use strong pretrained visual backbones for semantic features and depth/geometric features.
- Allow optional depth signals at inference time to resolve monocular scale ambiguity.
- Convert different prompt types into a shared prompt representation for the detector (sketched after this list).
- Predict 2D localization, 3D box geometry, depth-related quantities, and confidence jointly.
- Scale open-world training by generating 3D boxes from existing 2D annotations and human verification.
- Evaluate both prompt flexibility and zero-shot geometry transfer.
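A minimal sketch of how heterogeneous prompts could be folded into one token sequence, assuming a decoder that attends to prompt tokens; the class name `PromptEncoder`, the dimensions, and the type-embedding scheme are illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch: unifying text, point, and box prompts into one token
# sequence for a promptable detector. Names and dimensions are assumptions.
import torch
import torch.nn as nn


class PromptEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.point_proj = nn.Linear(2, dim)      # (x, y), normalized image coords
        self.box_proj = nn.Linear(4, dim)        # (x1, y1, x2, y2), normalized
        self.type_embed = nn.Embedding(3, dim)   # 0=text, 1=point, 2=box

    def forward(self, text_emb=None, points=None, boxes=None):
        tokens = []
        if text_emb is not None:   # text_emb: (n_text, dim) from any text encoder
            tokens.append(text_emb + self.type_embed.weight[0])
        if points is not None:     # points: (n_points, 2)
            tokens.append(self.point_proj(points) + self.type_embed.weight[1])
        if boxes is not None:      # boxes: (n_boxes, 4)
            tokens.append(self.box_proj(boxes) + self.type_embed.weight[2])
        return torch.cat(tokens, dim=0)  # shared prompt sequence for the decoder


# Usage: three prompt types become one sequence the detector can attend to.
enc = PromptEncoder()
seq = enc(text_emb=torch.randn(1, 256),
          points=torch.tensor([[0.4, 0.6]]),
          boxes=torch.tensor([[0.1, 0.2, 0.5, 0.8]]))
print(seq.shape)  # torch.Size([3, 256])
```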
Inputs and Outputs
- Inputs include an RGB image plus one or more prompts: text, point, or box.
- Camera intrinsics can be supplied; if missing, the model can estimate intrinsics internally.
- Optional sparse or dense depth can be fused to improve metric 3D localization.
- Outputs include 2D boxes, 3D bounding boxes, depth maps, and predicted intrinsics.
- The 3D box output contains position, dimensions, and orientation in camera-centric coordinates.
- Scores combine 2D objectness and 3D confidence (a schema sketch follows this list).
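A minimal input/output schema sketch under the bullets above; the field names and the geometric-mean score combination are assumptions, not the released interface.

```python
# Hypothetical I/O schema; field names are illustrative, not the released API.
from dataclasses import dataclass
from typing import Literal, Optional
import numpy as np


@dataclass
class Prompt:
    kind: Literal["text", "point", "box"]
    text: Optional[str] = None         # e.g. "baggage cart"
    point_xy: Optional[tuple] = None   # pixel coordinates
    box_xyxy: Optional[tuple] = None   # 2D box in pixels


@dataclass
class Detection3D:
    box2d_xyxy: np.ndarray   # (4,) pixels
    center_xyz: np.ndarray   # (3,) metres, camera-centric
    dims_lwh: np.ndarray     # (3,) length, width, height in metres
    yaw: float               # orientation about the camera's up axis
    score_2d: float          # 2D objectness
    score_3d: float          # 3D confidence

    @property
    def score(self) -> float:
        # Combined score; the geometric mean is one common choice (an
        # assumption here, not the paper's stated formula).
        return float(np.sqrt(self.score_2d * self.score_3d))
```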
Architecture and Evaluation Protocol
- The public model card describes a SAM 3 ViT backbone for visual features.
- A depth backend based on LingBot-Depth and DINOv2-style features supplies geometric latents.
- A depth fusion module injects depth information into visual features without destabilizing pretrained features (see the fusion sketch after this list).
- The promptable detector encodes text prompts and geometric prompts into a unified sequence.
- The 3D head uses camera geometry, depth latents, and decoder features to regress 3D boxes.
- The paper normalizes box orientation to reduce ambiguity for symmetric boxes (see normalize_yaw in the sketch below).
- Evaluation uses WildDet3D-Bench, Omni3D, Argoverse 2, ScanNet, and depth-augmented settings.
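The exact fusion mechanism is not detailed above, so the sketch below shows one common pattern, a zero-initialized gated residual injection, that matches the stated goal of leaving pretrained features intact, plus a yaw-normalization helper for symmetric boxes. All names are assumptions.

```python
# Hypothetical sketches; the paper's exact mechanisms may differ.
import math
import torch
import torch.nn as nn


class GatedDepthFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # Gate initialized to zero so fusion starts as an identity mapping,
        # leaving the pretrained visual features untouched at step 0.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, visual: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # visual, depth: (batch, tokens, dim) feature sequences
        return visual + torch.tanh(self.gate) * self.proj(depth)


def normalize_yaw(yaw: float) -> float:
    """Map yaw into [-pi/2, pi/2) so boxes with 180-degree symmetry
    (e.g. carts, containers) have a unique orientation target."""
    return (yaw + math.pi / 2) % math.pi - math.pi / 2
```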
Training and Evaluation
- WildDet3D-Data is built by generating candidate 3D boxes from 2D annotations and retaining human-verified boxes (a backprojection sketch follows this list).
- The paper reports 22.6 AP3D with text prompts and 24.8 AP3D with box prompts on WildDet3D-Bench.
- On Omni3D, it reports 34.2 AP3D with text prompts and 36.4 AP3D with box prompts.
- In zero-shot evaluation, it reports 40.3 ODS on Argoverse 2 and 48.9 ODS on ScanNet.
- Adding depth cues at inference gives a reported average gain of 20.7 AP across settings.
- The model card reports about 1.2B parameters, so training and inference are compute-heavy.
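Only the geometric lifting step of the data pipeline can be sketched from the description above: backprojecting a 2D box center through the pinhole model using a depth estimate and intrinsics. Function names and values are illustrative.

```python
# Hypothetical sketch of lifting a 2D annotation toward a 3D candidate via
# the pinhole camera model; the actual pipeline and human-verification
# tooling are not specified here.
import numpy as np


def backproject_box_center(box_xyxy, depth_m, fx, fy, cx, cy):
    """Lift a 2D box center (pixels) plus a depth estimate (metres) to a
    camera-centric 3D point: X = (u - cx) * z / fx, Y = (v - cy) * z / fy."""
    u = 0.5 * (box_xyxy[0] + box_xyxy[2])
    v = 0.5 * (box_xyxy[1] + box_xyxy[3])
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])


# Candidate 3D centers like this still need predicted dimensions and
# orientation, then human verification, before entering the training set.
center = backproject_box_center((400, 300, 600, 500), depth_m=12.0,
                                fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
print(center)  # [-5.52 -1.68 12.  ]
```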
Strengths
- Handles multiple interaction modes in one 3D detector.
- Can exploit real depth when available, but does not require it for every image.
- The large, diverse data pipeline directly attacks the long-tail 3D annotation bottleneck.
- Text and box prompts make it compatible with open-vocabulary 2D detectors and human review tools.
- Intrinsics prediction helps with web or ad hoc imagery where camera metadata is missing.
- Strong zero-shot results indicate useful transfer beyond the training domain.
Failure Modes
- Monocular 3D remains sensitive to scale, camera pitch, and object truncation when depth is absent.
- Because optional depth adds a large accuracy gain, camera-only performance may fall short of safety margins.
- Text prompts can match visually similar but operationally distinct objects.
- The roughly 1.2B-parameter system is heavy for edge deployment unless it is distilled or served with batched offline inference.
- Single-image inference does not enforce temporal consistency.
- Licensing and dependency constraints around foundation-model weights must be checked before production use.
Airside AV Fit
- WildDet3D is promising for operator-assisted labeling of rare apron objects from camera footage.
- Text and box prompts can bootstrap 3D annotations for equipment not present in closed-set driving datasets.
- The optional depth path maps well to camera-plus-LiDAR airside vehicles.
- It should be validated on aircraft-scale scenes, low-profile equipment, night glare, rain, and reflective surfaces.
- It is less directly suited to hard real-time safety perception until latency and calibration behavior are measured.
- A practical airside stack would use WildDet3D as an offline mining or assisted-perception module first.
Implementation Notes
- Provide calibrated intrinsics and sparse LiDAR depth whenever possible; do not rely on intrinsics prediction in safety runs.
- Normalize prompts with airport-specific synonyms such as "baggage cart", "dolly", and "ULD cart".
- Store the prompt text, prompt type, depth availability, and camera metadata with each prediction for auditability (see the sketch after this list).
- Add temporal smoothing or tracker association if outputs feed runtime obstacle reasoning.
- Benchmark each prompt mode separately; box-prompt success does not imply text-prompt success.
- Use it to generate candidate labels, then require human or rule-based verification before training production detectors.
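A minimal sketch of the prompt-normalization and audit-record notes above; the synonym map and record fields are illustrative, not a released schema.

```python
# Hypothetical sketch of prompt normalization and per-prediction audit
# records; the synonym map and field names are illustrative assumptions.
from dataclasses import dataclass, asdict
import json

# Airport-specific synonym map: free-text operator input -> canonical prompt.
PROMPT_SYNONYMS = {
    "bag cart": "baggage cart",
    "luggage trolley": "baggage cart",
    "uld": "ULD cart",
    "container dolly": "dolly",
}


def normalize_prompt(text: str) -> str:
    key = text.strip().lower()
    return PROMPT_SYNONYMS.get(key, key)


@dataclass
class AuditRecord:
    prompt_text: str
    prompt_type: str        # "text" | "point" | "box"
    depth_available: bool
    camera_id: str
    intrinsics_source: str  # "calibrated" | "predicted"
    score: float


record = AuditRecord(normalize_prompt("Bag Cart"), "text",
                     depth_available=True, camera_id="apron-cam-07",
                     intrinsics_source="calibrated", score=0.81)
print(json.dumps(asdict(record)))
```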
Sources
- arXiv paper: https://arxiv.org/abs/2604.08626
- Official GitHub repository: https://github.com/allenai/WildDet3D
- Hugging Face model card: https://huggingface.co/allenai/WildDet3D
- Hugging Face dataset page: https://huggingface.co/datasets/allenai/WildDet3D-Data