DetAny3D

What It Is

  • DetAny3D is a promptable monocular 3D detection foundation model.
  • The paper title is "Detect Anything 3D in the Wild".
  • It aims to detect novel 3D objects under arbitrary camera configurations from a single RGB image.
  • It supports prompt-driven detection rather than a fixed closed-set class head.
  • The official repository is released by OpenDriveLab and labels the work as ICCV 2025.
  • The method is designed to transfer broad 2D foundation-model knowledge into 3D detection.
  • The motivating use case is rare-object and open-world 3D detection when annotated 3D data is scarce.

Core Technical Idea

  • Use extensively pretrained 2D foundation models to compensate for limited 3D annotations.
  • Encode the image with complementary 2D backbones that capture low-level promptable detail and high-level geometric structure.
  • Align heterogeneous 2D features through a 2D Aggregator.
  • Convert the fused representation into 3D detection outputs through a 3D Interpreter; the overall flow is sketched after this list.
  • Use Zero-Embedding Mapping to reduce catastrophic forgetting during 2D-to-3D transfer.
  • Accept box, point, text, and optional intrinsic prompts to specify target objects and camera geometry.
  • Train on a unified multi-dataset 3D detection corpus rather than a single benchmark.
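
A minimal dataflow sketch of this idea, with dummy stand-ins so it runs end to end. All function names and tensor shapes here are illustrative assumptions, not the repository's API:

    import torch

    # Hypothetical stand-ins for the real backbones; only shapes matter here.
    def sam_encoder(image, prompts):
        # Low-level, promptable features (SAM-style encoder).
        return torch.randn(1, 196, 256)

    def geo_encoder(image):
        # High-level geometric features (depth-pretrained DINO-style encoder).
        return torch.randn(1, 196, 256)

    def aggregator_2d(f_prompt, f_geo):
        # Placeholder for the 2D Aggregator's cross-attention fusion.
        return f_prompt + f_geo

    def interpreter_3d(fused, prompts):
        # Placeholder for the 3D Interpreter: one 3D box + score per prompt.
        n = len(prompts.get("boxes_2d", []))
        return torch.randn(n, 7), torch.rand(n)  # (x, y, z, w, h, l, yaw), scores

    def detect_anything_3d(image, prompts):
        f_prompt = sam_encoder(image, prompts)
        f_geo = geo_encoder(image)
        fused = aggregator_2d(f_prompt, f_geo)    # align heterogeneous 2D features
        return interpreter_3d(fused, prompts)     # decode into 3D detections

    boxes_3d, scores = detect_anything_3d(
        torch.zeros(1, 3, 448, 448),
        {"boxes_2d": [[100, 120, 220, 260]], "text": "tow bar"},
    )

Zero-Embedding Mapping would sit inside the aggregator/interpreter stack; one possible realization is sketched in the architecture section below.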

Inputs and Outputs

  • Inputs are monocular RGB images.
  • Optional prompts include 2D boxes, points, text prompts, and camera intrinsics.
  • If intrinsics are missing, the model can estimate them and still produce metric-scale 3D detections.
  • Depth files are used in parts of the training setup but are not required for basic inference in the repository.
  • Outputs are 3D bounding boxes and associated detection scores for prompted objects; a sketch of this input/output contract follows the list.
  • Outputs can be evaluated under Omni3D-style or OVMono3D-style 3D detection protocols.
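
A sketch of the input/output contract implied by the bullets above. The field names and the 7-DoF box parameterization are illustrative assumptions; the repository's exact schema may differ:

    from dataclasses import dataclass
    from typing import Optional
    import numpy as np

    @dataclass
    class Prompts:
        boxes_2d: Optional[np.ndarray] = None    # (N, 4) pixel xyxy boxes
        points: Optional[np.ndarray] = None      # (N, 2) pixel coordinates
        text: Optional[str] = None               # open-vocabulary phrase, e.g. "belt loader"
        intrinsics: Optional[np.ndarray] = None  # (3, 3) K; None -> model estimates it

    @dataclass
    class Detection3D:
        center: np.ndarray       # (3,) camera-frame position in meters
        size: np.ndarray         # (3,) object dimensions in meters
        yaw: float               # heading angle in radians
        score: float             # detection confidence in [0, 1]
        intrinsics: np.ndarray   # (3, 3) K actually used (given or estimated)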

Architecture and Evaluation Protocol

  • The paper describes SAM as the low-level promptable visual backbone.
  • It uses depth-pretrained DINO-style features for geometric knowledge.
  • The 2D Aggregator hierarchically aligns low-level and high-level 2D features with cross-attention.
  • The 3D Interpreter maps the aggregated 2D features into 3D box predictions.
  • Zero-Embedding Mapping is used to preserve useful 2D priors while learning 3D-specific representations; one possible realization is sketched after this list.
  • Evaluation separates in-domain, novel-category, and novel-camera-configuration performance.
  • Prompt strategies evaluated include 2D boxes from Grounding DINO, ground-truth 2D boxes, and boxes produced by other detectors.
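
The paper's exact Zero-Embedding Mapping is not reproduced here; the sketch below shows one common way such a connector can be realized (an assumption, analogous to zero-initialized adapters): cross-attention fusion whose output projection starts at zero, so the pretrained 2D stream passes through unchanged at the start of 3D training.

    import torch
    import torch.nn as nn

    class ZeroInitFusion(nn.Module):
        # Hypothetical aggregator block: low-level promptable tokens attend to
        # high-level geometric tokens; the residual branch starts at zero.
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.proj = nn.Linear(dim, dim)
            nn.init.zeros_(self.proj.weight)  # zero-init: no disturbance at step 0
            nn.init.zeros_(self.proj.bias)

        def forward(self, f_low, f_high):
            fused, _ = self.attn(f_low, f_high, f_high)  # cross-attention
            return f_low + self.proj(fused)              # zero-initialized residual

    f_low, f_high = torch.randn(1, 196, 256), torch.randn(1, 196, 256)
    out = ZeroInitFusion()(f_low, f_high)
    # At initialization the pretrained features are preserved exactly,
    # which is one way to limit catastrophic forgetting early in training.
    assert torch.allclose(out, f_low)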

Training and Evaluation

  • The paper builds DA3D, a unified training corpus that aggregates 16 diverse datasets.
  • Datasets referenced in the paper and repo include KITTI, nuScenes, Waymo, Cityscapes3D, 3RScan, Hypersim, Objectron, ARKitScenes, SUN RGB-D, and depth/intrinsic sources.
  • Training standardizes monocular images, camera intrinsics, 3D boxes, and depth maps; a sample-schema sketch follows this list.
  • The repository provides full code, training scripts, inference scripts, and released model weights.
  • The authors report state-of-the-art performance on unseen categories and novel camera configurations.
  • The repo notes that zero-shot evaluation still requires manual integration with external evaluation scripts in some cases.
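
A sketch of what per-sample standardization across heterogeneous datasets might look like. The record fields are illustrative, not the DA3D format itself; the intrinsics helper shows the standard rescaling needed when images are resized:

    from dataclasses import dataclass
    from typing import Optional
    import numpy as np

    @dataclass
    class UnifiedSample:
        image_path: str
        intrinsics: np.ndarray            # (3, 3) pinhole K in pixels
        boxes_3d: np.ndarray              # (N, 7) camera-frame center/size/yaw
        categories: list[str]             # length-N category names
        depth_path: Optional[str] = None  # dense depth, where the source provides it

    def rescale_intrinsics(K: np.ndarray, sx: float, sy: float) -> np.ndarray:
        # Keep K consistent with a resized image: fx, cx scale with width;
        # fy, cy scale with height.
        K = K.copy()
        K[0, 0] *= sx
        K[0, 2] *= sx
        K[1, 1] *= sy
        K[1, 2] *= sy
        return K

    # e.g. a KITTI image resized from 1242x375 to 640x192:
    K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]])
    K_resized = rescale_intrinsics(K, 640 / 1242, 192 / 375)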

Strengths

  • Directly addresses arbitrary camera configuration, a common weakness of monocular 3D detectors.
  • Prompt support makes it useful with upstream 2D open-vocabulary detectors.
  • The feature-transfer design leverages mature 2D foundation models rather than requiring massive native 3D labels.
  • The multi-dataset DA3D setup improves coverage of indoor, outdoor, driving, and object-centric scenes.
  • Optional intrinsics handling makes it more flexible for mixed camera fleets.
  • Official code and weights reduce the barrier to reproduction and fine-tuning.

Failure Modes

  • Monocular depth and scale remain brittle when camera geometry is wrong or visually ambiguous.
  • Prompt quality strongly controls output quality; weak text prompts or noisy 2D boxes propagate their errors into the 3D boxes.
  • The official repo still lists conversion and evaluation simplification tasks as in progress.
  • Training is compute-intensive and depends on many third-party datasets and checkpoints.
  • Foundation-model biases may underrepresent airport-specific equipment and unusual materials.
  • It is a detector, not a tracker or occupancy safety layer.

Airside AV Fit

  • DetAny3D is relevant for rare equipment detection from apron cameras where fixed taxonomies miss objects.
  • Arbitrary-camera handling matters for mixed vehicle cameras, fixed stand cameras, and temporary sensors.
  • Prompted 3D boxes can help bootstrap labels for chocks, tow bars, ULDs, dollies, cones, and belt loaders.
  • Runtime airside use would need measured latency, calibration stability, and failure detection.
  • It should be fused with LiDAR/radar obstacle layers before being trusted for vehicle motion decisions.
  • The best near-term use is offline data mining and human-in-the-loop annotation of open-world objects.

Implementation Notes

  • Pin explicit versions for third-party dependencies and checkpoints such as SAM, UniDepth, Grounding DINO, and DINO.
  • Keep a prompt provenance field for every detection so reviewers know whether a box came from a text, point, or 2D-detector prompt (see the record sketch after this list).
  • Prefer real camera intrinsics where available; treat predicted intrinsics as a fallback.
  • Validate on airport-specific camera heights and lens fields of view before any transfer claims.
  • Run separate tests for novel categories and novel camera configurations because they stress different model assumptions.
  • If used for dataset creation, require human verification of 3D size, orientation, and ground contact.
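
A sketch combining two of the notes above (prompt provenance and the intrinsics fallback) into an auditable detection record; all names here are illustrative:

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional, Tuple
    import numpy as np

    class PromptSource(Enum):
        TEXT = "text"
        POINT = "point"
        GT_BOX_2D = "gt_box_2d"
        DETECTOR_BOX_2D = "detector_box_2d"  # e.g. a Grounding DINO proposal

    class IntrinsicsSource(Enum):
        CALIBRATED = "calibrated"  # real camera calibration (preferred)
        PREDICTED = "predicted"    # model-estimated fallback

    @dataclass
    class AuditableDetection:
        box_3d: np.ndarray               # (7,) camera-frame center/size/yaw
        score: float
        prompt_source: PromptSource      # how this object was requested
        intrinsics_source: IntrinsicsSource
        human_verified: bool = False     # required before use as a training label

    def choose_intrinsics(
        calibrated: Optional[np.ndarray], predicted: np.ndarray
    ) -> Tuple[np.ndarray, IntrinsicsSource]:
        # Prefer real calibration; fall back to the model's estimate.
        if calibrated is not None:
            return calibrated, IntrinsicsSource.CALIBRATED
        return predicted, IntrinsicsSource.PREDICTED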

Sources

Notes compiled from public sources, including the paper "Detect Anything 3D in the Wild" and the official OpenDriveLab repository.