DetAny3D

What It Is

  • DetAny3D is a promptable monocular 3D detection foundation model.
  • The paper title is "Detect Anything 3D in the Wild".
  • It aims to detect novel 3D objects under arbitrary camera configurations from a single RGB image.
  • It supports prompt-driven detection rather than a fixed closed-set class head.
  • The official repository is released by OpenDriveLab and labels the work as ICCV 2025.
  • The method is designed to transfer broad 2D foundation-model knowledge into 3D detection.
  • The motivating use case is rare-object and open-world 3D detection when annotated 3D data is scarce.

Core Technical Idea

  • Use extensively pretrained 2D foundation models to compensate for limited 3D annotations.
  • Encode the image with complementary 2D backbones that capture low-level promptable detail and high-level geometric structure.
  • Align heterogeneous 2D features through a 2D Aggregator.
  • Convert the fused representation into 3D detection outputs through a 3D Interpreter; the overall flow is sketched after this list.
  • Use Zero-Embedding Mapping to reduce catastrophic forgetting during 2D-to-3D transfer.
  • Accept box, point, text, and optional intrinsic prompts to specify target objects and camera geometry.
  • Train on a unified multi-dataset 3D detection corpus rather than a single benchmark.
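
A minimal dataflow sketch of this idea, with dummy stand-ins so it runs end to end. All function names and tensor shapes here are illustrative assumptions, not the repository's API:

    import torch

    # Hypothetical stand-ins for the real backbones; only shapes matter here.
    def sam_encoder(image, prompts):
        # Low-level, promptable features (SAM-style encoder).
        return torch.randn(1, 196, 256)

    def geo_encoder(image):
        # High-level geometric features (depth-pretrained DINO-style encoder).
        return torch.randn(1, 196, 256)

    def aggregator_2d(f_prompt, f_geo):
        # Placeholder for the 2D Aggregator's cross-attention fusion.
        return f_prompt + f_geo

    def interpreter_3d(fused, prompts):
        # Placeholder for the 3D Interpreter: one 3D box + score per prompt.
        n = len(prompts.get("boxes_2d", []))
        return torch.randn(n, 7), torch.rand(n)  # (x, y, z, w, h, l, yaw), scores

    def detect_anything_3d(image, prompts):
        f_prompt = sam_encoder(image, prompts)
        f_geo = geo_encoder(image)
        fused = aggregator_2d(f_prompt, f_geo)    # align heterogeneous 2D features
        return interpreter_3d(fused, prompts)     # decode into 3D detections

    boxes_3d, scores = detect_anything_3d(
        torch.zeros(1, 3, 448, 448),
        {"boxes_2d": [[100, 120, 220, 260]], "text": "tow bar"},
    )

Zero-Embedding Mapping would sit inside the aggregator/interpreter stack; one possible realization is sketched in the architecture section below.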

Inputs and Outputs

  • Inputs are monocular RGB images.
  • Optional prompts include 2D boxes, points, text prompts, and camera intrinsics.
  • If intrinsics are missing, the model can estimate them and still produce metric-scale 3D detections.
  • Depth files are used in parts of the training setup but are not required for basic inference in the repository.
  • Outputs are 3D bounding boxes and associated detection scores for prompted objects; a sketch of this input/output contract follows the list.
  • Outputs can be evaluated under Omni3D-style or OVMono3D-style 3D detection protocols.
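
A sketch of the input/output contract implied by the bullets above. The field names and the 7-DoF box parameterization are illustrative assumptions; the repository's exact schema may differ:

    from dataclasses import dataclass
    from typing import Optional
    import numpy as np

    @dataclass
    class Prompts:
        boxes_2d: Optional[np.ndarray] = None    # (N, 4) pixel xyxy boxes
        points: Optional[np.ndarray] = None      # (N, 2) pixel coordinates
        text: Optional[str] = None               # open-vocabulary phrase, e.g. "belt loader"
        intrinsics: Optional[np.ndarray] = None  # (3, 3) K; None -> model estimates it

    @dataclass
    class Detection3D:
        center: np.ndarray       # (3,) camera-frame position in meters
        size: np.ndarray         # (3,) object dimensions in meters
        yaw: float               # heading angle in radians
        score: float             # detection confidence in [0, 1]
        intrinsics: np.ndarray   # (3, 3) K actually used (given or estimated)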

Architecture and Evaluation Protocol

  • The paper describes SAM as the low-level promptable visual backbone.
  • It uses depth-pretrained DINO-style features for geometric knowledge.
  • The 2D Aggregator hierarchically aligns low-level and high-level 2D features with cross-attention.
  • The 3D Interpreter maps the aggregated 2D features into 3D box predictions.
  • Zero-Embedding Mapping is used to preserve useful 2D priors while learning 3D-specific representations; one possible realization is sketched after this list.
  • Evaluation separates in-domain, novel-category, and novel-camera-configuration performance.
  • Prompt strategies evaluated include 2D boxes from Grounding DINO, ground-truth 2D boxes, and boxes produced by other detectors.
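
The paper's exact Zero-Embedding Mapping is not reproduced here; the sketch below shows one common way such a connector can be realized (an assumption, analogous to zero-initialized adapters): cross-attention fusion whose output projection starts at zero, so the pretrained 2D stream passes through unchanged at the start of 3D training.

    import torch
    import torch.nn as nn

    class ZeroInitFusion(nn.Module):
        # Hypothetical aggregator block: low-level promptable tokens attend to
        # high-level geometric tokens; the residual branch starts at zero.
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.proj = nn.Linear(dim, dim)
            nn.init.zeros_(self.proj.weight)  # zero-init: no disturbance at step 0
            nn.init.zeros_(self.proj.bias)

        def forward(self, f_low, f_high):
            fused, _ = self.attn(f_low, f_high, f_high)  # cross-attention
            return f_low + self.proj(fused)              # zero-initialized residual

    f_low, f_high = torch.randn(1, 196, 256), torch.randn(1, 196, 256)
    out = ZeroInitFusion()(f_low, f_high)
    # At initialization the pretrained features are preserved exactly,
    # which is one way to limit catastrophic forgetting early in training.
    assert torch.allclose(out, f_low)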

Training and Evaluation

  • The paper builds DA3D, a unified training corpus that aggregates 16 diverse datasets.
  • Datasets referenced in the paper and repo include KITTI, nuScenes, Waymo, Cityscapes3D, 3RScan, Hypersim, Objectron, ARKitScenes, SUN RGB-D, and depth/intrinsic sources.
  • Training standardizes monocular images, camera intrinsics, 3D boxes, and depth maps; a sample-schema sketch follows this list.
  • The repository provides full code, training scripts, inference scripts, and released model weights.
  • The authors report state-of-the-art performance on unseen categories and novel camera configurations.
  • The repo notes that zero-shot evaluation still requires manual integration with external evaluation scripts in some cases.
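
A sketch of what per-sample standardization across heterogeneous datasets might look like. The record fields are illustrative, not the DA3D format itself; the intrinsics helper shows the standard rescaling needed when images are resized:

    from dataclasses import dataclass
    from typing import Optional
    import numpy as np

    @dataclass
    class UnifiedSample:
        image_path: str
        intrinsics: np.ndarray            # (3, 3) pinhole K in pixels
        boxes_3d: np.ndarray              # (N, 7) camera-frame center/size/yaw
        categories: list[str]             # length-N category names
        depth_path: Optional[str] = None  # dense depth, where the source provides it

    def rescale_intrinsics(K: np.ndarray, sx: float, sy: float) -> np.ndarray:
        # Keep K consistent with a resized image: fx, cx scale with width;
        # fy, cy scale with height.
        K = K.copy()
        K[0, 0] *= sx
        K[0, 2] *= sx
        K[1, 1] *= sy
        K[1, 2] *= sy
        return K

    # e.g. a KITTI image resized from 1242x375 to 640x192:
    K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]])
    K_resized = rescale_intrinsics(K, 640 / 1242, 192 / 375)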

Strengths

  • Directly addresses arbitrary camera configuration, a common weakness of monocular 3D detectors.
  • Prompt support makes it useful with upstream 2D open-vocabulary detectors.
  • The feature-transfer design leverages mature 2D foundation models rather than requiring massive native 3D labels.
  • The multi-dataset DA3D setup improves coverage of indoor, outdoor, driving, and object-centric scenes.
  • Optional intrinsics handling makes it more flexible for mixed camera fleets.
  • Official code and weights reduce the barrier to reproduction and fine-tuning.

Failure Modes

  • Monocular depth and scale remain brittle when camera geometry is wrong or visually ambiguous.
  • Prompt quality strongly controls output quality; weak text prompts or noisy 2D boxes propagate their errors into the 3D boxes.
  • The official repo still lists conversion and evaluation simplification tasks as in progress.
  • Training is compute-intensive and depends on many third-party datasets and checkpoints.
  • Foundation-model biases may underrepresent airport-specific equipment and unusual materials.
  • It is a detector, not a tracker or occupancy safety layer.

Airside AV Fit

  • DetAny3D is relevant for rare equipment detection from apron cameras where fixed taxonomies miss objects.
  • Arbitrary-camera handling matters for mixed vehicle cameras, fixed stand cameras, and temporary sensors.
  • Prompted 3D boxes can help bootstrap labels for chocks, tow bars, ULDs, dollies, cones, and belt loaders.
  • Runtime airside use would need measured latency, calibration stability, and failure detection.
  • It should be fused with LiDAR/radar obstacle layers before being trusted for vehicle motion decisions.
  • The best near-term use is offline data mining and human-in-the-loop annotation of open-world objects.

Implementation Notes

  • Pin explicit versions for third-party dependencies and checkpoints such as SAM, UniDepth, Grounding DINO, and DINO.
  • Keep a prompt provenance field for every detection so reviewers know whether a box came from a text, point, or 2D-detector prompt (see the record sketch after this list).
  • Prefer real camera intrinsics where available; treat predicted intrinsics as a fallback.
  • Validate on airport-specific camera heights and lens fields of view before any transfer claims.
  • Run separate tests for novel categories and novel camera configurations because they stress different model assumptions.
  • If used for dataset creation, require human verification of 3D size, orientation, and ground contact.
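
A sketch combining two of the notes above (prompt provenance and the intrinsics fallback) into an auditable detection record; all names here are illustrative:

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional, Tuple
    import numpy as np

    class PromptSource(Enum):
        TEXT = "text"
        POINT = "point"
        GT_BOX_2D = "gt_box_2d"
        DETECTOR_BOX_2D = "detector_box_2d"  # e.g. a Grounding DINO proposal

    class IntrinsicsSource(Enum):
        CALIBRATED = "calibrated"  # real camera calibration (preferred)
        PREDICTED = "predicted"    # model-estimated fallback

    @dataclass
    class AuditableDetection:
        box_3d: np.ndarray               # (7,) camera-frame center/size/yaw
        score: float
        prompt_source: PromptSource      # how this object was requested
        intrinsics_source: IntrinsicsSource
        human_verified: bool = False     # required before use as a training label

    def choose_intrinsics(
        calibrated: Optional[np.ndarray], predicted: np.ndarray
    ) -> Tuple[np.ndarray, IntrinsicsSource]:
        # Prefer real calibration; fall back to the model's estimate.
        if calibrated is not None:
            return calibrated, IntrinsicsSource.CALIBRATED
        return predicted, IntrinsicsSource.PREDICTED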

Sources

Notes compiled from public sources, including the paper "Detect Anything 3D in the Wild" and the official OpenDriveLab repository.