The system combines open-vocabulary prompting with monocular 3D geometry reasoning. Text prompts name the categories to detect, point or box prompts give spatial guidance, and an optional depth input can improve the geometric estimates. The technical challenge is to infer 3D position, extent, and object identity from limited visual evidence while staying flexible across categories that are not fixed at training time.
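To make the role of depth concrete, the following is a minimal sketch of standard pinhole back-projection, which lifts a 2D box prompt to a rough 3D center and metric size when a depth value is available. It is not WildDet3D's actual method or API: the names `Intrinsics` and `backproject_box`, the use of a single depth value for the whole box, and the fronto-parallel size approximation are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical helper type)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject_box(box, depth, K):
    """Lift a 2D box (x1, y1, x2, y2) in pixels to a rough 3D estimate.

    Assumes one depth value (in meters) for the whole box and a
    fronto-parallel object, so the result is only a coarse approximation.
    Returns ((X, Y, Z), (width_m, height_m)) in the camera frame.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    x1, y1, x2, y2 = box

    # Box center in pixel coordinates.
    u = 0.5 * (x1 + x2)
    v = 0.5 * (y1 + y2)

    # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    center = ((u - K.cx) * depth / K.fx, (v - K.cy) * depth / K.fy, depth)

    # Pixel extent scaled to meters at the given depth.
    size = ((x2 - x1) * depth / K.fx, (y2 - y1) * depth / K.fy)
    return center, size


if __name__ == "__main__":
    K = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    center, size = backproject_box((270.0, 190.0, 370.0, 290.0), 2.0, K)
    print("center (m):", center)
    print("size (m):", size)
```

Without a depth input, the value of `depth` in this sketch would itself have to be estimated from the image, which is where most of the difficulty in monocular 3D detection lies.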
WildDet3D is relevant to robotics, AR, mapping, embodied AI, and scene-understanding systems. It makes 3D perception more flexible because users can prompt for the objects they care about instead of relying only on a closed detector label set.


