NVIDIA's LocateAnything-3B Delivers One-Step Visual Grounding

LocateAnything-3B is a new vision‑language model from NVIDIA that finds and marks objects, text, or interface elements in images based on simple text prompts. Instead of predicting coordinates token by token, as many earlier models do, it produces complete bounding boxes in a single parallel step. This parallel approach makes visual grounding faster, with up to 2.5× higher throughput than prior methods, while keeping the location data accurate and consistent across complex scenes.
NVIDIA, which also recently released PiD, built LocateAnything-3B primarily as a research tool and released it for non‑commercial use. The model was trained on 12 million images spanning natural scenes, driving, documents, and graphical user interfaces. It aims to give developers and researchers a fast, local alternative for vision tasks that require precise spatial understanding from language instructions.
Parallel box decoding cuts down response time
- Parallel bounding box generation in one step.
- Up to 2.5× higher throughput than prior methods.
- Runs on local GPUs like RTX 4090.
- Detects objects, text, and GUI elements.
- Handles dense, cluttered scenes with many objects.
- Three decoding modes for speed‑accuracy tradeoffs.
- Trained on 12 million diverse images.
- Open‑set detection from natural language prompts.
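To see why emitting a whole box at once can cut response time, here is a toy step count in Python. It is only a sketch under stated assumptions, not NVIDIA's implementation: it assumes each box has four coordinates, that an autoregressive decoder spends one sequential step per coordinate (real tokenizers may use more, plus delimiters), and that a parallel decoder spends one step per box. The function names are hypothetical.

```python
COORDS_PER_BOX = 4  # assumed layout: x_min, y_min, x_max, y_max


def sequential_steps_autoregressive(num_boxes: int) -> int:
    """Assumed cost model: one decoding step per coordinate token."""
    return num_boxes * COORDS_PER_BOX


def sequential_steps_parallel(num_boxes: int) -> int:
    """Assumed cost model: one decoding step per box, all coordinates at once."""
    return num_boxes


if __name__ == "__main__":
    for n in (1, 10, 50):
        ar = sequential_steps_autoregressive(n)
        pb = sequential_steps_parallel(n)
        print(f"{n} boxes: {ar} autoregressive steps vs {pb} parallel steps")
```

The ratio of step counts is not the same thing as measured throughput. Real speedups also depend on prompt processing, non-box tokens, and hardware, which is consistent with the reported figure being "up to 2.5×" rather than the 4× this simplified count would suggest.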
The tool is aimed at researchers and hobbyists who need fast visual localization without sending data to the cloud. Privacy‑conscious professionals can run it entirely on their own hardware, avoiding external API calls. Small agencies can use it to automate annotation, test graphical interfaces, or prototype vision pipelines on consumer‑grade GPUs, provided the work stays within the non‑commercial license.
Developer notes and known limits
The model is licensed for non‑commercial use only, which limits it to academic and research projects for now. Deployment requires a Linux system and a recent NVIDIA GPU; it runs on consumer cards like the RTX 4090 as well as data‑center hardware. TensorRT optimization isn’t available yet. Of the three decoding modes, the hybrid generation mode is recommended for the best mix of speed and accuracy in most situations.
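Before attempting a local setup, it can help to confirm that a machine meets the two stated requirements: Linux and an NVIDIA GPU. The following is a minimal pre-flight sketch using only the Python standard library and the `nvidia-smi` command-line tool that ships with NVIDIA drivers. The function name `preflight` is hypothetical, and the check says nothing about whether a detected GPU is recent enough or has enough memory for the model.

```python
import platform
import shutil
import subprocess


def preflight() -> dict:
    """Report whether this host looks like Linux with an NVIDIA GPU visible."""
    is_linux = platform.system() == "Linux"
    smi = shutil.which("nvidia-smi")  # None if the driver tools are not on PATH
    gpus = []
    if smi is not None:
        try:
            out = subprocess.run(
                [smi, "--query-gpu=name", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10, check=True,
            ).stdout
            gpus = [line.strip() for line in out.splitlines() if line.strip()]
        except (subprocess.SubprocessError, OSError):
            gpus = []  # tool present but the query failed; treat as no GPU found
    return {"linux": is_linux, "nvidia_smi_found": smi is not None, "gpus": gpus}


if __name__ == "__main__":
    print(preflight())
```

An empty `gpus` list or `linux: False` suggests the machine does not meet the requirements described above.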
“Its core innovation, Parallel Box Decoding (PBD), predicts complete bounding box coordinates in a single parallel step rather than autoregressive token-by-token decoding, improving efficiency while preserving geometric consistency.” — Source: Hugging Face