RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

Isaac Robinson, Peter Robicheaux, Matvei Popov, Deva Ramanan, Neehar Peri

2025-11-17

Summary

This paper introduces RF-DETR, a real-time object detection method designed to perform well on objects and domains it hasn't been specifically trained on, while staying fast and efficient.

What's the problem?

Current object detection systems, especially those that can identify a wide range of objects (open-vocabulary detectors), often struggle when applied to new, real-world situations. These situations frequently contain objects or categories the system wasn't originally trained to recognize. Simply retraining a large, complex model for each new situation is time-consuming and resource-intensive, and existing methods aren't always good at balancing speed and accuracy when adapting to these new scenarios.

What's the solution?

The researchers developed RF-DETR, a smaller, specialized detection system built on top of a pre-trained base network. Instead of fully retraining the entire model for each candidate design, RF-DETR fine-tunes the base network once on the target dataset, then searches through thousands of possible configurations of its internal structure, all of which share that one set of trained weights. Because evaluating a configuration doesn't require retraining from scratch, the search is fast, and it maps out the trade-off curve between detection accuracy and inference speed. The researchers also refined which architectural "knobs" the search tunes, making the approach transfer better to different types of data.
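The core idea, evaluating many weight-sharing configurations without retraining and keeping only the best accuracy-latency trade-offs, can be sketched as follows. This is an illustrative toy, not the RF-DETR code: the `evaluate` function and its numbers are invented stand-ins for running a sub-network sliced from the shared supernet weights through validation and a latency benchmark.

```python
def evaluate(depth, width):
    """Toy stand-in for measuring one sub-network sliced from shared weights.
    Deeper/wider -> more accurate but slower; a real system would run the
    validation set (accuracy) and a hardware benchmark (latency) instead."""
    accuracy = 30.0 + 4.0 * depth + 1.5 * width        # proxy AP score
    latency = 2.0 + 1.2 * depth + 0.8 * width * depth  # proxy milliseconds
    return accuracy, latency

def pareto_front(results):
    """Keep configs not dominated by any other: a config is dropped only if
    some other config is at least as accurate AND at least as fast, and
    strictly better on one of the two."""
    front = []
    for cfg, (acc, lat) in results.items():
        dominated = any(
            a >= acc and l <= lat and (a > acc or l < lat)
            for c, (a, l) in results.items()
            if c != cfg
        )
        if not dominated:
            front.append(cfg)
    return front

# Enumerate a small search space of (decoder depth, embedding width) choices;
# no retraining happens per config -- each is just measured.
search_space = [(d, w) for d in (2, 4, 6) for w in (1, 2, 3)]
results = {cfg: evaluate(*cfg) for cfg in search_space}
best = pareto_front(results)
```

With this proxy, the config `(4, 3)` is dominated (another config is both more accurate and faster), so it is dropped from the front; a deployment would then pick whichever surviving config fits its latency budget.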

Why it matters?

RF-DETR represents a significant step forward in making object detection more practical for real-world applications. It achieves higher accuracy and faster speeds than previous methods on standard datasets like COCO and Roboflow100-VL, and is the first real-time detector to exceed 60 AP on COCO. This means it can identify objects more reliably and quickly, which is crucial for applications like robotics, self-driving cars, and automated surveillance.

Abstract

Open-vocabulary detectors achieve impressive performance on COCO, but often fail to generalize to real-world datasets with out-of-distribution classes not typically found in their pre-training. Rather than simply fine-tuning a heavy-weight vision-language model (VLM) for new domains, we introduce RF-DETR, a light-weight specialist detection transformer that discovers accuracy-latency Pareto curves for any target dataset with weight-sharing neural architecture search (NAS). Our approach fine-tunes a pre-trained base network on a target dataset and evaluates thousands of network configurations with different accuracy-latency tradeoffs without re-training. Further, we revisit the "tunable knobs" for NAS to improve the transferability of DETRs to diverse target domains. Notably, RF-DETR significantly improves on prior state-of-the-art real-time methods on COCO and Roboflow100-VL. RF-DETR (nano) achieves 48.0 AP on COCO, beating D-FINE (nano) by 5.3 AP at similar latency, and RF-DETR (2x-large) outperforms GroundingDINO (tiny) by 1.2 AP on Roboflow100-VL while running 20x as fast. To the best of our knowledge, RF-DETR (2x-large) is the first real-time detector to surpass 60 AP on COCO. Our code is at https://github.com/roboflow/rf-detr