RF-DETR Object Detection vs YOLOv12 : A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity
Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, Manoj Karkee
2025-04-22
Summary
This paper compares two AI models, RF-DETR and YOLOv12, to see which is better at finding and identifying green fruits in orchards, even when the fruits are hidden by leaves or otherwise hard to see.
What's the problem?
In real orchards, fruits are often covered by leaves, blend into the background, or are labeled inconsistently. This makes it very challenging for computer models to spot and correctly identify them, especially when multiple fruit types are present or some fruits are partly hidden.
What's the solution?
The researchers tested both RF-DETR, a transformer-based design that is good at understanding the whole scene, and YOLOv12, a CNN-based design that is fast and efficient. They built a custom dataset covering both single-class and multi-class fruit scenarios, including cases where fruits were occluded or hard to see. RF-DETR proved better at finding and telling apart different fruits, especially in complicated scenes, while YOLOv12 was faster and better suited to situations where speed matters more than peak accuracy.
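To make "better at finding fruits" concrete: detection comparisons like this one are typically scored by matching each model's predicted boxes against ground-truth boxes using Intersection-over-Union (IoU). The sketch below is a generic illustration of that scoring idea, not code from the paper; the box format and the 0.5 threshold are common conventions, assumed here for illustration.

```python
# Minimal sketch of IoU-based detection scoring (illustrative only,
# not the paper's evaluation code). Boxes are (x1, y1, x2, y2).

def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def count_true_positives(predictions, ground_truth, iou_threshold=0.5):
    """Greedily match predictions to ground-truth boxes one-to-one;
    a prediction counts as a true positive if it overlaps an unmatched
    ground-truth box with IoU at or above the threshold."""
    matched = set()
    true_positives = 0
    for pred in predictions:
        best_idx, best_iou = None, iou_threshold
        for i, gt in enumerate(ground_truth):
            if i in matched:
                continue
            overlap = iou(pred, gt)
            if overlap >= best_iou:
                best_idx, best_iou = i, overlap
        if best_idx is not None:
            matched.add(best_idx)
            true_positives += 1
    return true_positives
```

A model that handles occlusion well (as RF-DETR reportedly does here) would score more true positives on images where fruits are partly hidden, since its predicted boxes still overlap the ground truth closely enough to clear the IoU threshold.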
Why it matters?
This matters because using the right model can help farmers and researchers accurately count and monitor fruits in orchards, leading to better crop management and harvests, and it also shows how different AI designs have their own strengths depending on the real-world challenges they face.
Abstract
RF-DETR outperforms YOLOv12 in detecting greenfruits in complex orchard environments with label ambiguity, occlusions, and background blending, particularly in multi-class scenarios.