SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation
Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin
2025-10-08
Summary
This paper tackles the challenge of getting enough good data to train computer vision models, specifically those that detect objects in images. It introduces a new way to measure how useful synthetic (computer-generated) data is for improving these models.
What's the problem?
Machine learning models need lots of data to learn effectively, but collecting enough accurately labeled real-world data is hard and expensive. Researchers are turning to synthetic data instead, but it's difficult to know whether that synthetic data is actually *good* enough to help a model learn. Existing methods for checking synthetic data quality often require fully training a model, which takes a lot of time and computing power.
What's the solution?
The researchers developed a new metric called the Synthetic Dataset Quality Metric, or SDQM. This metric assesses the quality of synthetic data for object detection *without* needing to train a model to convergence. It examines characteristics of the synthetic data itself to predict how well a model trained on it will perform. In their experiments, SDQM correlated strongly with the performance of YOLOv11, a state-of-the-art object detection model, while previous metrics showed only moderate or weak correlations.
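To make the validation idea concrete, here is a minimal sketch of how one can check whether a dataset-quality metric tracks downstream detector performance: score several synthetic datasets with the metric, train a detector on each, and correlate the metric scores with the resulting mAP values. The numbers below are purely illustrative (not from the paper), and the `pearson` helper is a hypothetical stand-in for whatever correlation routine one prefers.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative numbers only: quality scores assigned to five synthetic
# datasets, and the mAP a detector reached after training on each one.
quality_scores = [0.42, 0.55, 0.61, 0.70, 0.83]
map_scores     = [0.31, 0.38, 0.45, 0.52, 0.66]

r = pearson(quality_scores, map_scores)
print(f"Pearson r = {r:.3f}")
```

A metric whose scores yield r close to 1 lets you rank candidate synthetic datasets without training a model to convergence on each, which is the efficiency gain the paper claims for SDQM.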
Why does it matter?
SDQM provides a faster, more efficient way to evaluate synthetic data. Researchers can quickly iterate on their synthetic datasets, leading to better-performing object detection models, especially in situations where real-world data is difficult or expensive to obtain. It also reduces the need for repeated, time-consuming model training just to test data quality.
Abstract
The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. This paper introduces the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model, while previous metrics only exhibited moderate or weak correlations. Additionally, it provides actionable insights for improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM