Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis
Jing Hao, Yuxuan Fan, Yanpeng Sun, Kaixin Guo, Lizhuo Lin, Jinrong Yang, Qi Yong H. Ai, Lun M. Wong, Hao Tang, Kuo Feng Hung
2025-09-12
Summary
This paper introduces a new dataset and benchmark, called MMOral, specifically designed to test and improve how well artificial intelligence understands dental X-rays, and presents a new model, OralGPT, built using this dataset.
What's the problem?
Current large vision-language models (LVLMs) perform well on general medical tasks, but they struggle with the specific challenges of interpreting dental panoramic X-rays. These X-rays are complex: they show many overlapping anatomical structures and often contain subtle signs of disease that existing AI models miss because they haven't been trained on enough relevant data. Until now, there was no large, specialized dataset available for training and evaluating AI in this area.
What's the solution?
The researchers created MMOral, a large dataset containing over 20,000 dental panoramic X-rays paired with 1.3 million instruction-following instances describing what to look for in the images. They also built a benchmark, MMOral-Bench, to rigorously test AI models on key dental diagnostic skills. To demonstrate the dataset's usefulness, they then fine-tuned an existing model, Qwen2.5-VL-7B, on MMOral, creating a new model called OralGPT, which improved by 24.73% on the benchmark after a single epoch of fine-tuning.
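To make the idea of an "instruction-following instance" concrete, here is a minimal sketch of what one image-grounded record might look like. The field names, file path, and example dialogue are illustrative assumptions for this sketch, not the dataset's actual schema:

```python
import json

# Hypothetical MMOral-style record: an image paired with a task type
# and a user/assistant exchange. All field names are assumptions.
record = {
    "image": "panoramic_00042.png",       # path to a panoramic X-ray (hypothetical)
    "task": "visual_question_answering",  # other task types: attribute extraction,
                                          # report generation, image-grounded dialogue
    "conversation": [
        {"role": "user",
         "content": "Are there any impacted teeth visible in this image?"},
        {"role": "assistant",
         "content": "Yes, the lower-left third molar appears impacted."},
    ],
}

def is_valid(rec: dict) -> bool:
    """Minimal structural check for a record of this hypothetical schema."""
    if not {"image", "task", "conversation"} <= rec.keys():
        return False
    return all({"role", "content"} <= turn.keys() for turn in rec["conversation"])

print(is_valid(record))       # True
serialized = json.dumps(record)  # such records are typically stored as JSON lines
```

Pairing each image with many such records is what lets a single set of ~20,000 X-rays yield over a million training instances.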
Why it matters?
This work is important because it provides the tools needed to develop AI systems that can assist dentists in diagnosing problems more accurately and efficiently. By focusing on the unique challenges of dental X-rays, this research paves the way for more effective AI applications in dentistry, potentially leading to earlier detection of diseases and better patient care.
Abstract
Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.