
Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning

Ge-Peng Ji, Jingyi Liu, Deng-Ping Fan, Nick Barnes

2025-12-08

Summary

This research introduces Colon-X, a project focused on improving how computers understand and assist with colonoscopies using both images and text, with the goal of moving beyond simple identification toward genuine clinical reasoning.

What's the problem?

Currently, even the most advanced AI models are not very reliable at analyzing colonoscopy images and answering questions about them, especially when their inputs are slightly perturbed. They can identify things, but they struggle to truly *understand* what they're seeing and make sound clinical judgments the way a doctor would. There is also a lack of good datasets specifically designed to test and train this kind of reasoning ability in colonoscopy.

What's the solution?

The researchers created two new datasets: ColonVQA, a huge collection of questions and answers about colonoscopy images, and ColonReason, a dataset focused on clinical reasoning, built through a debate-style annotation process involving multiple medical experts. They then developed a new AI model called ColonR1, specifically designed for this reasoning task and trained in the "R1" style, using task-adaptive rewards and gradient-stable optimization so it can learn even with limited data. Under these data-scarce conditions, ColonR1 reached 56.61% overall accuracy, outperforming standard supervised fine-tuning by 25.22%.
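To make the idea of "task-adaptive rewarding" concrete, here is a minimal, purely illustrative sketch of how an R1-style reward might adapt to different colonoscopy tasks: closed-form tasks (such as classifying a finding) get an exact-match reward, while open-ended tasks get a softer token-overlap reward, plus a small bonus for exposing reasoning before the answer. All function names, weights, and the reward design below are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of task-adaptive rewarding for R1-style training.
# None of these names or weights come from the Colon-X paper.

def format_reward(response: str) -> float:
    """Reward responses that expose reasoning before the final answer,
    mirroring the <think>...</think><answer>...</answer> convention
    common in R1-style training."""
    return 1.0 if "<think>" in response and "<answer>" in response else 0.0

def accuracy_reward(task: str, prediction: str, target: str) -> float:
    """Task-adaptive accuracy term: exact match for closed-form tasks
    (e.g., classifying a finding), token-level F1 for open-ended ones."""
    if task == "classification":
        return 1.0 if prediction.strip().lower() == target.strip().lower() else 0.0
    pred_tokens = prediction.lower().split()
    tgt_tokens = target.lower().split()
    common = set(pred_tokens) & set(tgt_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(tgt_tokens)
    return 2 * precision * recall / (precision + recall)

def total_reward(task: str, response: str, prediction: str, target: str,
                 w_fmt: float = 0.1, w_acc: float = 0.9) -> float:
    """Weighted sum of format and task-adaptive accuracy rewards."""
    return (w_fmt * format_reward(response)
            + w_acc * accuracy_reward(task, prediction, target))
```

In a reinforcement-learning loop, a scalar reward like this would score each sampled model response, and the policy would be updated to favor higher-reward outputs.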

Why it matters?

This work is important because it provides the tools and a starting point for building AI systems that can assist doctors during colonoscopies, potentially improving the accuracy of diagnoses and making the process more efficient. By making the datasets and model publicly available, they're encouraging further research and development in this critical area of healthcare.

Abstract

In this study, we present Colon-X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community-wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy: evolving from multimodal understanding to clinical reasoning. (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models (MLLMs) and examine their reliability under human-induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning-centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and develop ColonR1, the first R1-styled model incorporating task-adaptive rewarding and gradient-stable optimization techniques. Under data-scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at https://github.com/ai4colonoscopy/Colon-X.