O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning

Zhongzhen Huang, Gui Geng, Shengyi Hua, Zhen Huang, Haoyang Zou, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang

2025-01-14

Summary

This paper is about making AI better at medical reasoning by giving it more time to think before answering. The researchers tested this idea on medical tasks of varying difficulty and found that extra thinking time consistently leads to better decisions, especially on harder medical problems.

What's the problem?

AI models are getting good at many things, but they still struggle with complex medical tasks like diagnosing diseases or planning treatments. These tasks require careful, step-by-step thinking and weighing many different factors, which is hard for an AI that is expected to answer quickly.

What's the solution?

The researchers used a method called 'inference-time scaling,' which means letting the AI spend more time, and generate longer chains of reasoning, before committing to an answer. They tested this on medical benchmarks of varying difficulty (MedQA, Medbullets, and JAMA Clinical Challenges). Even with a small training set of only 500 examples, more thinking time improved the model's performance by 6%-11%. They also found that harder problems needed longer reasoning chains, just as a doctor might need more time for a complicated case. The AI even learned to list possible diagnoses and systematically narrow them down, similar to how real doctors think.
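
To make 'more thinking time' concrete, here is a minimal, hypothetical sketch (not the authors' code): the same question is asked under increasing generation budgets, since a larger budget lets the model write a longer reasoning chain before committing to an answer. The model name, prompt, and budget values are illustrative assumptions.

```python
# Minimal sketch of inference-time scaling with a Hugging Face chat model.
# The model name, prompt, and budgets are illustrative assumptions, not
# the authors' setup. A larger max_new_tokens budget allows a longer
# reasoning chain before the final answer.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # assumption: any instruct LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

question = (
    "A 54-year-old presents with chest pain radiating to the left arm. "
    "Reason step by step through the differential diagnosis, "
    "then state the single most likely diagnosis."
)

for budget in (256, 1024, 4096):
    # Build the chat prompt and cap generation at the current budget.
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=budget)
    # Strip the prompt tokens; keep only the newly generated reasoning.
    answer = tokenizer.decode(output[0, inputs.shape[-1]:],
                              skip_special_tokens=True)
    print(f"--- thinking budget: {budget} tokens ---\n{answer}\n")
```

This matches the paper's second finding: harder tasks need longer reasoning chains, so small budgets are expected to hurt them most.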

Why it matters?

This research matters because it could help make AI a more useful tool for doctors. If AI can think more like a human doctor and explain its reasoning, it could help diagnose diseases more accurately or suggest better treatments. This could be especially helpful for complicated medical cases where even experienced doctors might need extra help. In the future, this kind of AI could work alongside doctors to improve patient care and maybe even save lives.

Abstract

Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks, ranging from diagnostic decision-making to treatment planning. Through extensive experiments on medical benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical Challenges), our investigation reveals several key insights: (1) Increasing inference time does lead to improved performance. With a modest training set of 500 samples, our model yields substantial performance improvements of 6%-11%. (2) Task complexity directly correlates with the required length of reasoning chains, confirming the necessity of extended thought processes for challenging problems. (3) The differential diagnoses generated by our model adhere to the principles of the hypothetico-deductive method, producing a list of potential conditions that may explain a patient's symptoms and systematically narrowing these possibilities by evaluating the evidence. These findings demonstrate the promising synergy between inference-time scaling and journey learning in advancing LLMs' real-world clinical reasoning capabilities.
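
To ground the third finding, here is a toy sketch of the hypothetico-deductive loop (all clinical details below are invented for illustration, not taken from the paper): start from a differential of candidate diagnoses, then eliminate candidates as each piece of evidence arrives.

```python
# Toy hypothetico-deductive loop: propose a differential, then eliminate
# candidates as each piece of evidence arrives. All findings and rule-out
# mappings here are invented for illustration.
candidates = {"myocardial infarction", "pulmonary embolism",
              "GERD", "costochondritis"}

# Each finding is paired with the diagnoses it argues against (assumptions).
evidence = [
    ("elevated troponin", {"GERD", "costochondritis"}),
    ("no tachycardia or hypoxia", {"pulmonary embolism"}),
]

for finding, ruled_out in evidence:
    candidates -= ruled_out  # systematically narrow the differential
    print(f"after '{finding}': {sorted(candidates)}")
# The surviving candidate mirrors the final diagnosis that a long
# reasoning chain would converge on, step by step.
```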