From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond
Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz
2024-11-07

Summary
This paper explores run-time strategies for improving how large language models (LLMs) perform on medical tasks, focusing on OpenAI's o1-preview, a model that reasons at inference time, and comparing it to earlier prompt-based steering methods such as Medprompt applied to GPT-4.
What's the problem?
Large language models often struggle with complex tasks, especially in specialized fields like medicine, where accurate and reliable answers are crucial. Prompt-based steering techniques such as Medprompt, which combine few-shot examples, chain-of-thought reasoning, and ensembling, can raise accuracy, but they add cost and complexity, and it has been unclear whether they still help newer models that are designed to reason at inference time.
What's the solution?
The authors evaluate OpenAI's o1-preview, a model with built-in run-time reasoning that thinks through problems before generating final answers. They test it on a diverse set of medical benchmarks and find that, even without additional prompting techniques, it largely outperforms the GPT-4 series steered with Medprompt. They also analyze how classic prompting strategies interact with the new model, finding that few-shot prompting actually hurts o1-preview's performance, while ensembling still helps but is resource-intensive and requires careful cost-performance trade-offs.
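
To make the run-time strategies concrete, here is a minimal sketch of the kind of choice-shuffle ensembling Medprompt uses: the same multiple-choice question is asked several times with shuffled answer options and a chain-of-thought instruction, and the final answer is chosen by majority vote. The `ask_model` helper is a hypothetical stand-in for a chat-completion call; the paper's actual prompts, shuffling, and few-shot example selection are more elaborate.

```python
import random
from collections import Counter

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call (e.g., to GPT-4o).
    Assumed to return the model's final answer as a single letter such as 'A'."""
    raise NotImplementedError

def choice_shuffle_ensemble(question: str, choices: dict[str, str], n_votes: int = 5) -> str:
    """Simplified Medprompt-style run-time strategy: ask the same
    multiple-choice question several times with shuffled answer options
    and a chain-of-thought instruction, then take a majority vote."""
    letters = list(choices)  # e.g. ['A', 'B', 'C', 'D']
    votes: list[str] = []
    for _ in range(n_votes):
        shuffled = random.sample(letters, len(letters))
        # Presented letter -> original letter whose text is shown in that slot.
        presented = dict(zip(letters, shuffled))
        options = "\n".join(f"{slot}. {choices[orig]}" for slot, orig in presented.items())
        prompt = (
            f"{question}\n{options}\n"
            "Think step by step, then answer with a single letter."
        )
        reply = ask_model(prompt).strip().upper()[:1]
        if reply in presented:
            votes.append(presented[reply])  # record the vote under the original labeling
    return Counter(votes).most_common(1)[0][0]
```

Shuffling the answer order between calls reduces position bias, and the majority vote is what makes the ensemble more robust, at the price of several model calls per question.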
Why it matters?
This research is important because it represents a step forward in developing AI systems that can assist in medical decision-making. By improving the accuracy and reliability of LLMs in healthcare, these advancements could lead to better patient care and more effective use of AI in clinical settings.
Abstract
Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.
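
As a small illustration of the cost-accuracy analysis mentioned in the abstract, the sketch below computes a Pareto frontier over (cost, accuracy) points for different model and strategy combinations. The strategy names mirror those discussed above, but the numbers are placeholders for illustration, not results from the paper.

```python
def pareto_frontier(points):
    """Return the (name, cost, accuracy) points that are not dominated by
    any alternative that is cheaper (or equally priced) and more accurate."""
    frontier = []
    # Sort by cost ascending; break cost ties by accuracy descending.
    for name, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, cost, acc))
    return frontier

# Placeholder numbers for illustration only; not measurements from the paper.
strategies = [
    ("GPT-4o zero-shot",        0.5,  0.88),
    ("GPT-4o + Medprompt",      5.0,  0.92),
    ("o1-preview zero-shot",    8.0,  0.95),
    ("o1-preview + ensembling", 40.0, 0.96),
]
print(pareto_frontier(strategies))
```

Points on the frontier are the rational choices under a cost budget; anything off the frontier is beaten by a cheaper or more accurate alternative, which is the trade-off the paper's GPT-4o versus o1-preview comparison highlights.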