
Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs

Clément Christophe, Tathagata Raha, Svetlana Maslenkova, Muhammad Umar Salman, Praveen K Kanithi, Marco AF Pimentel, Shadab Khan

2024-09-24


Summary

This paper investigates how to adapt large language models (LLMs) for clinical use. The authors compare four techniques, continuous pretraining on clinical text, instruct fine-tuning, NEFTune, and prompt engineering, applied to the Mistral 7B and Mixtral 8x7B models, and measure how each one affects performance on clinical tasks.

What's the problem?

General-purpose LLMs show strong abilities on everyday tasks, but they are not trained specifically on clinical data, which limits how reliably they can support tasks such as answering medical questions or assisting with clinical decisions. Instruct fine-tuning is the standard way to specialize a model, yet it is unclear how much extra benefit other adaptation strategies provide, such as continuing pretraining on large amounts of clinical text, adding noise during fine-tuning, or careful prompt design, and how these techniques should be combined.

What's the solution?

To address these questions, the researchers adapted the Mistral 7B and Mixtral 8x7B models using a 50-billion-token clinical pretraining corpus and a 500-million-token instruct fine-tuning dataset, then evaluated the resulting models across a range of clinical tasks. They found that continuous pretraining on its own yields only marginal improvements, but it establishes a strong foundation for instruct fine-tuning. NEFTune, which adds noise to token embeddings during fine-tuning and was designed mainly to improve generation quality, surprisingly delivers additional gains on their benchmark (a minimal sketch of the idea follows below). Finally, complex prompt engineering methods push performance further still.
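
Since NEFTune is central to the results, here is a minimal PyTorch sketch of the underlying idea: perturb the input embeddings with small uniform noise during training. This is not the paper's training code; the embedding sizes and the noise_alpha value are illustrative assumptions.

```python
# Minimal NEFTune-style sketch: add uniform noise to token embeddings while
# training. Layer sizes and noise_alpha are illustrative, not the paper's settings.
import math
from functools import partial

import torch
import torch.nn as nn

def neftune_hook(module, inputs, output, noise_alpha=5.0):
    # Perturb embeddings only in training mode; inference stays noise-free.
    if module.training:
        seq_len, dim = output.shape[-2], output.shape[-1]
        scale = noise_alpha / math.sqrt(seq_len * dim)
        output = output + torch.empty_like(output).uniform_(-scale, scale)
    return output

# Toy embedding layer standing in for an LLM's input embedding matrix.
embeddings = nn.Embedding(num_embeddings=32000, embedding_dim=4096)
embeddings.train()
embeddings.register_forward_hook(partial(neftune_hook, noise_alpha=5.0))

token_ids = torch.randint(0, 32000, (2, 128))  # (batch, sequence length)
noisy_embeds = embeddings(token_ids)           # noise is injected on the fly
```

Because the noise is scaled by 1/sqrt(sequence length x embedding dimension), its magnitude stays comparable across different sequence lengths and model sizes.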

Why it matters?

This research is significant because it shows that how an LLM is adapted to the clinical domain can matter as much as which model is used. Knowing which combination of pretraining, fine-tuning, and prompting techniques works best paves the way for more accurate and reliable AI tools that could assist healthcare professionals in making better decisions, ultimately improving patient care and outcomes.

Abstract

Large Language Models (LLMs) have demonstrated significant potential in transforming clinical applications. In this study, we investigate the efficacy of four techniques in adapting LLMs for clinical use-cases: continuous pretraining, instruct fine-tuning, NEFTune, and prompt engineering. We employ these methods on Mistral 7B and Mixtral 8x7B models, leveraging a large-scale clinical pretraining dataset of 50 billion tokens and an instruct fine-tuning dataset of 500 million tokens. Our evaluation across various clinical tasks reveals the impact of each technique. While continuous pretraining beyond 250 billion tokens yields marginal improvements on its own, it establishes a strong foundation for instruct fine-tuning. Notably, NEFTune, designed primarily to enhance generation quality, surprisingly demonstrates additional gains on our benchmark. Complex prompt engineering methods further enhance performance. These findings show the importance of tailoring fine-tuning strategies and exploring innovative techniques to optimize LLM performance in the clinical domain.
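
The abstract also credits complex prompt engineering with further gains. The sketch below shows what a chain-of-thought style prompt for a clinical multiple-choice question could look like; the template wording and the example question are illustrative assumptions, not the prompts used in the paper.

```python
# Hypothetical chain-of-thought prompt template for a clinical multiple-choice
# question; the wording is illustrative, not the paper's actual prompt.
COT_TEMPLATE = """You are a clinical assistant. Read the question, reason step
by step about the relevant findings, then give the single best answer.

Question: {question}
Options:
{options}

Let's think step by step:"""

def build_prompt(question: str, options: list[str]) -> str:
    formatted = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    return COT_TEMPLATE.format(question=question, options=formatted)

print(build_prompt(
    "A patient presents with sudden chest pain radiating to the left arm. "
    "Which investigation should be ordered first?",
    ["Chest X-ray", "Electrocardiogram (ECG)", "Abdominal ultrasound", "Spirometry"],
))
```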