
A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems

Florin Cuconasu, Giovanni Trappolini, Nicola Tonellotto, Fabrizio Silvestri

2024-06-24

Summary

This paper examines how different types of large language models (LLMs) perform in Retrieval Augmented Generation (RAG), a setup that combines a retrieval step, which fetches relevant documents, with a generation step that produces the response. It finds that base models, which have not been fine-tuned to follow instructions, actually outperform their instructed counterparts on RAG tasks.
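To make the pipeline concrete, here is a minimal sketch of a RAG loop with a base model: retrieve the best-matching passage, prepend it to the question as a plain completion prompt, and generate. This is an illustration only, not the paper's code; the embedding model and base LLM named here are placeholder assumptions.

```python
# Minimal RAG sketch (illustrative only, not the paper's code).
# Assumes sentence-transformers and transformers are installed;
# the model names are placeholder choices, not the paper's models.
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer

corpus = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
]
question = "When was the Eiffel Tower completed?"

# Retrieval phase: embed the corpus and the query, keep the top passage.
retriever = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(retriever.encode(question), retriever.encode(corpus))[0]
context = corpus[int(scores.argmax())]

# Generation phase: condition a causal LM on the retrieved context.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # a base model
lm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
inputs = tok(prompt, return_tensors="pt")
output = lm.generate(**inputs, max_new_tokens=32)
# Decode only the newly generated tokens, not the echoed prompt.
print(tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```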

What's the problem?

Many people assume that instructed LLMs, which are fine-tuned to follow specific instructions, are always the better choice for tasks that combine retrieving and generating information. This assumption, however, has rarely been tested directly, and acting on it can lead to misunderstandings about what different types of models can actually do.

What's the solution?

The researchers ran experiments comparing base LLMs and instructed LLMs on RAG tasks and found that, under their experimental settings, base models outperformed instructed ones by 20% on average. This result suggests that defaulting to instructed models is not always the best choice, and the study calls for further investigation into how these models behave and how effective they are across applications. In practice, the two setups differ mainly in how the prompt is constructed, as sketched below.
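The sketch below contrasts the two prompting styles: a base model receives a raw completion prompt, while an instructed model expects its chat template. The model name is a placeholder assumption, and this is not the paper's evaluation harness.

```python
# How the same RAG query might be fed to a base model versus an
# instruction-tuned model. Illustrative assumptions only: the model name
# is a placeholder and this is not the paper's evaluation code.
from transformers import AutoTokenizer

context = "The Eiffel Tower is in Paris and was completed in 1889."
question = "When was the Eiffel Tower completed?"

# Base model: a plain completion prompt, no chat formatting.
base_prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"

# Instruct model: the same content wrapped in the model's chat template.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": f"Use the context to answer.\n"
                                 f"Context: {context}\nQuestion: {question}"}],
    tokenize=False,
    add_generation_prompt=True,
)
# Each prompt then goes to its respective model's generate() call;
# answers are typically scored by matching against gold answers.
```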

Why it matters?

This research is important because it challenges common beliefs about model training and performance in AI systems. By highlighting the strengths of base models, it opens the door for more effective use of LLMs in RAG applications, potentially leading to improved accuracy and reliability in AI-generated responses.

Abstract

Retrieval Augmented Generation (RAG) represents a significant advancement in artificial intelligence, combining a retrieval phase with a generative phase, with the latter typically being powered by large language models (LLMs). The current common practices in RAG involve using "instructed" LLMs, which are fine-tuned with supervised training to enhance their ability to follow instructions and are aligned with human preferences using state-of-the-art techniques. Contrary to popular belief, our study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under our experimental settings. This finding challenges the prevailing assumptions about the superiority of instructed LLMs in RAG applications. Further investigations reveal a more nuanced situation, questioning fundamental aspects of RAG and suggesting the need for broader discussions on the topic; or, as Fromm would have it, "Seldom is a glance at the statistics enough to understand the meaning of the figures".