
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain

Shuting Wang, Jiejun Tan, Zhicheng Dou, Ji-Rong Wen

2024-12-18

Summary

This paper introduces OmniEval, a new benchmark for evaluating Retrieval-Augmented Generation (RAG) systems in the financial domain, with the goal of making their performance and reliability easier to measure and improve.

What's the problem?

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by letting them pull in relevant information from external sources. However, evaluating how well RAG systems perform in specialized areas like finance has been difficult. Current evaluation methods often miss important details, such as how systems handle different query types and topics, making it hard to judge how effective they really are in real-world situations.

What's the solution?

OmniEval addresses this problem with a comprehensive evaluation framework. It categorizes financial queries into five task classes and 16 topics, generates evaluation data by combining GPT-4-based automatic generation with human annotation, and scores both the retrieval of information and the generation of responses using rule-based and LLM-based metrics. This multi-dimensional approach allows a more thorough evaluation of RAG systems across diverse scenarios and tasks, as sketched below.
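To make the "matrix-based" scenario design concrete, here is a minimal Python sketch of crossing task classes with financial topics to form one evaluation cell per combination. The class and topic names below are illustrative placeholders, not the actual categories defined in the paper (which uses five task classes and 16 topics).

```python
from itertools import product

# Illustrative task classes and topics (the real benchmark defines
# 5 task classes and 16 financial topics; these names are assumptions).
TASK_CLASSES = ["extractive_qa", "multi_hop_reasoning", "comparison",
                "long_form_qa", "conversational_qa"]
TOPICS = ["equities", "bonds", "funds", "insurance", "macroeconomics"]

def build_scenario_matrix(task_classes, topics):
    """Cross every task class with every topic to get one evaluation
    cell per (task, topic) pair, mirroring a matrix-based scenario design."""
    return [{"task": task, "topic": topic, "instances": []}
            for task, topic in product(task_classes, topics)]

matrix = build_scenario_matrix(TASK_CLASSES, TOPICS)
print(f"{len(matrix)} evaluation cells")  # 5 x 5 illustrative cells = 25
```

Each cell would then be filled with generated and human-checked evaluation instances, so results can be broken down by task type and by topic rather than reported as a single aggregate score.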

Why it matters?

This research is important because it helps improve the way we evaluate AI systems in specialized fields like finance. By providing a structured and detailed benchmark, OmniEval can lead to better-performing RAG models that are more reliable and effective in generating accurate financial information. This can ultimately benefit industries that rely on precise data analysis and decision-making.

Abstract

As a typical and practical application of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations on generated instances; (3) a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline; and (4) robust evaluation metrics, both rule-based and LLM-based, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets and highlights the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open-source the code of our benchmark at https://github.com/RUC-NLPIR/OmniEval.
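The abstract's "multi-stage evaluation" scores retrieval and generation separately. Below is a minimal, self-contained sketch of that idea: a rule-based retrieval metric (recall@k) alongside a generation score. The `llm_judge` function here is only a token-overlap stand-in for the fine-tuned LLM evaluator described in the paper; the example instance and document IDs are made up for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Rule-based retrieval metric: fraction of relevant documents
    found in the top-k retrieved results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / max(len(relevant_ids), 1)

def llm_judge(question, reference, answer):
    """Placeholder for an LLM-based evaluator (the paper fine-tunes its own);
    a trivial token-overlap proxy stands in for the model call here."""
    ref_tokens, ans_tokens = set(reference.split()), set(answer.split())
    return len(ref_tokens & ans_tokens) / max(len(ref_tokens), 1)

def evaluate_instance(instance, retrieved_ids, generated_answer):
    """Multi-stage evaluation: score retrieval and generation separately,
    so failures can be attributed to the right stage of the RAG pipeline."""
    return {
        "retrieval_recall@5": recall_at_k(retrieved_ids, instance["relevant_ids"]),
        "generation_score": llm_judge(instance["question"],
                                      instance["reference_answer"],
                                      generated_answer),
    }

# Hypothetical evaluation instance for illustration only.
example = {
    "question": "What does a bond's coupon rate represent?",
    "reference_answer": "the annual interest paid relative to the bond's face value",
    "relevant_ids": ["doc_12"],
}
print(evaluate_instance(example, ["doc_12", "doc_40"],
                        "the annual interest paid on the bond's face value"))
```

Reporting the two scores side by side, per task class and topic, is what lets a benchmark like this show where a RAG system loses accuracy: in finding the right financial documents or in writing the answer from them.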