Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
Yancheng He, Shilong Li, Jiaheng Liu, Yingshui Tan, Hui Huang, Weixun Wang, Xingyuan Bu, Hangyu Guo, Chengwei Hu, Boren Zheng, Xuepeng Liu, Dekai Sun, Wenbo Su, Bo Zheng
2024-11-12

Summary
This paper introduces Chinese SimpleQA, a new benchmark designed to evaluate how well large language models (LLMs) can provide factual answers to short questions in Chinese.
What's the problem?
As LLMs rapidly develop, there is a need for effective evaluation methods to measure their ability to answer questions accurately. However, most existing factuality benchmarks focus on English, cover only a narrow range of topics, or lack high-quality test data. This makes it difficult to assess how well these models understand and generate factual information in Chinese.
What's the solution?
Chinese SimpleQA addresses these issues by providing a comprehensive dataset that spans 99 diverse subtopics across six major topics in Chinese. The benchmark consists of high-quality, static questions whose reference answers do not change over time. Because both questions and answers are short and factual, responses can be graded automatically with an LLM judge via the OpenAI API. This benchmark enables researchers to better understand the factual capabilities of their models and helps guide future development.
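The grading step described above can be sketched as follows. This is a hypothetical illustration of SimpleQA-style automated grading, not the paper's actual implementation: an LLM judge compares the model's answer against the static reference answer and returns one of three verdicts (correct, incorrect, not attempted), which are then aggregated into simple metrics. All names and the prompt wording are assumptions for illustration.

```python
# Hypothetical sketch of SimpleQA-style grading and scoring.
# The prompt template, verdict labels, and metric names are illustrative.

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {answer}\n"
    "Grade the model answer as one of:\n"
    "A) CORRECT  B) INCORRECT  C) NOT_ATTEMPTED\n"
    "Reply with a single letter."
)

VERDICTS = {"A": "correct", "B": "incorrect", "C": "not_attempted"}


def build_prompt(question: str, reference: str, answer: str) -> str:
    """Fill the grading template for one (question, answer) pair.

    The filled prompt would be sent to an LLM judge (e.g. via the
    OpenAI API); the network call is omitted here.
    """
    return JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)


def parse_verdict(judge_reply: str) -> str:
    """Map the judge's single-letter reply to a verdict label."""
    letter = judge_reply.strip().upper()[:1]
    return VERDICTS.get(letter, "not_attempted")


def score(verdicts: list[str]) -> dict[str, float]:
    """Aggregate per-question verdicts into benchmark-level metrics."""
    n = len(verdicts)
    correct = verdicts.count("correct")
    attempted = n - verdicts.count("not_attempted")
    return {
        # fraction correct over all questions
        "correct": correct / n if n else 0.0,
        # fraction correct among questions the model actually attempted
        "correct_given_attempted": correct / attempted if attempted else 0.0,
    }
```

Separating "incorrect" from "not attempted" lets the evaluation distinguish models that hallucinate from models that abstain, which is why short, closed-form reference answers make the benchmark easy to evaluate.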
Why it matters?
This research is significant because it fills a gap in the evaluation of language models specifically for the Chinese language. By providing a reliable way to assess how well these models can answer factual questions, Chinese SimpleQA can help developers create more accurate and effective AI systems, ultimately leading to better applications in education, customer service, and more.
Abstract
New LLM evaluation benchmarks are important to keep pace with the rapid development of Large Language Models (LLMs). In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark for evaluating the ability of language models to answer short factual questions. Chinese SimpleQA has five main properties (i.e., Chinese, diverse, high-quality, static, easy-to-evaluate). Specifically, first, we focus on the Chinese language, covering 6 major topics with 99 diverse subtopics. Second, we conduct a comprehensive quality-control process to obtain high-quality questions and answers, where the reference answers are static and do not change over time. Third, following SimpleQA, the questions and answers are very short, and the grading process is easy to evaluate using the OpenAI API. Based on Chinese SimpleQA, we perform a comprehensive evaluation of the factual abilities of existing LLMs. Finally, we hope that Chinese SimpleQA can guide developers to better understand the Chinese factuality abilities of their models and facilitate the growth of foundation models.