CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China
Guixian Xu, Zeli Su, Ziyin Zhang, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
2025-09-15
Summary
This paper addresses the difficulty of automatically generating news headlines in languages spoken by minority groups in China, such as Tibetan, Uyghur, and Mongolian.
What's the problem?
Training a good headline generator requires many example pairs of articles and headlines. These minority languages have few digital resources, and in particular lack the large text collections (called corpora) that this kind of task needs. Their writing systems differ from commonly used ones, which makes such resources hard to build and, in turn, makes it difficult to create effective headline generators for these languages.
What's the solution?
The researchers created a new dataset called Chinese Minority Headline Generation (CMHG). This dataset includes 100,000 examples for Tibetan, and 50,000 examples each for Uyghur and Mongolian, all specifically designed for training headline generation programs. They also created a separate, high-quality set of examples checked by native speakers to test how well these programs perform.
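To make the dataset composition concrete, here is a minimal Python sketch that records the per-language sizes stated above and computes the total and each language's share. The `HeadlineExample` schema and all field and function names are assumptions made for illustration; the summary does not describe the released file format.

```python
from dataclasses import dataclass

# Entries per language, as stated in the summary above.
CMHG_SIZES = {"Tibetan": 100_000, "Uyghur": 50_000, "Mongolian": 50_000}


@dataclass(frozen=True)
class HeadlineExample:
    """One article/headline pair (hypothetical schema, not the released format)."""
    language: str
    article: str
    headline: str


def total_examples(sizes: dict[str, int]) -> int:
    """Total number of entries across all languages."""
    return sum(sizes.values())


def language_share(sizes: dict[str, int]) -> dict[str, float]:
    """Fraction of the dataset contributed by each language."""
    total = total_examples(sizes)
    return {lang: count / total for lang, count in sizes.items()}


if __name__ == "__main__":
    print(total_examples(CMHG_SIZES))   # 200000
    print(language_share(CMHG_SIZES))   # Tibetan 0.5, Uyghur 0.25, Mongolian 0.25
```

Under these figures, Tibetan accounts for half of the 200,000 entries, so any per-language comparison of model quality should keep that imbalance in mind.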
Why does it matter?
The dataset gives researchers a much-needed resource for developing language technology for these minority languages. With it, they can build and test headline generation tools, which can help preserve and promote these languages in the digital world, and the native-speaker-checked test set provides a benchmark against which future improvements can be measured.
Abstract
Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.