MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models
Chengguang Gan, Qingyu Yin, Xinyang He, Hanjun Wei, Yunhao Liang, Younghun Lim, Shijian Wang, Hexiang Huang, Qinghao Zhang, Shiwen Ni, Tatsunori Mori
2024-07-16

Summary
This paper introduces the Multilingual Mutual Reinforcement Effect Mix (MMM) dataset, which aims to improve how large language models (LLMs) process and understand information across different languages.
What's the problem?
Datasets for studying the Mutual Reinforcement Effect (MRE), the phenomenon in which sentence-level classification and word-level extraction tasks improve each other when learned together, have so far been available only in Japanese, which keeps most of the global research community from exploring the concept fully. This lack of multilingual resources also makes it difficult to study how different languages can enhance a model's ability to extract information.
What's the solution?
To tackle this issue, the authors created the MMM dataset, which comprises 21 sub-datasets in English, Japanese, and Chinese. They also developed an LLM-assisted translation method that converts the original Japanese datasets into other languages, greatly reducing the manual annotation time needed to construct training data (a sketch of this step appears below). Additionally, they enriched the dataset with open-domain Named Entity Recognition (NER) and sentence classification tasks. Training on this expanded data with a unified input-output format yields an Open-domain Information Extraction Large Language Model (OIELLM) that can make effective use of the new MMM dataset.
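To make the translation step concrete, here is a minimal sketch of what an LLM-assisted translation pass over one annotated Japanese example might look like. The prompt wording, model choice, and record fields are assumptions for illustration; the paper's actual pipeline and any label-alignment checks may differ.

```python
# Minimal sketch of LLM-assisted dataset translation (illustrative only).
# Assumes the openai Python package (>= 1.0) and an OPENAI_API_KEY in the
# environment; the prompt and model name are hypothetical choices.
from openai import OpenAI

client = OpenAI()

def translate_example(japanese_text: str, target_language: str) -> str:
    """Translate one source sentence from a Japanese MRE mix dataset."""
    prompt = (
        f"Translate the following Japanese sentence into {target_language}. "
        "Return only the translation, with no extra commentary.\n\n"
        f"{japanese_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # hypothetical; any capable LLM could fill this role
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output keeps labels easier to verify
    )
    return response.choices[0].message.content.strip()

# Produce English and Chinese versions of one annotated sentence, reusing
# the original label so only the translated text needs manual review.
record = {"text": "東京でサッカーの試合が開催された。", "label": "sports"}
for lang in ("English", "Chinese"):
    translated = translate_example(record["text"], lang)
    print({"text": translated, "label": record["label"], "language": lang})
```

Because the labels are carried over unchanged, human effort shifts from full annotation to spot-checking translations, which is where the reported time savings come from.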
Why it matters?
This research is important because it expands access to valuable training data for LLMs in multiple languages, enhancing their ability to understand and extract information from diverse sources. By improving how these models learn from mixed-language datasets, it can lead to better performance in real-world applications like translation services, search engines, and data analysis tools.
Abstract
The Mutual Reinforcement Effect (MRE) represents a promising avenue in information extraction and multitasking research. Nevertheless, its applicability has been constrained due to the exclusive availability of MRE mix datasets in Japanese, thereby limiting comprehensive exploration by the global research community. To address this limitation, we introduce a Multilingual MRE mix dataset (MMM) that encompasses 21 sub-datasets in English, Japanese, and Chinese. In this paper, we also propose a method for dataset translation assisted by Large Language Models (LLMs), which significantly reduces the manual annotation time required for dataset construction by leveraging LLMs to translate the original Japanese datasets. Additionally, we have enriched the dataset by incorporating open-domain Named Entity Recognition (NER) and sentence classification tasks. Utilizing this expanded dataset, we developed a unified input-output framework to train an Open-domain Information Extraction Large Language Model (OIELLM). The OIELLM model demonstrates the capability to effectively process novel MMM datasets, exhibiting significant improvements in performance.
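The "unified input-output framework" mentioned above means every sub-task is cast as the same text-to-text pattern. The sketch below shows one plausible serialization of an MRE mix example, where a single output string carries both the sentence-level label and the word-level extractions; the field names, task tag, and separators are hypothetical, since the exact format is not specified here.

```python
# Illustrative only: one way to serialize an MRE mix example into a single
# text-to-text training pair. Every name and separator here is an assumption,
# not the format used by the authors' OIELLM.

def build_training_pair(sentence: str, task: str, label: str,
                        entities: list[tuple[str, str]]) -> dict:
    """Pack a sentence, its sentence-level label, and its word-level
    extractions into one input/output string pair."""
    entity_str = "; ".join(f"{span} ({etype})" for span, etype in entities)
    return {
        "input": f"Task: {task}\nText: {sentence}",
        # One output string carries both annotation levels, which is what
        # lets a single model multitask across all 21 sub-datasets.
        "output": f"Label: {label}\nEntities: {entity_str}",
    }

pair = build_training_pair(
    "The match between Japan and Brazil was held in Tokyo.",
    "sentence classification + open-domain NER",
    "sports",
    [("Japan", "country"), ("Brazil", "country"), ("Tokyo", "city")],
)
print(pair["input"])
print(pair["output"])
```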