mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, Min Zhang
2024-07-29

Summary
This paper introduces mGTE, a set of models for representing and reranking long texts in many languages. It aims to improve how we search for and rank text across languages by pre-training a long-context multilingual encoder and then building a text representation model and a reranker on top of it.
What's the problem?
Retrieving relevant information from long texts in various languages is difficult because existing models often struggle with long contexts and multilingual data. Many multilingual encoders are limited to short inputs (typically 512 tokens) and do not handle the complexities of different languages well, which leads to poor retrieval results.
What's the solution?
To solve this problem, the authors built a long-context multilingual text representation model (TRM) and a reranker from scratch. They first pre-trained a text encoder that natively handles up to 8192 tokens, far beyond the 512-token limit of previous multilingual encoders, using RoPE position embeddings and unpadding for efficiency. They then used contrastive learning to turn this encoder into a hybrid TRM and a cross-encoder reranker. Their evaluations showed that the encoder outperforms the same-sized XLM-R, while the TRM and reranker match larger state-of-the-art BGE-M3 models and achieve better results on long-context retrieval benchmarks, all while being more efficient during training and inference.
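The paper's exact training recipe is not reproduced here, but the contrastive learning it relies on is commonly implemented as an InfoNCE-style loss with in-batch negatives. The PyTorch sketch below is a minimal illustration under that assumption; the function name, temperature, and negative-sampling scheme are illustrative choices, not mGTE's actual settings.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """InfoNCE-style contrastive loss with in-batch negatives.

    query_emb, doc_emb: (batch, dim) tensors where doc_emb[i] is the
    positive passage for query_emb[i]; all other rows in the batch act
    as negatives. The temperature here is illustrative, not the paper's.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.t() / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)    # diagonal entries are the positives
```

In this setup, larger batches provide more negatives per query, which is one reason efficient long-context encoding matters for training throughput.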
Why it matters?
This research is important because it enhances the ability to retrieve and understand information in multiple languages, which is crucial in our globalized world. By improving how we process long texts, mGTE can help various applications, including search engines, translation services, and any system that needs to understand complex information across different languages.
Abstract
We present systematic efforts in building a long-context multilingual text representation model (TRM) and reranker from scratch for text retrieval. We first introduce a text encoder (base size) enhanced with RoPE and unpadding, pre-trained in a native 8192-token context (longer than the 512 tokens of previous multilingual encoders). Then we construct a hybrid TRM and a cross-encoder reranker by contrastive learning. Evaluations show that our text encoder outperforms the same-sized previous state-of-the-art XLM-R. Meanwhile, our TRM and reranker match the performance of large-sized state-of-the-art BGE-M3 models and achieve better results on long-context retrieval benchmarks. Further analysis demonstrates that our proposed models exhibit higher efficiency during both training and inference. We believe their efficiency and effectiveness could benefit various research and industrial applications.
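For readers unfamiliar with RoPE, rotary position embeddings rotate pairs of channels by position-dependent angles so that relative position is encoded directly in query-key dot products. The following is a minimal, self-contained PyTorch sketch of the idea; the base frequency of 10000 is the common default, and mGTE's long-context configuration (such as any adjusted RoPE base) is not reproduced here.

```python
import torch

def apply_rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Channel pairs are rotated by an angle that grows with position, so
    relative offsets surface in attention scores. The base value is the
    common default; long-context models often enlarge it, and mGTE's
    exact setting is not reproduced here.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Unpadding, the other efficiency technique named in the abstract, complements this by removing padding tokens from batches so attention is computed only over real tokens.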