GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Matryoshka Representation Learning and Hybrid Loss Training
Omer Nacar, Anis Koubaa, Serry Sibaee, Yasser Al-Habashi, Adel Ammar, Wadii Boulila
2025-06-02
Summary
This paper introduces GATE, a new approach that helps computers better understand the meaning of Arabic texts by using advanced techniques to measure how similar two pieces of writing are.
What's the problem?
The problem is that most language models and tools are designed mainly for English or other widely spoken languages, so they perform noticeably worse when trying to judge whether two Arabic texts mean the same thing or are closely related.
What's the solution?
The researchers created GATE, which uses a special method called Matryoshka Representation Learning and a hybrid loss training strategy. These techniques help the model learn deeper and more accurate representations of Arabic text, allowing it to compare meanings much more effectively and achieve top results on tests for Arabic text similarity.
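The core idea behind Matryoshka Representation Learning is that a single embedding is trained so that its leading dimensions also work as smaller, standalone embeddings, letting you trade accuracy for speed by truncating. The sketch below illustrates that truncate-and-compare step only; the dimension sizes and random vectors are illustrative assumptions, not values from the paper.

```python
import numpy as np

def truncate_and_normalize(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and re-normalize to unit length."""
    sub = embedding[:dim]
    return sub / np.linalg.norm(sub)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two unit-normalized vectors."""
    return float(np.dot(a, b))

rng = np.random.default_rng(0)
full_a = rng.normal(size=768)  # stand-in for a full-size sentence embedding
full_b = rng.normal(size=768)

# Matryoshka-style nested sizes: each prefix is itself a usable embedding,
# so smaller dimensions give cheaper (if slightly less precise) comparisons.
for dim in (64, 128, 256, 768):
    sim = cosine_similarity(truncate_and_normalize(full_a, dim),
                            truncate_and_normalize(full_b, dim))
    print(f"dim={dim}: similarity={sim:.3f}")
```

In practice this is why a Matryoshka-trained model can serve fast retrieval with short prefixes and more precise re-ranking with the full vector, without storing separate embeddings per size.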
Why it matters?
This is important because it helps improve search engines, translation tools, and other AI systems for Arabic speakers, making it easier for people to find information, communicate, and use technology in their own language.
Abstract
GATE models using Matryoshka Representation Learning and a hybrid loss approach achieve state-of-the-art performance on Arabic Semantic Textual Similarity benchmarks.