jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao

2024-09-17

Summary

This paper introduces jina-embeddings-v3, a new model for creating text embeddings that can understand and process multiple languages effectively, while also handling long pieces of text.

What's the problem?

Many existing embedding models struggle with multilingual data and long-context retrieval tasks, meaning they can't produce reliable representations for texts that are longer than usual or written in different languages. This limits their usefulness in real-world applications where diverse languages and long documents are common.

What's the solution?

jina-embeddings-v3 uses a technique called Low-Rank Adaptation (LoRA): small task-specific adapter modules that tune the model for particular uses such as query-document retrieval, clustering, classification, and text matching. The model has 570 million parameters and accepts inputs of up to 8192 tokens, so it can embed long documents. Its training also incorporates Matryoshka Representation Learning, which allows the embeddings to be truncated to smaller dimensions without a significant loss in quality. Together, these choices make it flexible and efficient across a wide range of language tasks.
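
To make the idea of task-specific LoRA adapters concrete, here is a minimal, hypothetical PyTorch sketch: a frozen linear layer is combined with a small low-rank update, and one such adapter is kept per task while the backbone weights are shared. The layer sizes, rank, and task names are illustrative assumptions, not the model's actual configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, in_features: int, out_features: int, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        # Frozen pretrained weight: not updated during task-specific fine-tuning.
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Low-rank matrices A and B are the only trainable parameters.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = frozen projection + scaled low-rank update (B A) x.
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# One small adapter per task; the large backbone is shared by all of them.
tasks = ["retrieval", "clustering", "classification", "text-matching"]
adapters = {t: LoRALinear(1024, 1024) for t in tasks}

x = torch.randn(2, 1024)            # a batch of pooled text features (illustrative)
emb = adapters["retrieval"](x)      # route through the adapter for the chosen task
print(emb.shape)                    # torch.Size([2, 1024])
```

Because only the small A and B matrices differ between tasks, switching tasks costs little memory compared to fine-tuning a separate full model for each one.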

Why it matters?

This research is important because it significantly improves how machines understand and work with multiple languages and lengthy texts. By outperforming leading proprietary and open-source embedding models on both English and multilingual benchmarks, jina-embeddings-v3 can enhance applications like translation services, search engines, and any software that needs to handle diverse language inputs effectively.

Abstract

We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks.
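
The "flexible truncation" enabled by Matryoshka Representation Learning means a consumer of the embeddings can keep only the leading dimensions and re-normalize. Below is a minimal illustrative sketch of that post-processing step; the 1024 and 256 dimensions are assumptions chosen for the example, not the model's exact output specification.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(0)

# Stand-ins for two full-size embeddings of related texts.
a = rng.normal(size=1024)
b = a + 0.05 * rng.normal(size=1024)

# Matryoshka-style truncation: smaller vectors, cheaper storage and search.
a_small = truncate_embedding(a, 256)
b_small = truncate_embedding(b, 256)

# Cosine similarity still works in the reduced space.
print(float(a_small @ b_small))
```

In practice this lets the same model serve both high-accuracy and low-latency settings by choosing the embedding size per application.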