F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
2026-03-20
Summary
This paper introduces F2LLM-v2, a new set of computer models designed to understand and represent the meaning of text in many different languages.
What's the problem?
Existing models that convert text into a numerical format (called embeddings) often don't work well with languages that aren't widely used online, and they can be very large and require a lot of computing power. This limits their usefulness for many applications and researchers.
What's the solution?
The researchers created F2LLM-v2 using a combination of techniques. They trained the models on a huge collection of text in more than 200 languages, with extra attention to those less represented online. They also used a two-stage training process, a method called 'matryoshka learning' that lets an embedding be truncated to a shorter vector without retraining, model pruning to shrink the networks, and knowledge distillation to transfer what the larger models learned into the smaller ones. The result is a family of models that are both accurate and relatively small.
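To make the combination of matryoshka learning and knowledge distillation concrete, here is a minimal, hypothetical sketch of such a loss. The function name, prefix lengths, and the choice of cosine distance are illustrative assumptions, not details from the paper: the key idea shown is that every nested prefix of the student embedding is supervised against the matching prefix of a teacher embedding, so a truncated vector remains useful.

```python
import math

def _cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def matryoshka_distill_loss(student, teacher, dims=(64, 128, 256)):
    # Illustrative sketch (not the paper's implementation): for each
    # prefix length d, score the first d dimensions of the student
    # against the first d dimensions of the teacher with
    # 1 - cosine similarity, then average over all prefix lengths.
    # Supervising every prefix is what lets the deployed embedding be
    # truncated to any of these sizes at inference time.
    losses = [1.0 - _cosine(student[:d], teacher[:d]) for d in dims]
    return sum(losses) / len(losses)
```

In a real training setup this term would typically be combined with a contrastive objective computed at each prefix length as well; the sketch isolates only the distillation component for clarity.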
Why it matters?
F2LLM-v2 is important because it provides high-quality text understanding for a much wider range of languages than previous models, especially those that haven't been well-supported before. The smaller models are also more practical for use on devices with limited resources, like phones or laptops, and the researchers are making all their work freely available to encourage further research in this area.
Abstract
We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.