HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks
Adnan El Assadi, Isaac Chung, Roman Solomatin, Niklas Muennighoff, Kenneth Enevoldsen
2025-10-14
Summary
This paper investigates how well current computer models understand the meaning of text compared to humans, focusing on 'text embeddings', which represent words and sentences as numbers. It highlights that model performance is rarely compared directly to human performance in this area.
What's the problem?
Text embedding models are usually evaluated by comparing them to each other, but a key piece is missing: we don't have a good way to measure how *humans* would perform on the same tasks. Without knowing what a human could reasonably achieve, it's hard to tell whether a model's score is actually good or just reflects a difficult task. Existing benchmarks don't provide this human baseline.
What's the solution?
The researchers created a new framework called HUME (Human Evaluation Framework for Text Embeddings) to measure human performance on a variety of text embedding tasks. They tested people on 16 different datasets, covering things like re-ranking search results, classifying text, grouping similar texts, and determining how similar two pieces of text are. They then compared human scores to the scores of the best embedding models, across many different languages.
Why it matters?
This work is important because it gives us a realistic benchmark for evaluating text embedding models. It shows that while models are often close to human-level performance, they sometimes struggle, especially with languages that don't have a lot of online resources. By understanding where models succeed and fail compared to humans, we can improve both the models themselves and the tests we use to evaluate them, ultimately leading to better AI that understands language more like we do.
Abstract
Comparing human and model performance offers a valuable perspective for understanding the strengths and limitations of embedding models, highlighting where they succeed and where they fail to capture meaning and nuance. However, such comparisons are rarely made, as human performance on embedding tasks is difficult to measure. To fill this gap, we introduce HUME: Human Evaluation Framework for Text Embeddings. While frameworks like MTEB provide broad model evaluation, they lack reliable estimates of human performance, limiting the interpretability of model scores. We measure human performance across 16 MTEB datasets spanning reranking, classification, clustering, and semantic textual similarity across linguistically diverse high- and low-resource languages. Humans achieve an average performance of 77.6% compared to 80.1% for the best embedding model, although variation is substantial: models reach near-ceiling performance on some datasets while struggling on others, suggesting dataset issues and revealing shortcomings in low-resource languages. We provide human performance baselines, insight into task difficulty patterns, and an extensible evaluation framework that enables more meaningful interpretation of model results and informs the development of both models and benchmarks. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.