Evaluating D-MERIT of Partial-annotation on Information Retrieval
Royi Rassin, Yaron Fairstein, Oren Kalinsky, Guy Kushilevitz, Nachshon Cohen, Alexander Libov, Yoav Goldberg
2024-06-25

Summary
This paper discusses D-MERIT, a new dataset created to improve how we evaluate information retrieval systems, like search engines. It focuses on the problems caused by using datasets that only partially annotate relevant information.
What's the problem?
Many retrieval models are tested on datasets where only a handful of relevant texts are labeled for each query, and everything else is assumed to be irrelevant. This can lead to unfair evaluations: a model that retrieves one of these unlabeled relevant texts (a false negative in the annotations) is penalized, because the retrieved text counts as a non-match. At the same time, labeling every text for every query is not practical, which makes accurate evaluation hard to achieve.
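To make this penalty concrete, here is a minimal sketch comparing two hypothetical systems under partial versus complete relevance judgments, using precision@k as the metric. The passage IDs, judgments, and systems are invented for illustration; this is not the paper's evaluation code.

```python
# A minimal sketch of how partial annotation can distort evaluation;
# passage IDs and judgments are invented, not taken from the paper.

def precision_at_k(ranked_ids, relevant_ids, k=3):
    """Fraction of the top-k retrieved passages that are judged relevant."""
    return sum(pid in relevant_ids for pid in ranked_ids[:k]) / k

# Two hypothetical systems answering the same query.
system_a = ["p1", "p7", "p9"]   # p7 is relevant but was never annotated
system_b = ["p1", "p4", "p9"]   # p4 is genuinely irrelevant

partial_qrels = {"p1"}           # only one relevant passage was labeled
complete_qrels = {"p1", "p7"}    # in reality, p7 is relevant too

for name, run in [("A", system_a), ("B", system_b)]:
    print(name,
          precision_at_k(run, partial_qrels),    # both ~0.33: A's extra hit is "punished"
          precision_at_k(run, complete_qrels))   # A: ~0.67, B: ~0.33 -> true ordering emerges
```

Under the partial judgments the two systems look identical, even though system A found an additional relevant passage; only the complete judgments reveal the difference.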
What's the solution?
The authors created the D-MERIT dataset, a passage retrieval evaluation set built from Wikipedia that aims to include all relevant passages for each query. Each query describes a group (for example, 'journals about linguistics'), and the relevant passages are evidence that specific entities belong to that group. The authors show that evaluating on only a subset of the relevant passages can produce a misleading ranking of retrieval systems, and that the rankings converge as more relevant passages are added to the evaluation set. Their study recommends striking a balance between resource efficiency and reliable evaluation.
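For a sense of what such an evaluation entry could look like, here is a hypothetical sketch; the field names and passage text are assumptions for illustration and may not match the released dataset format.

```python
# A hypothetical sketch of one D-MERIT-style entry; the structure and
# wording are invented for illustration, not the actual dataset schema.
example_entry = {
    "query": "journals about linguistics",
    "relevant_passages": [
        {
            "entity": "Language",
            "passage": ("Language is a peer-reviewed journal of linguistics, "
                        "published by the Linguistic Society of America."),
        },
        # ...aspiring to evidence passages for every entity in the group,
        # not just a hand-picked subset.
    ],
}

# Evaluation can then ask which group members a retriever found evidence for.
entities_covered = {p["entity"] for p in example_entry["relevant_passages"]}
print(entities_covered)  # {'Language'}
```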
Why it matters?
This research is important because it highlights the limitations of current evaluation methods for information retrieval systems. By providing a more completely annotated dataset, D-MERIT enables more reliable testing of retrieval systems, which in turn supports better real-world applications such as search engines.
Abstract
Retrieval models are often evaluated on partially-annotated datasets. Each query is mapped to a few relevant texts and the remaining corpus is assumed to be irrelevant. As a result, models that successfully retrieve false negatives are punished in evaluation. Unfortunately, completely annotating all texts for every query is not resource efficient. In this work, we show that using partially-annotated datasets in evaluation can paint a distorted picture. We curate D-MERIT, a passage retrieval evaluation set from Wikipedia, aspiring to contain all relevant passages for each query. Queries describe a group (e.g., "journals about linguistics") and relevant passages are evidence that entities belong to the group (e.g., a passage indicating that Language is a journal about linguistics). We show that evaluating on a dataset containing annotations for only a subset of the relevant passages might result in misleading ranking of the retrieval systems and that as more relevant texts are included in the evaluation set, the rankings converge. We propose our dataset as a resource for evaluation and our study as a recommendation for balance between resource-efficiency and reliable evaluation when annotating evaluation sets for text retrieval.
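As an illustration of the convergence claim, the sketch below compares system rankings under increasingly complete judgments using Kendall's tau. The per-system scores are invented and the system names are placeholders; this is not the authors' experimental code.

```python
# A minimal sketch of the convergence check: as annotation coverage grows,
# does the ranking of systems stabilize? Scores below are invented.
from scipy.stats import kendalltau

scores_by_coverage = {
    0.25: {"BM25": 0.41, "DPR": 0.39, "Contriever": 0.44},
    0.50: {"BM25": 0.40, "DPR": 0.43, "Contriever": 0.45},
    1.00: {"BM25": 0.38, "DPR": 0.44, "Contriever": 0.46},
}

def ranking(scores):
    # Systems ordered from best to worst score.
    return sorted(scores, key=scores.get, reverse=True)

full_ranking = ranking(scores_by_coverage[1.00])
for coverage, scores in scores_by_coverage.items():
    partial_ranking = ranking(scores)
    systems = list(scores)
    # Kendall's tau between rank positions (1.0 = identical ordering).
    tau, _ = kendalltau([full_ranking.index(s) for s in systems],
                        [partial_ranking.index(s) for s in systems])
    print(f"coverage={coverage:.2f}  ranking={partial_ranking}  tau vs. full={tau:.2f}")
```

With these made-up numbers the ranking under 25% coverage disagrees with the full ranking, while the 50% and 100% rankings already agree, mirroring the paper's point that sparser annotation can mislead while fuller annotation stabilizes the comparison.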