Trove: A Flexible Toolkit for Dense Retrieval
Reza Esfandiarpoor, Max Zuo, Stephen H. Bach
2025-11-04
Summary
This paper introduces Trove, an open-source software toolkit that makes it easier for researchers to run experiments with retrieval systems, the technology behind things like search engines.
What's the problem?
Working with the large datasets that information retrieval research requires is difficult. Researchers spend a lot of time preparing and managing data, and they often compute and store multiple copies of a dataset just to test different configurations. This consumes memory and storage, slows down research, and makes customization hard without rewriting a lot of code.
What's the solution?
Trove addresses this by providing tools to load, filter, select, transform, and combine data 'on the fly,' meaning as it's needed, instead of computing and storing everything in advance. It's designed to be flexible, so researchers can modify existing components or replace them with their own. It also simplifies evaluating a retrieval system and mining 'hard negatives' (non-relevant documents that the system mistakenly ranks highly), and it can run these tasks across multiple machines without any code changes. According to the authors, Trove's data management reduces memory consumption by a factor of 2.6, its inference pipeline adds no overhead, and inference time decreases linearly as more nodes become available.
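To make 'on the fly' concrete, here is a minimal sketch of lazy filtering and transformation using plain Python generators. It is not Trove's API: the function names (`load_records`, `filter_records`, `transform_records`), the record fields, and the toy data are all hypothetical and only illustrate the general idea of deriving a dataset configuration without writing a second copy of the data.

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, object]

# Hypothetical toy data standing in for a large retrieval dataset.
RAW = [
    {"query": "what is dense retrieval",
     "doc": "Dense retrieval encodes text as vectors.", "score": 3},
    {"query": "what is dense retrieval",
     "doc": "Bananas are yellow.", "score": 0},
    {"query": "bm25 formula",
     "doc": "BM25 is a lexical ranking function.", "score": 2},
]


def load_records(rows: Iterable[Record]) -> Iterator[Record]:
    """Yield records one at a time (stand-in for reading a file lazily)."""
    for row in rows:
        yield row


def filter_records(records: Iterable[Record],
                   keep: Callable[[Record], bool]) -> Iterator[Record]:
    """Lazily keep only the records for which `keep` returns True."""
    return (r for r in records if keep(r))


def transform_records(records: Iterable[Record],
                      fn: Callable[[Record], Record]) -> Iterator[Record]:
    """Lazily apply `fn` to each record, producing new records."""
    return (fn(r) for r in records)


def positives_lowercased(rows: Iterable[Record]) -> Iterator[Record]:
    """One dataset 'configuration': relevant pairs with lowercased documents."""
    records = load_records(rows)
    records = filter_records(records, lambda r: r["score"] >= 1)
    records = transform_records(records, lambda r: {**r, "doc": r["doc"].lower()})
    return records


# Nothing is computed until the pipeline is consumed.
result = list(positives_lowercased(RAW))
```

Because each stage is a generator, a different configuration (say, a different score threshold) is just another small function over the same raw data, and the original records are never modified or duplicated on disk.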
Why does it matter?
Trove is important because it lowers the barrier to entry for research in information retrieval. By making data management and experimentation easier, it allows researchers to focus on exploring new ideas and improving search technology, rather than getting bogged down in technical details. It speeds up the research process and encourages more customization and innovation.
Abstract
We introduce Trove, an easy-to-use open-source retrieval toolkit that simplifies research experiments without sacrificing flexibility or speed. For the first time, we introduce efficient data management features that load and process (filter, select, transform, and combine) retrieval datasets on the fly, with just a few lines of code. This gives users the flexibility to easily experiment with different dataset configurations without the need to compute and store multiple copies of large datasets. Trove is highly customizable: in addition to many built-in options, it allows users to freely modify existing components or replace them entirely with user-defined objects. It also provides a low-code and unified pipeline for evaluation and hard negative mining, which supports multi-node execution without any code changes. Trove's data management features reduce memory consumption by a factor of 2.6. Moreover, Trove's easy-to-use inference pipeline incurs no overhead, and inference times decrease linearly with the number of available nodes. Most importantly, we demonstrate how Trove simplifies retrieval experiments and allows for arbitrary customizations, thus facilitating exploratory research.
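For readers unfamiliar with hard negative mining, the following is a minimal sketch of the general technique, not Trove's pipeline: score every document against a query, rank by score, and keep the top-ranked documents that are not labeled relevant. The function names, the dot-product scoring, and the toy two-dimensional embeddings are assumptions made purely for illustration.

```python
from typing import Dict, List, Sequence, Set


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(query_vec: Sequence[float],
                        doc_vecs: Dict[str, Sequence[float]],
                        positives: Set[str],
                        k: int) -> List[str]:
    """Return the k highest-scoring documents that are NOT labeled relevant."""
    ranked = sorted(doc_vecs,
                    key=lambda doc_id: dot(query_vec, doc_vecs[doc_id]),
                    reverse=True)
    return [doc_id for doc_id in ranked if doc_id not in positives][:k]


# Hypothetical toy embeddings; a real system would use an encoder model.
query = [1.0, 0.0]
docs = {
    "d1": [0.9, 0.1],  # labeled relevant
    "d2": [0.8, 0.3],
    "d3": [0.1, 0.9],
    "d4": [0.7, 0.2],
}
positives = {"d1"}

hard_negatives = mine_hard_negatives(query, docs, positives, k=2)
print(hard_negatives)  # ['d2', 'd4']
```

In practice the scoring step is the expensive part, which is why distributing it across nodes, as the abstract describes, matters for large corpora.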