BM25S: Orders of magnitude faster lexical search via eager sparse scoring

Xing Han Lù

2024-07-10

BM25S: Orders of magnitude faster lexical search via eager sparse scoring

Summary

This paper talks about BM25S, a new and efficient system for searching through text documents quickly. It is designed to be much faster than existing methods by using a special technique called eager sparse scoring.

What's the problem?

The main problem is that traditional methods for searching through large amounts of text can be very slow and inefficient, especially when trying to find relevant documents based on specific keywords. Existing frameworks often take a long time to compute the relevance of documents, which can be frustrating for users who need quick results.

What's the solution?

To solve this issue, the authors developed BM25S, which uses a Python-based implementation that relies only on Numpy and Scipy. This new system can compute relevance scores for documents much faster—up to 500 times quicker than the most popular existing Python frameworks. BM25S does this by calculating scores while indexing the documents and storing them in a way that requires less memory. Additionally, it includes improvements that allow it to handle different types of scoring methods effectively.

Why it matters?

This research is important because it significantly enhances the speed and efficiency of text searching, making it easier for users to find relevant information quickly. This improvement can benefit various applications, such as search engines, data analysis tools, and any system that requires fast document retrieval. As the amount of digital information continues to grow, having faster search capabilities will be crucial for effective information access.

Abstract

We introduce BM25S, an efficient Python-based implementation of BM25 that only depends on Numpy and Scipy. BM25S achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them into sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at https://github.com/xhluca/bm25s

View Paper