CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era
Yanlin Feng, Simone Papicchio, Sajjadur Rahman
2024-12-30

Summary
This paper introduces CypherBench, a benchmark and accompanying method for improving how large language models (LLMs) retrieve information from large knowledge graphs, covering both open-domain knowledge and private enterprise data.
What's the problem?
Retrieving information from modern knowledge graphs like Wikidata is difficult for LLMs. Their schemas are far larger than a typical LLM context window, they refer to things by opaque resource identifiers, their relation types overlap, and they lack normalization. As a result, leading LLM frameworks such as LangChain and LlamaIndex offer only minimal support for retrieval from these graphs, which makes it hard for LLMs to answer questions accurately when the answer has to come from such a large dataset.
What's the solution?
To tackle this issue, the authors propose building property graph views on top of the underlying RDF knowledge graph, so that LLMs can query the data with Cypher, a graph query language. They developed an RDF-to-property-graph conversion engine, applied it to Wikidata, and introduced CypherBench, a benchmark of 11 large-scale, multi-domain property graphs with 7.8 million entities and over 10,000 questions. They also built a pipeline for generating text-to-Cypher tasks and designed new evaluation metrics.
Why it matters?
This research is important because it makes the knowledge stored in large knowledge graphs more accessible to AI systems. By giving LLMs a schema they can fit in context and a query language they can write, and by providing a benchmark to measure progress, CypherBench can help improve applications such as question answering and data analysis that depend on precise retrieval from comprehensive data.
Abstract
Retrieval from graph data is crucial for augmenting large language models (LLMs) with both open-domain knowledge and private enterprise data, and it is also a key component in the recent GraphRAG system (Edge et al., 2024). Despite decades of research on knowledge graphs and knowledge base question answering, leading LLM frameworks (e.g. LangChain and LlamaIndex) have only minimal support for retrieval from modern encyclopedic knowledge graphs like Wikidata. In this paper, we analyze the root cause and suggest that modern RDF knowledge graphs (e.g. Wikidata, Freebase) are less efficient for LLMs due to overly large schemas that far exceed the typical LLM context window, use of resource identifiers, overlapping relation types and lack of normalization. As a solution, we propose property graph views on top of the underlying RDF graph that can be efficiently queried by LLMs using Cypher. We instantiated this idea on Wikidata and introduced CypherBench, the first benchmark with 11 large-scale, multi-domain property graphs with 7.8 million entities and over 10,000 questions. To achieve this, we tackled several key challenges, including developing an RDF-to-property graph conversion engine, creating a systematic pipeline for text-to-Cypher task generation, and designing new evaluation metrics.
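To make the idea of a property graph view concrete, the following is a minimal sketch in Python and not the authors' conversion engine. It assumes a toy set of RDF-style triples with opaque, hypothetical identifiers (E1, R_born_in, and so on, which are not real Wikidata IDs) and a small hand-written schema mapping; the function name to_property_graph and all mapping tables are illustrative assumptions.

```python
from collections import defaultdict

# Toy RDF-style triples with opaque, HYPOTHETICAL identifiers (not real Wikidata IDs).
TRIPLES = [
    ("E1", "R_type", "C_person"),
    ("E1", "R_label", "Alice"),
    ("E1", "R_birth_year", 1990),
    ("E2", "R_type", "C_city"),
    ("E2", "R_label", "Springfield"),
    ("E1", "R_born_in", "E2"),
    ("E1", "R_internal_rank", 3),  # not part of the view, so it is dropped
]

# Hypothetical schema mapping: which identifiers the view exposes, and under what names.
TYPE_PREDICATE = "R_type"
NODE_LABELS = {"C_person": "Person", "C_city": "City"}
PROPERTY_NAMES = {"R_label": "name", "R_birth_year": "birth_year"}
RELATION_TYPES = {"R_born_in": "bornIn"}


def to_property_graph(triples):
    """Group triples into labeled nodes with properties and typed edges."""
    nodes = defaultdict(lambda: {"label": None, "properties": {}})
    edges = []
    for subject, predicate, obj in triples:
        if predicate == TYPE_PREDICATE:
            nodes[subject]["label"] = NODE_LABELS[obj]
        elif predicate in PROPERTY_NAMES:
            nodes[subject]["properties"][PROPERTY_NAMES[predicate]] = obj
        elif predicate in RELATION_TYPES:
            edges.append((subject, RELATION_TYPES[predicate], obj))
        # Any other predicate is outside the view's schema and is skipped.
    return dict(nodes), edges


nodes, edges = to_property_graph(TRIPLES)

# A Cypher query an LLM could write against this small, readable schema.
EXAMPLE_CYPHER = (
    "MATCH (p:Person)-[:bornIn]->(c:City {name: 'Springfield'}) "
    "RETURN p.name"
)

print(nodes)
print(edges)
print(EXAMPLE_CYPHER)
```

The point of the sketch is that the resulting schema (two node labels, two properties, one relation type) is short enough to place in a prompt, and the query refers to readable names rather than resource identifiers.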