PaSa: An LLM Agent for Comprehensive Academic Paper Search

Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E

2025-01-20

Summary

This paper introduces PaSa, an AI agent that helps people find academic research papers more effectively than existing search engines or AI assistants. It works like a research assistant that can understand complex questions and track down the most relevant papers on its own.

What's the problem?

Finding the right academic papers for research can be really hard, even with tools like Google Scholar. It's tough to get accurate results for complex research questions, especially when you need to dig deep into a specific topic. Current search tools often miss important papers or give results that aren't quite what researchers need.

What's the solution?

The researchers created PaSa, an AI agent that acts like a research assistant. Powered by a large language model, PaSa decides on its own when to call search tools, reads through papers, follows their citations, and picks out the papers that are relevant to the question. They trained PaSa with reinforcement learning on a synthetic dataset called AutoScholarQuery, which contains 35k detailed research queries paired with matching papers drawn from top AI conference publications. They also built a benchmark called RealScholarQuery to see how well PaSa works on real-world questions.
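
To make the search, read, and follow-citations loop concrete, here is a minimal toy sketch in Python. It is not PaSa's implementation: the corpus, the `search` and `is_relevant` stand-ins, and the word-overlap rules are all hypothetical, and a real agent would use a search API and an LLM for these decisions.

```python
from collections import deque

# Hypothetical toy corpus: paper id -> (title, ids of papers it cites).
CORPUS = {
    "p1": ("Survey of paper search agents", ["p2", "p3"]),
    "p2": ("Reinforcement learning for search agents", ["p4"]),
    "p3": ("Protein folding benchmarks", []),
    "p4": ("Citation graphs for paper search", []),
}


def search(query):
    """Stand-in search tool: ids of papers whose title contains every query word."""
    words = set(query.lower().split())
    return [pid for pid, (title, _) in CORPUS.items()
            if words <= set(title.lower().split())]


def is_relevant(query, title):
    """Stand-in relevance judge (assumption): at least two shared words."""
    shared = set(query.lower().split()) & set(title.lower().split())
    return len(shared) >= 2


def find_papers(query, max_papers=10):
    """Search once, then expand through citations, keeping relevant papers."""
    queue = deque(search(query))
    seen, selected = set(), []
    while queue and len(seen) < max_papers:
        pid = queue.popleft()
        if pid in seen:
            continue
        seen.add(pid)
        title, references = CORPUS[pid]   # "read" the paper
        if is_relevant(query, title):     # "select" it if it matches
            selected.append(pid)
        queue.extend(references)          # "follow" its citations
    return selected


print(find_papers("paper search agents"))
```

In this toy run the search tool returns only `p1`, while `p2` and `p4` are reached by following citations. That is the kind of coverage a single keyword search would miss.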

Why does it matter?

This matters because it could make academic research faster and more thorough. On the RealScholarQuery benchmark, PaSa found relevant papers better than Google, Google Scholar, and other AI tools such as ChatGPT. This could help researchers find important work they might otherwise have missed, potentially leading to new discoveries or insights. It's especially useful for complex topics where finding the right papers is crucial but time-consuming. By making literature search easier, PaSa could speed up scientific progress in many fields.

Abstract

We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa's performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50. It also exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.
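
For readers unfamiliar with the metrics quoted above, the following Python sketch shows the standard definitions of recall@k and precision for a single query. It is an illustration under assumptions: the function names and the example paper ids are hypothetical, and the paper's exact evaluation protocol (for example, how scores are averaged across queries) may differ.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant papers that appear in the top-k ranked results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision(returned, relevant):
    """Fraction of the returned papers that are actually relevant."""
    returned = set(returned)
    if not returned:
        return 0.0
    return len(returned & set(relevant)) / len(returned)


# Hypothetical example: five ranked results, four ground-truth relevant papers.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "d", "f", "g"}

print(recall_at_k(ranked, relevant, 2))  # 0.25: only "a" is in the top 2
print(recall_at_k(ranked, relevant, 5))  # 0.5: "a" and "d" are in the top 5
print(precision(ranked, relevant))       # 0.4: 2 of the 5 returned are relevant
```

Recall rewards finding more of the relevant papers, while precision penalizes returning irrelevant ones, which is why the abstract reports both.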
