HippoCamp: Benchmarking Contextual Agents on Personal Computers

Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang, Zujin Guo, Mengying Yu, Zinan Zhang, Jingkang Yang, Chen Change Loy, Ziwei Liu

2026-04-02

Summary

This paper introduces HippoCamp, a new way to test how well AI agents can manage and understand files like a person would, focusing on personal data and realistic scenarios.

What's the problem?

Current AI benchmarks mostly test agents on tasks like browsing the web or using simple tools. They rarely challenge agents with the messiness of a real person's files: many different types of documents, pictures, and videos spread across a computer. As a result, we don't know how well AI can actually help with tasks like finding specific information within a personal digital life, which requires understanding the context of those files and connecting information across different formats.

What's the solution?

The researchers created a large dataset called HippoCamp, containing over 2,000 real-world files totaling 42.4 GB and representing typical user profiles. They then wrote 581 question-answer pairs based on these files to test an agent's ability to search, perceive evidence, and reason through multiple steps. They also provided 46.1K densely annotated step-by-step trajectories, so that failures can be traced to the exact step where they occur. Finally, they tested several state-of-the-art AI models on this benchmark.
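To make the setup concrete, here is a minimal sketch of what one HippoCamp-style question-answer item and its step-wise trajectory annotation might look like. All field names and stage labels ("search", "perception", "reasoning") are illustrative assumptions based on the capabilities the paper says it tests, not the benchmark's actual schema.

```python
from dataclasses import dataclass

# Hypothetical QA item: a question grounded in the user's personal files.
# Field names are assumptions for illustration, not the real dataset format.
@dataclass
class QAItem:
    question: str
    answer: str
    evidence_files: list  # files the agent must locate to answer

# Hypothetical structured trajectory: an ordered list of (stage, succeeded)
# pairs, enabling the kind of step-wise failure diagnosis the paper describes.
@dataclass
class Trajectory:
    steps: list

def first_failure(traj: Trajectory):
    """Return the earliest stage where the agent failed, or None if all passed."""
    for stage, ok in traj.steps:
        if not ok:
            return stage
    return None

item = QAItem(
    question="Which city did the user visit in March, according to their photos?",
    answer="Kyoto",
    evidence_files=["Photos/2025-03/IMG_0042.jpg"],
)
traj = Trajectory(steps=[("search", True), ("perception", False), ("reasoning", False)])
print(first_failure(traj))  # -> perception
```

Attributing each miss to its first failed stage is what lets the authors aggregate errors into bottleneck categories (e.g., multimodal perception and evidence grounding) rather than just reporting a final accuracy number.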

Why it matters?

The results showed that even the best AI models struggle with this type of task: the top commercial models answered only 48.3% of the questions correctly. The biggest problems were perceiving the contents of different file types and grounding answers in the evidence found in those files. HippoCamp shows that current AI isn't ready to be a truly helpful personal assistant, and it provides a better way to develop and test AI for these kinds of real-world applications.

Abstract

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.