PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
Zixin Zhang, Kanghao Chen, Xingwang Lin, Lutao Jiang, Xu Zheng, Yuanhuiyi Lyu, Litao Guo, Yinchuan Li, Ying-Cong Chen
2025-10-13
Summary
This paper investigates how well current AI models, specifically those that can process both images and text, actually *understand* physical tools like hammers or screwdrivers, going beyond just recognizing what they are.
What's the problem?
While AI models are getting good at planning and acting in the real world, it's unclear if they truly grasp how tools work – not just what they're called or what they're used for in a general sense, but the underlying physics and principles. Existing tests don't specifically focus on this kind of tool understanding, so it's hard to know if these AI systems can really use tools effectively or if they're just relying on memorized information.
What's the solution?
The researchers created a new test called PhysToolBench. This test uses over 1,000 images paired with questions that assess three levels of tool knowledge: recognizing what a tool is, understanding *how* it works, and even figuring out how to make a tool from other objects if a standard one isn't available. They then tested 32 different AI models on this benchmark to see how they performed.
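To make the setup concrete, the scoring of such a VQA-style benchmark can be sketched as below. This is a minimal illustration only: the item fields, the per-level grouping, and the stand-in `dummy_model` are assumptions for demonstration, not the paper's actual data format or evaluation code.

```python
# Hypothetical sketch: scoring a VQA benchmark with per-level accuracy,
# mirroring PhysToolBench's three difficulty levels (recognition,
# understanding, creation). Items and the model are illustrative dummies.
from collections import defaultdict

items = [
    {"level": "recognition", "question": "Which tool drives a nail?", "answer": "hammer"},
    {"level": "understanding", "question": "Why does a longer handle help?", "answer": "more leverage"},
    {"level": "creation", "question": "No hammer available: what nearby object could substitute?", "answer": "rock"},
]

def dummy_model(question):
    # Stand-in for an MLLM call (a real run would also pass the image).
    return "hammer"

def evaluate(model, items):
    # Tally correct answers separately for each difficulty level.
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item["level"]] += 1
        if model(item["question"]) == item["answer"]:
            correct[item["level"]] += 1
    return {level: correct[level] / total[level] for level in total}

scores = evaluate(dummy_model, items)
```

Reporting accuracy per level, rather than one aggregate number, is what lets a benchmark like this show that models which pass recognition can still fail at understanding or creation.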
Why it matters?
This research is important because truly intelligent AI needs to be able to use tools, just like humans do. If AI can't understand tools, it limits its ability to interact with and manipulate the physical world. This work highlights a weakness in current AI models and provides a way to measure progress in improving their understanding of physics and tools, which is crucial for building more versatile and capable AI systems.
Abstract
The ability to use, understand, and create tools is a hallmark of human intelligence, enabling sophisticated interaction with the physical world. For any general-purpose intelligent agent to achieve true versatility, it must also master these fundamental skills. While modern Multimodal Large Language Models (MLLMs) leverage their extensive common knowledge for high-level planning in embodied AI and in downstream Vision-Language-Action (VLA) models, the extent of their true understanding of physical tools remains unquantified. To bridge this gap, we present PhysToolBench, the first benchmark dedicated to evaluating the comprehension of physical tools by MLLMs. Our benchmark is structured as a Visual Question Answering (VQA) dataset comprising over 1,000 image-text pairs. It assesses capabilities across three distinct difficulty levels: (1) Tool Recognition: Requiring the recognition of a tool's primary function. (2) Tool Understanding: Testing the ability to grasp the underlying principles of a tool's operation. (3) Tool Creation: Challenging the model to fashion a new tool from surrounding objects when conventional options are unavailable. Our comprehensive evaluation of 32 MLLMs (spanning proprietary, open-source, and specialized embodied models, as well as the backbones of VLAs) reveals a significant deficiency in tool understanding. Furthermore, we provide an in-depth analysis and propose preliminary solutions. Code and dataset are publicly available.