Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning
Egor Cherepanov, Nikita Kachaev, Alexey K. Kovalev, Aleksandr I. Panov
2025-02-18

Summary
This paper introduces MIKASA (Memory-Intensive Skills Assessment Suite for Agents), a benchmark designed to evaluate how well reinforcement learning (RL) agents can use memory to solve complex tasks. Think of it as a standardized memory exam for AI agents: it checks whether they can remember earlier observations and use that information later, including in tabletop robotic manipulation.
What's the problem?
Many tasks have temporal and spatial dependencies: an agent has to remember what it saw or did earlier in order to act well now. Although many RL algorithms include some form of memory, the field lacks a universal benchmark for measuring an agent's memory capabilities across diverse scenarios. The gap is especially clear in tabletop robotic manipulation, where tasks are often only partially observable and memory is essential for robust performance, yet no standardized benchmarks exist.
What's the solution?
The researchers created MIKASA, which makes three contributions. First, they propose a classification framework for memory-intensive RL tasks. Second, they collect MIKASA-Base, a unified benchmark that enables systematic evaluation of memory-enhanced agents across diverse scenarios. Third, they develop MIKASA-Robo, a suite of 32 carefully designed memory-intensive tasks that assess memory capabilities in tabletop robotic manipulation.
Why it matters?
This matters because a shared, rigorous benchmark makes it possible to compare memory mechanisms fairly and to see where current agents fall short when tasks demand remembering past observations. By covering both general RL scenarios and robotic manipulation, MIKASA provides a unified framework for advancing memory RL research, which in turn supports the development of more reliable agents for real-world applications such as household and industrial robotics.
Abstract
Memory is crucial for enabling agents to tackle complex tasks with temporal and spatial dependencies. While many reinforcement learning (RL) algorithms incorporate memory, the field lacks a universal benchmark to assess an agent's memory capabilities across diverse scenarios. This gap is particularly evident in tabletop robotic manipulation, where memory is essential for solving tasks with partial observability and ensuring robust performance, yet no standardized benchmarks exist. To address this, we introduce MIKASA (Memory-Intensive Skills Assessment Suite for Agents), a comprehensive benchmark for memory RL, with three key contributions: (1) we propose a comprehensive classification framework for memory-intensive RL tasks, (2) we collect MIKASA-Base - a unified benchmark that enables systematic evaluation of memory-enhanced agents across diverse scenarios, and (3) we develop MIKASA-Robo - a novel benchmark of 32 carefully designed memory-intensive tasks that assess memory capabilities in tabletop robotic manipulation. Our contributions establish a unified framework for advancing memory RL research, driving the development of more reliable systems for real-world applications. The code is available at https://sites.google.com/view/memorybenchrobots/.
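To make the notion of a memory-intensive task concrete, below is a minimal sketch, not taken from MIKASA, of how partial observability forces an agent to rely on memory. A Gymnasium wrapper masks the observation after the first step, so a purely reactive policy loses the information it needs, while a memory-equipped (e.g., recurrent) policy can retain it. The CartPole environment and the zero-masking scheme are illustrative assumptions, not part of the benchmark.

```python
# Illustrative sketch only: this is NOT MIKASA code. It shows how masking
# observations turns a task into a memory problem under partial observability.
import gymnasium as gym
import numpy as np


class HideAfterReset(gym.Wrapper):
    """Reveal the true observation only at reset; afterwards return zeros,
    so the agent must remember what it saw initially to act well later."""

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return obs, info  # the full observation is visible only here

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        masked = np.zeros_like(obs)  # hide the state from a reactive policy
        return masked, reward, terminated, truncated, info


# CartPole is used purely as a stand-in environment for the example.
env = HideAfterReset(gym.make("CartPole-v1"))
obs, info = env.reset(seed=0)
print("visible at reset:", obs)
for _ in range(3):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    print("after masking:", obs)  # all zeros: only memory can carry the state
    if terminated or truncated:
        obs, info = env.reset()
```

An agent with a recurrent state (e.g., an LSTM or transformer policy) can encode the reset observation and act on it later, whereas a feedforward policy sees only zeros; MIKASA-Robo scales this idea up to realistic tabletop manipulation tasks.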