The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

Daocheng Fu, Jianbiao Mei, Rong Wu, Xuemeng Yang, Jia Xu, Ding Wang, Pinlong Cai, Yong Liu, Licheng Wen, Botian Shi

2026-01-14

Summary

This paper examines how well current AI systems, specifically those that can understand both text and images, perform in constantly changing, real-world situations. It points out that most testing focuses on how *good* these systems can be at their best, not how reliably they work when conditions are unpredictable.

What's the problem?

Existing AI models are really good at specific tasks in controlled settings, but they struggle when faced with a continuous stream of new and varying requests. The paper identifies three main issues: figuring out what order to tackle tasks in when priorities shift, actively seeking out the information needed to avoid making things up (hallucinations), and continuously improving from experience as new situations arise. Basically, they aren't very adaptable or reliable in dynamic environments.

What's the solution?

To address this, the researchers created a simulated environment called EvoEnv. It casts the agent as a 'trainee' on its first day at a new workplace, constantly encountering new tasks and expected to learn over time. EvoEnv tests AI agents on three key abilities: managing tasks as they stream in with changing priorities, intelligently asking for more information when unsure, and continuously learning and improving from the tasks it's given. The researchers then evaluated some of the most advanced AI models in this environment.
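To make the first of those abilities concrete, here is a minimal sketch of what priority-aware scheduling of streaming tasks can look like. This is an illustrative toy, not code from EvoEnv: the `StreamingScheduler` class and its method names are hypothetical, and it only shows the core idea that a waiting task's priority can shift after submission and the scheduler must still serve the most important task next.

```python
import heapq
import itertools

class StreamingScheduler:
    """Toy scheduler: serve the highest-priority pending task,
    allowing priorities to be revised while tasks are waiting."""

    def __init__(self):
        self._heap = []                 # entries: (-priority, seq, task_id)
        self._priority = {}             # task_id -> current priority
        self._seq = itertools.count()   # tie-breaker: FIFO among equal priorities

    def submit(self, task_id, priority):
        """Add a new task, or revise the priority of one still waiting."""
        self._priority[task_id] = priority
        heapq.heappush(self._heap, (-priority, next(self._seq), task_id))

    def next_task(self):
        """Pop the highest-priority task; entries made stale by a
        later priority revision are skipped."""
        while self._heap:
            neg_p, _, task_id = heapq.heappop(self._heap)
            if self._priority.get(task_id) == -neg_p:
                del self._priority[task_id]
                return task_id
        return None

sched = StreamingScheduler()
sched.submit("write-report", priority=1)
sched.submit("answer-email", priority=2)
sched.submit("write-report", priority=5)   # priority shifts while waiting
first = sched.next_task()                  # "write-report" now outranks the email
```

The stale-entry trick (push a fresh heap entry on every revision, discard entries whose recorded priority no longer matches) is a standard priority-queue pattern; a real agent additionally has to *decide* the priorities from context, which is the hard part EvoEnv measures.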

Why it matters?

This work is important because it shows that even the best AI models have significant weaknesses when deployed in realistic, ever-changing scenarios. It provides a new way to test AI systems, moving beyond simple benchmarks to evaluate their reliability and ability to learn and adapt, which is crucial for using them in real-world applications like automation.

Abstract

The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce EvoEnv, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, EvoEnv evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our code is available at https://github.com/KnowledgeXLab/EvoEnv.