Video-Based Reward Modeling for Computer-Use Agents

Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul, Yang Liu, Ranjay Krishna, Jian Kang, Jieyu Zhao

2026-03-13

Summary

This paper focuses on how to automatically check whether an AI agent that operates a computer actually did what the user asked, just by watching a recording of the screen, without needing access to the agent's internal reasoning or actions.

What's the problem?

It's really hard to automatically and reliably evaluate whether an AI agent successfully completed a task. Current methods often require access to the agent's internal reasoning or action logs, which doesn't scale and ties the evaluator to one specific agent. Simply watching a video of the screen isn't enough by default either: the frames are highly repetitive, and success often hinges on small, localized details in the user interface.

What's the solution?

The researchers created a large dataset called ExeVR-53k, containing 53,000 triplets that each pair a video of a program being operated with the original user instruction and a label for whether the task was completed successfully; hard negative examples were produced by adversarially rewriting instructions so they no longer match what happens in the video. They also developed a pruning technique that makes long, high-resolution videos easier for the AI to analyze by removing flat, unchanging regions while keeping the important changes in the user interface. Finally, they trained an AI model, ExeVRM, to predict task success just by watching the execution video and reading the user instruction.
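To make the pruning idea concrete, here is a minimal sketch of the intuition behind it: split each keyframe into patches, then drop patches that are homogeneous (flat regions with almost no pixel variation) or persistent (nearly identical to the previous keyframe), keeping only patches where the interface actually changed. The patch size and thresholds below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def prune_keyframes(frames, change_thresh=0.02, var_thresh=1e-4):
    """Mark which patches of each keyframe to keep.

    frames: list of (H, W) grayscale arrays in [0, 1], with H and W
    divisible by the patch size. Returns one boolean mask per frame over
    a 16x16-patch grid; True means the patch is kept as a token.
    """
    patch = 16
    masks = []
    prev = None
    for frame in frames:
        h, w = frame.shape
        # Split the frame into a grid of non-overlapping patches.
        grid = frame.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
        # Homogeneous regions: patches with near-zero pixel variance.
        homogeneous = grid.var(axis=(2, 3)) < var_thresh
        if prev is None:
            # First frame: nothing is persistent yet.
            persistent = np.zeros_like(homogeneous)
        else:
            # Persistent tokens: patches nearly identical to the last keyframe.
            prev_grid = prev.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
            persistent = np.abs(grid - prev_grid).mean(axis=(2, 3)) < change_thresh
        # Keep only patches that are neither flat nor unchanged.
        masks.append(~(homogeneous | persistent))
        prev = frame
    return masks
```

On a real screen recording, most patches of consecutive keyframes are identical (static menus, backgrounds), so a filter like this discards the bulk of the tokens while preserving the small region where, say, a dialog box appeared.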

Why it matters?

This work is important because it provides a way to automatically evaluate AI agents that doesn't depend on the specific agent being used. The 8-billion-parameter ExeVRM reaches 84.7% accuracy and 87.7% recall, outperforming some of the most advanced proprietary models from companies like Google and OpenAI at determining whether a task was completed correctly, offering a scalable and reliable method for testing and improving AI agents.

Abstract

Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video--task--reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.