ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

Xirui Li, Ming Li, Derry Xu, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh, Tianyi Zhou

2026-04-21

Summary

This paper introduces a new system called ClawEnvKit that automatically creates environments for testing and training robots with claw-like grippers, and then uses this system to build a large benchmark called Auto-ClawEval.

What's the problem?

Currently, environments for testing claw-equipped robot agents are designed by hand, which takes a lot of time and effort and makes it difficult to cover many different scenarios. Testing is hard to scale because each environment must be manually built and then checked to make sure it is reasonable and useful.

What's the solution?

The researchers developed ClawEnvKit, a three-part system. First, a parser takes a description of the desired environment written in plain English and extracts structured parameters from it. Then, a generator automatically builds the environment from those parameters, including how the robot interacts with objects and how success is measured. Finally, a validator checks that the environment is possible to complete, sufficiently different from other environments, and logically consistent. Using this pipeline, they created Auto-ClawEval, a collection of 1,040 environments across 24 categories.
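The paper describes this parse → generate → validate flow at a high level but does not publish its code, so the sketch below is only an illustration of the three-stage structure; every name here (`EnvSpec`, `parse_description`, `generate_environment`, `validate_environment`, the tool names) is a hypothetical stand-in, not ClawEnvKit's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class EnvSpec:
    """Hypothetical structured environment produced by the generator stage."""
    category: str                    # task category (e.g. one of the 24 in Auto-ClawEval)
    task: str                        # task specification the agent must complete
    tools: list = field(default_factory=list)   # tool interface exposed to the agent
    scoring: dict = field(default_factory=dict) # scoring configuration

def parse_description(text: str) -> dict:
    # Stage 1 (parser): extract structured generation parameters from
    # natural language. A real system would use an LLM; this toy version
    # just keys off a keyword.
    category = "pick-and-place" if "pick" in text.lower() else "generic"
    return {"category": category, "goal": text.strip()}

def generate_environment(params: dict) -> EnvSpec:
    # Stage 2 (generator): produce the task specification, tool
    # interface, and scoring configuration.
    return EnvSpec(
        category=params["category"],
        task=f"Complete: {params['goal']}",
        tools=["open_claw", "close_claw", "move_to"],
        scoring={"success_metric": "goal_reached", "max_steps": 200},
    )

def validate_environment(env: EnvSpec) -> bool:
    # Stage 3 (validator): enforce structural validity and internal
    # consistency for a single environment (feasibility and diversity
    # would additionally be checked across a batch).
    return bool(env.task) and len(env.tools) > 0 and "success_metric" in env.scoring

request = "Pick up the red block and place it in the bin"
env = generate_environment(parse_description(request))
assert validate_environment(env)
print(env.category)  # → pick-and-place
```

The key design point mirrored here is that validation is a separate gate after generation, so unsatisfiable or malformed environments are filtered out before they reach the benchmark.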

Why it matters?

This work is important because it makes it much cheaper and faster to create environments for testing robots. It allows for testing at a scale that wasn't possible before, and even lets users request environments tailored to specific skills they want to evaluate. Furthermore, it can generate training environments that focus on where a robot is struggling, leading to more efficient learning.

Abstract

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.