Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

T. J. Barton, Chris Constantakis, Patti Hauseman, Annie Mous, Alaska Hoffman, Brian Bergeron, Hunter Goodreau

2026-04-30

Summary

This paper investigates whether autonomous agents powered by large language models can reliably manage real money in financial markets. The authors studied a system called DX Terminal Pro, in which users configured vaults and wrote natural-language strategies while the agents autonomously traded real cryptocurrency (ETH) on their behalf.

What's the problem?

Building AI agents to handle real money is difficult because a powerful language model on its own is not enough. The authors found that these agents often made mistakes that standard text-only benchmarks do not catch, such as inventing their own trading rules, becoming paralyzed by transaction fees, or misreading how particular tokens work (their tokenomics). Errors like these could cause significant financial losses for users.

What's the solution?

Rather than focusing on the language model alone, the researchers built an operating layer around it with several safeguards: carefully compiling the prompts the model receives, applying typed controls that constrain which actions the agent can take, validating each proposed action before it executes, and closely monitoring everything the agent does through trace-level observability. Extensive pre-launch testing surfaced the failure modes above, and targeted changes to this harness sharply reduced how often they occurred.
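To make the idea of "validating actions before they happen" concrete, here is a minimal sketch of a pre-execution policy check. All names (`VaultPolicy`, `ProposedTrade`, `validate_trade`) and the specific rules are illustrative assumptions, not the paper's actual implementation: the point is only that the operating layer, not the model, decides whether a proposed trade is allowed to reach the chain.

```python
from dataclasses import dataclass

# Hypothetical typed controls a user might set on a vault.
@dataclass
class VaultPolicy:
    max_trade_eth: float      # per-trade size cap
    allowed_tokens: set[str]  # tokens the agent may trade
    min_reserve_eth: float    # ETH that must remain in the vault

# Hypothetical shape of an action the agent proposes.
@dataclass
class ProposedTrade:
    side: str        # "buy" or "sell"
    token: str
    size_eth: float

def validate_trade(trade: ProposedTrade,
                   policy: VaultPolicy,
                   vault_balance_eth: float) -> tuple[bool, str]:
    """Reject any agent-proposed trade that violates the vault's controls.

    Returns (ok, reason); only trades with ok=True would be submitted.
    """
    if trade.side not in ("buy", "sell"):
        return False, f"unknown action side: {trade.side}"
    if trade.token not in policy.allowed_tokens:
        return False, f"token {trade.token} not permitted by mandate"
    if trade.size_eth > policy.max_trade_eth:
        return False, "trade exceeds per-trade size cap"
    if trade.side == "buy" and vault_balance_eth - trade.size_eth < policy.min_reserve_eth:
        return False, "trade would breach minimum ETH reserve"
    return True, "ok"
```

A trade that passes every check is forwarded for execution; any other proposal is rejected with a reason, so a model that "invents" a rule or over-sizes a position simply cannot act on it.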

Why it matters?

This work shows that creating trustworthy AI agents for financial tasks requires more than a smart model: it requires a robust operating layer with built-in safety checks and thorough testing. It also highlights the importance of evaluating such agents not just on their reasoning ability but on the entire path from understanding the user's mandate to executing the trade and settling the transaction.

Abstract

We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.