Expanding the Action Space of LLMs to Reason Beyond Language
Zhongqi Yue, Weishi Wang, Yundaichuan Zhan, Juncheng Li, Daniel Dahlmeier, Fredrik D. Johansson
2025-10-22
Summary
This paper explores a way to make large language models, which are good at understanding and generating text, better at interacting with the real world and solving complex problems that require more than just language skills.
What's the problem?
Currently, when an LLM needs to do something beyond writing text, like using a calculator or controlling a game, it has to express the action *as* text in a predefined format, which a separate parser then interprets and routes to the right tool. This is clunky: the model's language output is overloaded with both reasoning and control duties, and the whole system depends on a hand-crafted parser outside the LLM. That limits what the LLM can do and makes it harder to build systems that handle complex tasks.
What's the solution?
The researchers came up with a system called ExpA (Expanded Action space) that gives the LLM a wider range of actions it can take *directly*, beyond just choosing words. Think of it like giving the LLM buttons to push instead of making it describe pushing the buttons. The LLM can switch between using language and using these direct actions, and a new learning method called EARL (ExpA Reinforcement Learning) helps the LLM learn how to best use these actions to solve problems. EARL uses a technique that imagines what *would* have happened if the LLM had chosen a different action, helping it learn more efficiently.
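To make the environment-switching idea concrete, here is a minimal toy sketch. It is not the paper's implementation; all names (`Env`, `ROUTE_TO_CALC`, `legal_actions`, the tiny vocabulary and calculator actions) are hypothetical illustrations. The point it shows: the action space is the vocabulary plus extra routing and environment-specific actions, and the current environment masks which actions are legal at each step.

```python
# Toy sketch of an expanded action space (ExpA-style), assuming a
# two-environment setup: language vs. a calculator. Hypothetical names only.
from enum import Enum, auto

class Env(Enum):
    LANGUAGE = auto()
    CALCULATOR = auto()

# Action inventory: ordinary vocabulary tokens plus non-vocabulary actions.
VOCAB = ["the", "answer", "is", "<eos>"]
ROUTING = {"ROUTE_TO_CALC": Env.CALCULATOR, "ROUTE_TO_LANG": Env.LANGUAGE}
CALC_ACTIONS = ["PUSH_1", "PUSH_2", "ADD", "READ_RESULT"]

def legal_actions(env: Env) -> list[str]:
    """Mask the action space by environment: in language mode the model may
    emit tokens or route out; in calculator mode it may only use calculator
    actions or route back to language."""
    if env is Env.LANGUAGE:
        return VOCAB + ["ROUTE_TO_CALC"]
    return CALC_ACTIONS + ["ROUTE_TO_LANG"]

def step(env: Env, action: str) -> Env:
    """Apply a routing action (if any); other actions keep the environment."""
    return ROUTING.get(action, env)

# A hand-scripted trajectory standing in for a learned policy:
env = Env.LANGUAGE
trace = []
for a in ["the", "answer", "ROUTE_TO_CALC", "PUSH_1", "PUSH_2", "ADD",
          "READ_RESULT", "ROUTE_TO_LANG", "is", "<eos>"]:
    assert a in legal_actions(env), f"{a} is illegal in {env.name}"
    trace.append((env.name, a))
    env = step(env, a)
```

In the paper's actual system these actions live inside the model's output space and are trained with EARL, rather than being hand-scripted as above; the sketch only illustrates the masking and routing structure.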
Why it matters?
This work is important because it allows LLMs to interact with environments more naturally and effectively. By separating the reasoning process from the action-taking process, the model can focus on planning, which improves performance on tasks that require multiple steps and contingent decisions. Notably, on a partially observed sorting task the model reaches perfect Sort-4 accuracy while discovering an efficient sorting algorithm on its own, competitive with classical human-designed ones.
Abstract
Large Language Models (LLMs) are powerful reasoners in natural language, but their actions are typically confined to outputting vocabulary tokens. As a result, interactions with external environments -- such as symbolic operators or simulators -- must be expressed through text in predefined formats, parsed, and routed to external interfaces. This overloads the model's language with both reasoning and control duties, and requires a hand-crafted parser, external to the LLM. To address this, we decouple environment interactions from language by internalizing them in an Expanded Action space (ExpA), beyond the vocabulary. The model starts reasoning in the default language environment, but may trigger routing actions and switch to an external environment at any time. From there, the model can only invoke environment-specific actions, receive feedback from the environment, and potentially route back to language as a result. To promote effective exploration of the expanded action space and new environments, we introduce ExpA Reinforcement Learning (EARL) with counterfactual policy optimization. On tasks requiring multi-turn interactions and contingent planning, EARL outperforms strong baselines with vocabulary-constrained actions. It performs robustly across calculator-based multi-task learning and, in the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.