Large Action Models: From Inception to Implementation
Lu Wang, Fangkai Yang, Chaoyun Zhang, Junting Lu, Jiaxu Qian, Shilin He, Pu Zhao, Bo Qiao, Ray Huang, Si Qin, Qisheng Su, Jiayi Ye, Yudi Zhang, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
2024-12-16

Summary
This paper presents a framework for building Large Action Models (LAMs), AI systems that interpret a user's intent and then generate and execute actions to complete real-world tasks.
What's the problem?
As AI technology progresses, there is a growing need for systems that do more than understand and generate text. Traditional Large Language Models (LLMs) excel at producing textual responses but cannot act on their own in an environment. This limits their usefulness in applications that require carrying a task through to completion rather than only describing how to do it.
What's the solution?
The authors propose a framework for developing LAMs that both understand requests and execute actions. They provide a step-by-step guide covering data collection, model training, environment integration, grounding, and evaluation. Using an agent built for the Windows operating system as a case study, they show how a LAM can be taken from inception to deployment. The paper also discusses the current limitations of LAMs and suggests directions for future research.
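To make the observe, act, and evaluate cycle concrete, here is a minimal sketch of an agent loop. It is not the paper's implementation: `ToyEditor`, `Action`, `stub_policy`, and `run_task` are hypothetical names invented for illustration, the environment is a toy text editor rather than Windows, and a hand-written rule stands in where a trained LAM would choose the next action.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Action:
    name: str      # which operation to perform, e.g. "type_text"
    argument: str  # payload for the operation


class ToyEditor:
    """Hypothetical stand-in for a real application environment."""

    def __init__(self) -> None:
        self.text = ""
        self.saved = False

    def observe(self) -> Dict[str, object]:
        # Expose the current state to the policy.
        return {"text": self.text, "saved": self.saved}

    def execute(self, action: Action) -> None:
        # Grounding: map an abstract action onto a concrete operation.
        if action.name == "type_text":
            self.text += action.argument
            self.saved = False
        elif action.name == "save":
            self.saved = True
        else:
            raise ValueError(f"unknown action: {action.name}")


def stub_policy(goal_text: str, observation: Dict[str, object]) -> Optional[Action]:
    """Rule-based placeholder (assumption) where a trained model would decide."""
    current = str(observation["text"])
    if current != goal_text:
        # Type whatever part of the goal text is still missing.
        return Action("type_text", goal_text[len(current):])
    if not observation["saved"]:
        return Action("save", "")
    return None  # goal reached, nothing left to do


def run_task(goal_text: str, max_steps: int = 5) -> bool:
    """Run the loop, then evaluate whether the task succeeded."""
    env = ToyEditor()
    for _ in range(max_steps):
        action = stub_policy(goal_text, env.observe())
        if action is None:
            break
        env.execute(action)
    state = env.observe()
    return state["text"] == goal_text and state["saved"] is True
```

Calling `run_task("hello")` types the text and then saves it, so the final check passes. With `max_steps=1` the loop stops before the save step and the task is reported as failed, which is the kind of task-level success measure an evaluation stage would track.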
Why does it matter?
This research is a step toward AI systems that can operate autonomously in real-world settings. If LAMs can reliably turn language into meaningful actions, they could streamline work in fields such as healthcare, logistics, and customer service, making processes more efficient and reducing the need for human intervention.
Abstract
As AI continues to advance, there is a growing demand for systems that go beyond language-based assistance and move toward intelligent agents capable of performing real-world actions. This evolution requires the transition from traditional Large Language Models (LLMs), which excel at generating textual responses, to Large Action Models (LAMs), designed for action generation and execution within dynamic environments. Enabled by agent systems, LAMs hold the potential to transform AI from passive language understanding to active task completion, marking a significant milestone in the progression toward artificial general intelligence. In this paper, we present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and delineating their differences from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation. This generalizable workflow can serve as a blueprint for creating functional LAMs in various application domains. We conclude by identifying the current limitations of LAMs and discussing directions for future research and industrial deployment, emphasizing the challenges and opportunities that lie ahead in realizing the full potential of LAMs in real-world applications. The code for the data collection process utilized in this paper is publicly available at: https://github.com/microsoft/UFO/tree/main/dataflow, and comprehensive documentation can be found at https://microsoft.github.io/UFO/dataflow/overview/.