On Data Engineering for Scaling LLM Terminal Capabilities
Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping
2026-02-25
Summary
This paper focuses on how to train large language models to use a computer's command line, also known as the 'terminal', effectively. It addresses the fact that the methods used to create the training data for these models are usually kept secret.
What's the problem?
Currently, it's difficult for researchers to improve 'terminal agents' – AI that can interact with computers through text commands – because the data used to train the best performing agents isn't publicly available. This makes it hard to understand what makes these agents work well and to build even better ones. Essentially, we don't know *how* the best agents are learning to use the terminal.
What's the solution?
The researchers created a system called Terminal-Task-Gen to automatically generate a large amount of training data for terminal agents. This system can create tasks based on specific starting points ('seed-based') or desired skills ('skill-based'). They then used this system to build a large, publicly available dataset called Terminal-Corpus. Finally, they trained a new family of models, called Nemotron-Terminal, using this dataset and showed they perform very well on standard tests, even compared to much larger models.
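To make the two construction modes concrete, here is a purely illustrative sketch of a generator that dispatches on mode. The function name, seed environments, and skill descriptions below are invented for illustration; they are not taken from the actual Terminal-Task-Gen pipeline, whose details are in the paper and released dataset.

```python
import random

# Hypothetical illustration of the two task-construction modes described
# above. SEEDS/SKILLS and make_task are invented names, not part of the
# real Terminal-Task-Gen pipeline.

SEEDS = [
    "a git repo with an uncommitted merge conflict",
    "a log directory containing gzipped access logs",
]

SKILLS = [
    "process management with ps/kill",
    "text extraction with grep and awk",
]

def make_task(mode: str, rng: random.Random) -> dict:
    """Build one synthetic terminal-task specification.

    'seed'  mode starts from a concrete environment state;
    'skill' mode starts from a target capability to exercise.
    """
    if mode == "seed":
        seed = rng.choice(SEEDS)
        return {
            "mode": "seed",
            "environment": seed,
            "instruction": f"Resolve the situation in: {seed}",
        }
    elif mode == "skill":
        skill = rng.choice(SKILLS)
        return {
            "mode": "skill",
            "target_skill": skill,
            "instruction": f"Write a task that requires: {skill}",
        }
    raise ValueError(f"unknown mode: {mode}")

rng = random.Random(0)
corpus = [make_task(m, rng) for m in ("seed", "skill") for _ in range(2)]
print(len(corpus), corpus[0]["mode"])
```

In a real pipeline each specification would then be turned into an executable environment plus a verifiable goal; this sketch only shows the dispatch between the two starting points.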
Why does it matter?
This work is important because it opens up research in this area. By releasing both the training data and the models, other researchers can now build upon this work and develop even more capable AI agents that can automate tasks and interact with computers in a more intelligent way. It democratizes access to the technology and accelerates progress in the field.
Abstract
Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.