TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents

Kaijie Zhu, Yuzhou Nie, Yijiang Li, Yiming Huang, Jialian Wu, Jiang Liu, Ximeng Sun, Zhenfei Yin, Lun Wang, Zicheng Liu, E. Barsoum, W. Wang, Wenbo Guo

Published 2026 in Unknown venue

ABSTRACT

Executing complex terminal tasks remains a significant challenge for open-weight LLMs, constrained by two fundamental limitations. First, high-fidelity, executable training environments are scarce: environments synthesized from real-world repositories are neither diverse nor scalable, while trajectories synthesized by LLMs suffer from hallucinations. Second, standard instruction tuning uses expert trajectories that rarely exhibit the simple mistakes common to smaller models. This creates a distributional mismatch, leaving student models ill-equipped to recover from their own runtime failures. To bridge these gaps, we introduce TermiGen, an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. TermiGen first generates functionally valid tasks and Docker containers via an iterative multi-agent refinement loop. Subsequently, we employ a Generator-Critic protocol that actively injects errors during trajectory collection, synthesizing data rich in error-correction cycles. Fine-tuned on this TermiGen-generated dataset, our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench. This establishes a new open-weights state-of-the-art, outperforming existing baselines and notably surpassing capable proprietary models such as o4-mini. The dataset is available at https://github.com/ucsb-mlsec/terminal-bench-env.
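
The error-injection idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names ToyShell, collect_trajectory, and typo, the in-memory "shell", and the fixed injection schedule are all hypothetical stand-ins, assumed here only to show what a trajectory containing an error-correction cycle might look like.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    command: str
    observation: str
    injected_error: bool = False


@dataclass
class ToyShell:
    """Hypothetical stand-in for a containerized terminal environment."""
    files: set = field(default_factory=set)

    def run(self, command: str) -> str:
        parts = command.split()
        if parts[:1] == ["touch"] and len(parts) == 2:
            self.files.add(parts[1])
            return "ok"
        if parts[:1] == ["cat"] and len(parts) == 2:
            if parts[1] in self.files:
                return "ok"
            return f"cat: {parts[1]}: No such file or directory"
        return f"sh: {parts[0] if parts else ''}: command not found"


def collect_trajectory(expert_commands, corrupt, inject_at):
    """Replay expert commands, injecting a mistake before chosen steps.

    Each injected mistake is executed and its failing observation recorded;
    the correct expert command then follows as the recovery action.
    """
    shell, trajectory = ToyShell(), []
    for i, command in enumerate(expert_commands):
        if i in inject_at:
            bad = corrupt(command)
            trajectory.append(Step(bad, shell.run(bad), injected_error=True))
        trajectory.append(Step(command, shell.run(command)))
    return trajectory


def typo(command: str) -> str:
    # Hypothetical error model: misspell the program name.
    return "x" + command


traj = collect_trajectory(["touch notes.txt", "cat notes.txt"], typo, inject_at={1})
for step in traj:
    print(step.injected_error, step.command, "->", step.observation)
```

In this sketch the recorded trajectory contains the failed command, its error output, and the subsequent correct command, so a model trained on it would see a recovery pattern rather than only clean expert actions. A real pipeline would replace the toy shell with actual containers and the fixed schedule with a learned or prompted critic.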

PUBLICATION RECORD

Publication year
2026
Venue
Unknown venue
Publication date
2026-02-06
Fields of study
Computer Science
Identifiers
arXiv 2602.07274
Source metadata
Semantic Scholar

REFERENCES

DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle (2026)
Scaling Agent Learning via Experience Synthesis (2025)
SWE-smith: Scaling Data for Software Engineering Agents (2025)
LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training (2025)
Kimi K2: Open Agentic Intelligence (2025)
Qwen3 Technical Report (2025, influential reference)
Simulating Environments with Reasoning Models for Agent Training (2025)
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement (2024)
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (2024)
SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents (2024)
AgentInstruct: Toward Generative Teaching with Agentic Flows (2024)
Large Language Model-Based Agents for Software Engineering: A Survey (2024)
Qwen2.5-Coder Technical Report (2024)
Self-Refine: Iterative Refinement with Self-Feedback (2023)
Reflexion: language agents with verbal reinforcement learning (2023)
FireAct: Toward Language Agent Fine-tuning (2023)
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (2023)
AgentTuning: Enabling Generalized Agent Abilities for LLMs (2023)
Competition-Level Problems are Effective LLM Evaluators (2023)
WebArena: A Realistic Web Environment for Building Autonomous Agents (2023)
ReAct: Synergizing Reasoning and Acting in Language Models (2022)
Training Verifiers to Solve Math Word Problems (2021)
Evaluating Large Language Models Trained on Code (2021)
Terminus (2017)
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (2010)
