AutoHarness: improving LLM agents by automatically synthesizing a code harness

Xinghua Lou, Miguel Lázaro-Gredilla, A. Dedieu, C. Wendelken, Wolfgang Lehrach, Kevin P. Murphy

Published 2026 in Unknown venue

ABSTRACT

Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games. Our results show that using a smaller model to synthesize a custom code harness (or entire policy) can outperform a much larger model, while also being more cost effective.
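The idea in the abstract (a legality-checking harness selected through rounds of refinement against environment feedback) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy tic-tac-toe rule in `env_is_legal`, the random `stub_model` standing in for an LLM policy, and the three hand-written `candidate_v*` functions standing in for harness code that a model might emit over successive refinement rounds. In the paper the candidates are synthesized by Gemini-2.5-Flash; here they are fixed so that the example runs on its own.

```python
import random
from typing import Callable, List, Optional, Sequence

Board = List[str]   # 9 cells, each "X", "O", or " " (toy assumption)
Action = int        # cell index; legal values are 0..8 on empty cells
LegalityCheck = Callable[[Board, Action], bool]


def env_is_legal(board: Board, action: Action) -> bool:
    """Ground-truth rule of the toy environment (hypothetical)."""
    return 0 <= action < 9 and board[action] == " "


def stub_model(board: Board, rng: random.Random) -> Action:
    """Stand-in for an LLM policy: may propose out-of-range or occupied cells."""
    return rng.randrange(-1, 10)


# Hypothetical harness candidates, ordered as successive refinement rounds.
def candidate_v1(board: Board, action: Action) -> bool:
    return True  # accepts everything, including illegal moves


def candidate_v2(board: Board, action: Action) -> bool:
    return 0 <= action < 9  # range check only, ignores occupied cells


def candidate_v3(board: Board, action: Action) -> bool:
    return 0 <= action < 9 and board[action] == " "


def count_mismatches(check: LegalityCheck, states: Sequence[Board]) -> int:
    """Environment feedback: how often the candidate disagrees with the rules."""
    return sum(
        check(board, action) != env_is_legal(board, action)
        for board in states
        for action in range(-1, 10)
    )


def refine(
    candidates: Sequence[LegalityCheck], states: Sequence[Board]
) -> Optional[LegalityCheck]:
    """Return the first candidate that matches the environment on all probes."""
    for check in candidates:
        if count_mismatches(check, states) == 0:
            return check
    return None


def harnessed_action(
    board: Board, check: LegalityCheck, rng: random.Random, max_tries: int = 20
) -> Action:
    """Query the model, rejecting proposals that the harness marks illegal."""
    for _ in range(max_tries):
        action = stub_model(board, rng)
        if check(board, action):
            return action
    # Fallback: first accepted cell (assumes the board is not full).
    return next(a for a in range(9) if check(board, a))


probe_states = [
    [" "] * 9,
    ["X", "O", " ", " ", "X", " ", " ", " ", "O"],
]
harness = refine([candidate_v1, candidate_v2, candidate_v3], probe_states)
```

In this sketch `refine` discards `candidate_v1` and `candidate_v2` because they disagree with the environment on at least one probed state, and `harnessed_action` then only returns moves that the selected check accepts. Rejection with a deterministic fallback is one simple design choice among several; the source does not say how its harness handles rejected proposals.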

PUBLICATION RECORD

Publication year
2026
Venue
Unknown venue
Publication date
2026-02-10
Fields of study
Computer Science
Identifiers
arXiv 2603.03329
Source metadata
Semantic Scholar

REFERENCES

Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
2025, cited by this paper
Code World Models for General Game Playing
2025, cited by this paper
Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline
2025, cited by this paper
AlphaEvolve: A coding agent for scientific and algorithmic discovery
2025, cited by this paper
ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning
2025, cited by this paper
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
2025, cited by this paper
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations
2024, cited by this paper
Code Repair with LLMs gives an Exploration-Exploitation Tradeoff
2024, influential reference
LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations
2024, cited by this paper
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
2023, cited by this paper
Eureka: Human-Level Reward Design via Coding Large Language Models
2023, cited by this paper
On the Planning Abilities of Large Language Models - A Critical Investigation
2023, cited by this paper
Voyager: An Open-Ended Embodied Agent with Large Language Models
2023, cited by this paper
Reflexion: language agents with verbal reinforcement learning
2023, cited by this paper
Chain of Thought Prompting Elicits Reasoning in Large Language Models
2022, cited by this paper
Competition-level code generation with AlphaCode
2022, cited by this paper
Code as Policies: Language Model Programs for Embodied Control
2022, cited by this paper
