Gistify! Codebase-Level Understanding via Runtime Execution

Hyunji Lee, Minseon Kim, Chinmay Singh, Matheus Pereira, Atharv Sonwane, Isadora White, Elias Stengel-Eskin, Mohit Bansal, Zhengyan Shi, Alessandro Sordoni, Marc-Alexandre Côté, Xingdi Yuan, Lucas Caccia

Published 2025 in arXiv.org

ABSTRACT

As coding agents are increasingly deployed in large codebases, automatically designing challenging, codebase-level evaluations has become a central need. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a Python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.
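The output-equivalence requirement described in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation harness: the helper names (`run_and_capture`, `outputs_match`, `build_toy_example`), the toy repository layout, and the choice to compare only exit code and standard output are all assumptions made for illustration, and the sketch does not check minimality of the generated file.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_capture(args, cwd):
    """Run a command in `cwd` and return (exit code, stdout)."""
    result = subprocess.run(
        args, cwd=cwd, capture_output=True, text=True, timeout=60
    )
    return result.returncode, result.stdout


def outputs_match(original_cmd, original_cwd, gist_cmd, gist_cwd):
    """True if the single-file candidate reproduces the original command's
    exit code and stdout (a simplified, assumed notion of 'same output')."""
    return run_and_capture(original_cmd, original_cwd) == run_and_capture(
        gist_cmd, gist_cwd
    )


def build_toy_example(root):
    """Create a hypothetical two-module 'codebase' and a one-file gist of it."""
    repo = root / "repo"
    (repo / "pkg").mkdir(parents=True)
    (repo / "pkg" / "__init__.py").write_text("")
    (repo / "pkg" / "mathutils.py").write_text(
        "def square(x):\n    return x * x\n\n"
        "def unused(x):\n    return x + 1\n"  # not needed by the entrypoint
    )
    (repo / "main.py").write_text(
        "from pkg.mathutils import square\nprint(square(7))\n"
    )
    gist_dir = root / "gist"
    gist_dir.mkdir()
    # The gist inlines only what the entrypoint actually executes.
    (gist_dir / "gist.py").write_text(
        "def square(x):\n    return x * x\n\nprint(square(7))\n"
    )
    return repo, gist_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        repo, gist_dir = build_toy_example(Path(tmp))
        ok = outputs_match(
            [sys.executable, "main.py"], repo,
            [sys.executable, "gist.py"], gist_dir,
        )
        print("gist reproduces entrypoint output:", ok)
```

In this toy case the gist drops the `unused` helper and the package structure while still printing the same result, which mirrors the "only the essential components" condition at a very small scale.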

PUBLICATION RECORD

Publication year: 2025
Venue: arXiv.org
Publication date: 2025-10-30
Fields of study: Computer Science
Identifiers: DOI 10.48550/arXiv.2510.26790; arXiv 2510.26790
Source metadata: Semantic Scholar

REFERENCES

debug-gym: A Text-Based Environment for Interactive Debugging
2025, cited by this paper
DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
2025, cited by this paper
RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code
2025, cited by this paper
CoReQA: Uncovering Potentials of Language Models in Code Repository Question Answering
2025, cited by this paper
RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing
2025, cited by this paper
CoCoNUT: Structural Code Understanding does not fall out of a tree
2025, cited by this paper
GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging
2025, cited by this paper
CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks
2025, cited by this paper
CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark
2025, cited by this paper
The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason
2025, cited by this paper
RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving
2025, cited by this paper
R2E: Turning any Github Repository into a Programming Agent Environment
2024, cited by this paper
On Improving Repository-Level Code QA for Large Language Models
2024, cited by this paper
SWE-Bench+: Enhanced Coding Benchmark for LLMs
2024, cited by this paper
RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph
2024, cited by this paper
CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering
2024, cited by this paper
SelfPiCo: Self-Guided Partial Code Execution with LLMs
2024, cited by this paper
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
2024, cited by this paper
RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation
2024, cited by this paper
Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step
2024, cited by this paper
Reasoning Runtime Behavior of a Program with LLM: How Far are We?
2024, cited by this paper
CodeS: Natural Language to Code Repository via Multi-Layer Sketch
2024, cited by this paper
EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories
2024, cited by this paper
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models
2024, cited by this paper
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
2024, cited by this paper
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
2024, cited by this paper
CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases
2024, cited by this paper
Code Execution with Pre-trained Language Models
2023, cited by this paper
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation
2023, cited by this paper
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
2023, cited by this paper
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
2023, cited by this paper
CodePlan: Repository-Level Coding using LLMs and Planning
2023, cited by this paper
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
2023, cited by this paper
RepoFusion: Training Code Models to Understand Your Repository
2023, cited by this paper
TRACED: Execution-Aware Pre-Training for Source Code
2023, cited by this paper
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
2023, cited by this paper
CodeQueries: A Dataset of Semantic Queries over Code
2022, cited by this paper
GitHub Copilot
2022, cited by this paper

CITED BY

CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents
2025, cites this paper