2601.10343v2 Jan 15, 2026 cs.CL

OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding

Xuanjing Huang (Citations: 3,341; h-index: 30)
Shihan Dou (Citations: 4,276; h-index: 26)
Tao Gui (Citations: 8,389; h-index: 42)
Qi Zhang (Citations: 1,723; h-index: 24)
Qunhong Zeng (Citations: 62; h-index: 4)
Deming Ding (Citations: 130; h-index: 2)
Shichun Liu (Citations: 821; h-index: 9)
Enhui Yang (Citations: 10; h-index: 2)
Jiahang Lin (Citations: 122; h-index: 2)
Ziying Chen (Citations: 73; h-index: 5)
Honglin Guo (Citations: 419; h-index: 9)
Weiyu Cheng (Citations: 308; h-index: 3)
Pengyu Zhao (Citations: 3; h-index: 1)
Chengjun Xiao (Citations: 130; h-index: 2)
Qidi Xu (Citations: 323; h-index: 4)

Original Abstract

Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.

4 Citations

0 Influential

21 Altmetric

109.0 Score
