2603.28052v1 Mar 30, 2026 cs.AI

Meta-Harness: End-to-End Optimization of Model Harnesses

Kangwook Lee

Citations: 68

h-index: 4

Yoonho Lee

Citations: 171

h-index: 7

Roshen Nair

Citations: 6

h-index: 1

Omar Khattab

Citations: 232

h-index: 5

Chelsea Finn

Citations: 192

h-index: 8

Qizheng Zhang

Stanford University

Citations: 1,233

h-index: 12

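The performance of large language model (LLM) systems depends not only on model weights but also on the harness: the code that decides what information to store, retrieve, and present to the model. Harnesses are still mostly designed by hand, however, and existing text optimizers suit this setting poorly because they compress feedback too aggressively. This paper introduces Meta-Harness, a system that searches over harness code for LLM applications. It uses an agentic proposer that reads the source code, scores, and execution traces of prior candidates through a filesystem. On online text classification, Meta-Harness scores 7.7 points higher than a state-of-the-art context management system while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by an average of 4.7 points across five held-out models. On agentic coding, the discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering.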

Original Abstract

The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering.
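
The outer loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the directory layout (`cand_000/harness.py`, `score.json`, `trace.log`), the function names, and the rule-based `toy_propose` are all assumptions made for illustration, and the real proposer is an LLM agent rather than a hand-written rule. The sketch only shows the idea of keeping every prior candidate's source, score, and raw trace on disk so the proposer can read them uncompressed.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def save_candidate(root: Path, idx: int, source: str, score: float, trace: str) -> None:
    """Persist one candidate's source, score, and raw execution trace (hypothetical layout)."""
    cand = root / f"cand_{idx:03d}"
    cand.mkdir(parents=True)
    (cand / "harness.py").write_text(source)
    (cand / "score.json").write_text(json.dumps({"score": score}))
    (cand / "trace.log").write_text(trace)


def load_history(root: Path) -> list:
    """Read every prior candidate back from the filesystem, without summarizing."""
    history = []
    for cand in sorted(root.glob("cand_*")):
        history.append({
            "name": cand.name,
            "source": (cand / "harness.py").read_text(),
            "score": json.loads((cand / "score.json").read_text())["score"],
            "trace": (cand / "trace.log").read_text(),
        })
    return history


def search(root: Path, propose: Callable, evaluate: Callable, seed_source: str, steps: int) -> dict:
    """Outer loop: evaluate a seed, then repeatedly propose, evaluate, and store."""
    score, trace = evaluate(seed_source)
    save_candidate(root, 0, seed_source, score, trace)
    for idx in range(1, steps + 1):
        source = propose(load_history(root))  # proposer sees the full history
        score, trace = evaluate(source)
        save_candidate(root, idx, source, score, trace)
    return max(load_history(root), key=lambda c: c["score"])


def read_k(source: str) -> int:
    """Toy 'harness': a one-line program that defines an integer K."""
    namespace = {}
    exec(source, namespace)
    return namespace["K"]


def toy_evaluate(source: str):
    """Toy task: the best harness sets K to 5; the trace says which way it missed."""
    k = read_k(source)
    trace = "K too small" if k < 5 else "K too large" if k > 5 else "ok"
    return float(-abs(k - 5)), trace


def toy_propose(history: list) -> str:
    """Stand-in for the agentic proposer: edit the best candidate using its trace."""
    best = max(history, key=lambda c: c["score"])
    k = read_k(best["source"])
    if "too small" in best["trace"]:
        k += 1
    elif "too large" in best["trace"]:
        k -= 1
    return f"K = {k}\n"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        best = search(Path(tmp), toy_propose, toy_evaluate, "K = 2\n", steps=4)
        print(best["name"], best["score"])
```

In this toy the proposer reads only the best candidate's trace; the design point the abstract emphasizes is that the full history stays available on disk, so a stronger proposer could inspect any earlier candidate instead of a compressed summary.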

7 Citations

0 Influential

6 Altmetric

37.0 Score
