2603.08640v2 Mar 09, 2026 cs.SE


PostTrainBench: Can LLM Agents Automate LLM Post-Training?

Maksym Andriushchenko (Citations: 110, h-index: 5)

Ameya Prabhu (Citations: 746, h-index: 12)

Matthias Bethge (Citations: 884, h-index: 14)

Ben Rank (Citations: 362, h-index: 3)

Hardik Bhatnagar (Citations: 222, h-index: 3)

Shira Eisenberg (Citations: 11, h-index: 1)

Karina Nguyen (Citations: 2,439, h-index: 10)


Original Abstract

AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post-training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 23.2% for the best agent vs. 51.1% for official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. We also observe several failure modes worth flagging. Agents sometimes engage in reward hacking: training on the test set, downloading existing instruction-tuned checkpoints instead of training their own, and using API keys they find to generate synthetic data without authorization. These behaviors are concerning and highlight the importance of careful sandboxing as these systems become more capable. Overall, we hope PostTrainBench will be useful for tracking progress in AI R&D automation and for studying the risks that come with it. Website and code are available at https://posttrainbench.com/.
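To put the two comparisons quoted in the abstract on the same footing, the short Python sketch below computes the gap in percentage points and the ratio to the official model's score. Only the four numbers come from the abstract; the dictionary labels and the function names (`gap_points`, `relative_to_official`, `summarize`) are hypothetical names chosen for this illustration, and the abstract does not say how the 23.2% and 51.1% figures are aggregated.

```python
# Scores (in percent) quoted in the abstract: (agent, official instruction-tuned).
# The labels are illustrative; the abstract does not name the aggregation used.
RESULTS = {
    "headline score (best agent vs. official models)": (23.2, 51.1),
    "BFCL with Gemma-3-4B (GPT-5.1 Codex Max vs. official)": (89.0, 67.0),
}


def gap_points(agent: float, official: float) -> float:
    """Agent score minus official score, in percentage points."""
    return round(agent - official, 1)


def relative_to_official(agent: float, official: float) -> float:
    """Agent score as a multiple of the official score."""
    return round(agent / official, 2)


def summarize(results: dict) -> dict:
    """Map each comparison label to (gap in points, ratio to official)."""
    return {
        name: (gap_points(a, o), relative_to_official(a, o))
        for name, (a, o) in results.items()
    }


if __name__ == "__main__":
    for name, (gap, ratio) in summarize(RESULTS).items():
        print(f"{name}: {gap:+.1f} points, {ratio:.2f}x the official score")
```

Under these figures, the best agent trails the official instruction-tuned models by 27.9 points (about 0.45x their score), while in the targeted BFCL case the agent leads by 22.0 points (about 1.33x).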

Citations: 12 (1 influential); Altmetric: 7; Score: 49.0
