2603.23994v1 Mar 25, 2026 cs.LG

Understanding the Challenges in Iterative Generative Optimization with LLMs

Adith Swaminathan

Citations: 3,933

h-index: 24

Allen Nie

Citations: 129

h-index: 5

Ching-An Cheng

Citations: 185

h-index: 6

Yucheng Yuan

Citations: 38

h-index: 4

Xavier Daull

Citations: 26

h-index: 2

Zhiyi Kuang

Citations: 245

h-index: 4

Abhinav Akkiraju

Citations: 4

h-index: 1

Anishaa Chaudhuri

Citations: 4

h-index: 1

Max Piasevoli

Citations: 4

h-index: 1

Ryan Rong

Citations: 13

h-index: 3

Prerit Choudhary

Citations: 30

h-index: 2

Shannon Xiao

Citations: 5

h-index: 1

Rasool Fakoor

Citations: 1,068

h-index: 15

Abstract

Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows, or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet in practice it remains brittle: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make "hidden" design choices: what can the optimizer edit, and what is the "right" learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and the batching of trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard (BBEH), we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.
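To make the three design choices named in the abstract concrete, the following is a minimal sketch of a learning loop, not the authors' implementation. Every name in it (`LoopConfig`, `truncate`, `optimize`, `toy_run`, `toy_propose`) is hypothetical, the artifact is reduced to a single integer, and `toy_propose` is a deterministic stand-in for what would be an LLM call in a real system.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

Trace = List[str]
Evidence = List[Tuple[float, Trace]]


@dataclass
class LoopConfig:
    """The three 'hidden' choices, made explicit (all names are illustrative)."""
    starting_artifact: int  # choice 1: where the optimizer starts
    credit_horizon: int     # choice 2: trailing trace steps shown per trial
    batch_size: int         # choice 3: trials bundled into one update
    num_updates: int = 10


def truncate(trace: Trace, horizon: int) -> Trace:
    """Keep only the last `horizon` steps of an execution trace."""
    return trace[-horizon:] if horizon > 0 else []


def optimize(
    config: LoopConfig,
    run: Callable[[int, int], Tuple[float, Trace]],
    propose: Callable[[int, Evidence], int],
    tasks: List[int],
    seed: int = 0,
) -> int:
    """Repeat: run a minibatch, gather truncated evidence, ask for a revision."""
    rng = random.Random(seed)
    artifact = config.starting_artifact
    for _ in range(config.num_updates):
        batch = rng.sample(tasks, min(config.batch_size, len(tasks)))
        evidence: Evidence = []
        for task in batch:
            score, trace = run(artifact, task)
            evidence.append((score, truncate(trace, config.credit_horizon)))
        artifact = propose(artifact, evidence)
    return artifact


def toy_run(artifact: int, target: int) -> Tuple[float, Trace]:
    """Toy 'execution': compare the artifact to a task-specific target."""
    if artifact < target:
        verdict = "too low"
    elif artifact > target:
        verdict = "too high"
    else:
        verdict = "ok"
    trace = [f"artifact={artifact}", f"target={target}", verdict]
    return -abs(artifact - target), trace


def toy_propose(artifact: int, evidence: Evidence) -> int:
    """Stand-in for the LLM optimizer: majority vote over the final trace step."""
    votes = 0
    for _score, trace in evidence:
        if trace and trace[-1] == "too low":
            votes += 1
        elif trace and trace[-1] == "too high":
            votes -= 1
    return artifact + (votes > 0) - (votes < 0)


config = LoopConfig(starting_artifact=0, credit_horizon=1, batch_size=2)
final_artifact = optimize(config, toy_run, toy_propose, tasks=[6, 7, 8])
print(final_artifact)
```

In this toy setting a credit horizon of one step is enough, because the last trace entry carries the whole signal; setting `credit_horizon=0` would leave `toy_propose` with no evidence and the artifact would never move. Real traces offer no such guarantee, which is one reason the choice has to be made per domain.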

5 Citations

1 Influential

12 Altmetric

67.0 Score
