2604.19341v1 Apr 21, 2026 cs.LG

Evaluation-driven Scaling for Scientific Discovery

Stefano Ermon (Citations: 89,213; h-index: 102)
Xiaowen Chu (Citations: 802; h-index: 17)
Yizhen Luo (Citations: 718; h-index: 5)
Jingyi Tang (Citations: 11; h-index: 3)
Caiyin Yang (Citations: 27; h-index: 3)
R. Thapa (Citations: 249; h-index: 8)
Ruihua Liu (Citations: 79; h-index: 5)
Zeyu Li (Citations: 263; h-index: 5)
D. Ding (Citations: 5; h-index: 1)
Guangrong He (Citations: 8; h-index: 2)
Miao Zhang (Citations: 37; h-index: 3)
Lin Sun (Citations: 17; h-index: 3)
Wenyang Wang (Citations: 69; h-index: 5)
Yuchen Zhong (Citations: 12; h-index: 2)
Zhuohao Shen (Citations: 21; h-index: 2)
Di He (Citations: 123; h-index: 5)
Tongyang Li (Citations: 28; h-index: 3)
Yuzhi Xu (Citations: 58; h-index: 4)
Haotian Ye (Citations: 38; h-index: 4)
Haowei Lin (Citations: 120; h-index: 6)
Chang Su (Citations: 21; h-index: 2)
Rui Yang (Citations: 16; h-index: 3)
Chongming Gao (Citations: 251; h-index: 8)
Jianfeng Ma (Citations: 24; h-index: 2)
James Z. Wang (Citations: 3; h-index: 1)

Abstract

Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. Prior work has highlighted the importance of evaluation, but it has not explicitly formulated how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery. This paper addresses that problem. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, and we show that scaling evaluation-driven discovery loops along the right dimensions unlocks substantial gains. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. In particular, SimpleTES speeds up the widely used LASSO algorithm by over 2x, designs quantum circuit routing policies that reduce gate overhead by 24.5%, and discovers new Erdős minimum overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only become more efficient on seen problems but also generalize to unseen problems, discovering solutions that the base models fail to uncover. Together, these results establish effective scaling of evaluation-driven loops as a central axis for advancing LLM-driven scientific discovery, and they provide a simple yet practical framework for realizing these gains.
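The abstract names three ingredients (parallel exploration, feedback-driven refinement, local selection) without giving the algorithm, so the following is only a toy sketch of how such a loop might be wired together, not the SimpleTES implementation. Every name here (`Chain`, `evaluate`, `propose`, `run_loop`, `best_of`, `successful_trajectories`) is hypothetical. A Gaussian perturbation stands in for the language model, a one-dimensional quadratic stands in for the verifier or simulator, and "local selection" is assumed to mean choosing a parent only from a chain's own top-scoring candidates.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

Scored = Tuple[float, float]  # (candidate, score)


@dataclass
class Chain:
    """One independent refinement chain with its own local history."""
    history: List[Scored] = field(default_factory=list)


def evaluate(x: float) -> float:
    """Toy evaluator, higher is better (assumption: stands in for a verifier)."""
    return -(x - 3.0) ** 2


def propose(parent: float, rng: random.Random) -> float:
    """Toy proposer (assumption: stands in for a feedback-conditioned LLM call)."""
    return parent + rng.gauss(0.0, 0.5)


def run_loop(num_chains: int = 4, steps: int = 50, top_k: int = 3,
             seed: int = 0) -> List[Chain]:
    rng = random.Random(seed)
    chains = []
    for _ in range(num_chains):
        x0 = rng.uniform(-10.0, 10.0)
        chains.append(Chain(history=[(x0, evaluate(x0))]))
    for _ in range(steps):
        # Parallel exploration: chains never share candidates
        # (run sequentially here for simplicity).
        for chain in chains:
            # Local selection: the parent comes from this chain's own top_k.
            elite = sorted(chain.history, key=lambda cs: cs[1], reverse=True)[:top_k]
            parent, _ = rng.choice(elite)
            # Feedback-driven refinement: propose, evaluate, record the score.
            child = propose(parent, rng)
            chain.history.append((child, evaluate(child)))
    return chains


def best_of(chains: List[Chain]) -> Scored:
    """Best (candidate, score) pair across all chains."""
    return max((cs for c in chains for cs in c.history), key=lambda cs: cs[1])


def successful_trajectories(chains: List[Chain], threshold: float) -> List[List[Scored]]:
    """Full histories of chains whose best score reaches the threshold."""
    return [c.history for c in chains
            if max(s for _, s in c.history) >= threshold]
```

Calling `best_of(run_loop())` returns the best candidate found across all chains. In this sketch, `num_chains` controls exploration width and `steps` controls refinement depth; the histories returned by `successful_trajectories` are the kind of trajectory-level record the abstract describes as supervision for post-training.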

5 Citations

0 Influential

30 Altmetric

155.0 Score
