2604.12290v1 Apr 14, 2026 cs.AI

Frontier-Eng: Evaluating Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Bingxiang He

Citations: 1,018

h-index: 9

Han Hao

Citations: 5

h-index: 1

Situ Wang

Citations: 12

h-index: 3

Y. Chi

Citations: 0

h-index: 0

Deyao Hong

Citations: 0

h-index: 0

Tian-Yuan Luo

Citations: 356

h-index: 3

Boshi Zhang

Citations: 0

h-index: 0

Dianqiao Lei

Citations: 0

h-index: 0

Qingle Liu

Citations: 0

h-index: 0

Houde Qian

Citations: 26

h-index: 1

Youjie Zheng

Citations: 35

h-index: 3

Yifan Zhou

Citations: 9

h-index: 1

E. Cai

Citations: 2

h-index: 1

Qinhuai Na

Citations: 0

h-index: 0

Dapeng Jiang

Citations: 8

h-index: 2

Kaisen Yang

Citations: 4

h-index: 1

Zhengjun Cao

Citations: 0

h-index: 0

Xiaoyan Fan

Citations: 1

h-index: 1

Weiyang Jin

Citations: 12

h-index: 2

Bowen Wang

Citations: 11

h-index: 2

C. Xiao

Citations: 0

h-index: 0

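Current LLM agent benchmarks tend to focus on tasks judged by a binary pass/fail outcome, such as code generation or search-based question answering, and often overlook the actual process of design optimization. To address this, we introduce Frontier-Eng, a human-verified benchmark for generative optimization. Frontier-Eng comprises 47 tasks across five major engineering categories, built around an iterative propose-execute-evaluate loop in which an agent generates candidate solutions, receives executable verification feedback, and revises them within a limited interaction budget. Unlike previous benchmarks, Frontier-Eng tasks are grounded in industrial-grade simulators and verification tools, provide continuous reward signals, and enforce hard constraints under a limited budget. An evaluation of eight frontier language models using representative search frameworks found that Claude 4.6 Opus performed best, though the benchmark remains challenging for all models. The analysis shows that improvement frequency decays as roughly 1/iteration and improvement magnitude as roughly 1/improvement count. It also shows that although width improves parallelism and diversity, depth remains essential for achieving significant improvements under a limited budget. Frontier-Eng sets a new standard for assessing how well AI agents integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems.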
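
The two decay rates quoted in the abstract can be turned into a rough worked estimate. What follows is only a sketch under idealized assumptions that the abstract does not state: that the probability of an improvement at iteration $t$ is exactly $c/t$ for some constant $0 < c \le 1$, and that the $k$-th improvement has magnitude exactly $a/k$. The symbols $c$, $a$, $N(T)$ and $G(K)$ are introduced here for illustration only.

$$
\mathbb{E}[N(T)] \;=\; \sum_{t=1}^{T} \frac{c}{t} \;=\; c\,H_T \;\approx\; c\,(\ln T + \gamma),
\qquad
G(K) \;=\; \sum_{k=1}^{K} \frac{a}{k} \;=\; a\,H_K \;\approx\; a\,(\ln K + \gamma),
$$

where $H_n$ is the $n$-th harmonic number and $\gamma \approx 0.577$ is the Euler-Mascheroni constant. Under these assumptions with $c = 1$, a budget of $T = 100$ iterations yields about $H_{100} \approx 5.19$ expected improvements, and a tenfold larger budget adds only about $\ln 10 \approx 2.30$ more. Because the cumulative gain grows only logarithmically in the number of improvements, it would grow roughly like $\ln \ln T$ in the budget.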

Original Abstract

Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, often neglect the value of real-world engineering that is often captured through the iterative optimization of feasible designs. To this end, we introduce Frontier-Eng, a human-verified benchmark for generative optimization -- an iterative propose-execute-evaluate loop in which an agent generates candidate artifacts, receives executable verifier feedback, and revises them under a fixed interaction budget -- spanning $47$ tasks across five broad engineering categories. Unlike previous suites, Frontier-Eng tasks are grounded in industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models using representative search frameworks, finding that while Claude 4.6 Opus achieves the most robust performance, the benchmark remains challenging for all models. Our analysis suggests a dual power-law decay in improvement frequency ($\sim$ 1/iteration) and magnitude ($\sim$ 1/improvement count). We further show that although width improves parallelism and diversity, depth remains crucial for hard-won improvements under a fixed budget. Frontier-Eng establishes a new standard for assessing the capacity of AI agents to integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems.
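
The propose-execute-evaluate loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: `toy_verifier`, `toy_propose`, `Feedback`, and `optimize` are hypothetical names, the "simulator" is a one-dimensional toy objective, and the "agent" is a random perturbation rather than a language model.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    feasible: bool  # did the candidate satisfy the hard feasibility constraint?
    score: float    # continuous reward signal (higher is better)


def toy_verifier(x: float) -> Feedback:
    """Hypothetical stand-in for a simulator: maximise -(x - 3)^2 subject to 0 <= x <= 5."""
    return Feedback(feasible=0.0 <= x <= 5.0, score=-(x - 3.0) ** 2)


def toy_propose(best: Optional[float], rng: random.Random) -> float:
    """Hypothetical stand-in for the agent: guess blindly, then perturb the best feasible design."""
    if best is None:
        return rng.uniform(-2.0, 8.0)
    return best + rng.gauss(0.0, 0.5)


def optimize(
    propose: Callable[[Optional[float], random.Random], float],
    verify: Callable[[float], Feedback],
    budget: int,
    seed: int = 0,
) -> Tuple[Optional[float], float, List[int]]:
    """Run `budget` propose-execute-evaluate rounds; return best design, its score, and improving iterations."""
    rng = random.Random(seed)
    best: Optional[float] = None
    best_score = float("-inf")
    improvements: List[int] = []
    for iteration in range(1, budget + 1):
        candidate = propose(best, rng)   # propose
        feedback = verify(candidate)     # execute and evaluate
        # Infeasible candidates are rejected outright; feasible ones must beat the incumbent.
        if feedback.feasible and feedback.score > best_score:
            best, best_score = candidate, feedback.score
            improvements.append(iteration)
    return best, best_score, improvements


if __name__ == "__main__":
    design, score, improved_at = optimize(toy_propose, toy_verifier, budget=50)
    print(f"best design: {design}, score: {score}, improved at iterations: {improved_at}")
```

The list of improving iterations returned by `optimize` is the kind of trace from which improvement frequency and magnitude could be measured, although nothing about this toy objective is meant to reproduce the decay rates reported in the paper.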

0 Citations

0 Influential

4.5 Altmetric

22.5 Score
