2602.09464v1 Feb 10, 2026 cs.SE

AlgoVeri: 고전 알고리즘 기반 검증된 코드 생성에 대한 통합 벤치마크

AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

V. Veeravalli

Citations: 13,797

h-index: 58

Haoyu Zhao

Citations: 284

h-index: 5

Ziran Yang

Citations: 123

h-index: 3

Jiawei Li

Citations: 13

h-index: 2

D.M. He

Citations: 8

h-index: 2

Zenan Li

Citations: 214

h-index: 7

C. Jin

Citations: 53

h-index: 4

Aarti Gupta

Citations: 24

h-index: 3

Sanjeev Arora

Citations: 313

h-index: 3

검증 코딩은 엄격한 사양으로부터 형식적으로 검증된 코드를 생성하는 것을 의미합니다. 최근 AI 모델은 검증 코딩 분야에서 잠재력을 보여주고 있지만, 다양한 패러다임을 평가하기 위한 통일된 방법론이 부족합니다. 기존 벤치마크는 개별 언어/도구(예: Dafny, Verus, Lean)만을 테스트하며, 각 벤치마크는 매우 다른 작업을 다루기 때문에 성능 수치를 직접적으로 비교할 수 없습니다. 우리는 AlgoVeri라는 벤치마크를 통해 이 문제를 해결하고자 합니다. AlgoVeri는 Dafny, Verus, Lean에서 77개의 고전 알고리즘에 대한 검증 코딩을 평가합니다. 동일한 기능적 계약을 적용함으로써, AlgoVeri는 검증 시스템의 중요한 기능적 격차를 드러냅니다. 최첨단 모델은 고수준 추상화 및 SMT 자동화가 워크플로우를 단순화하는 Dafny에서 상당한 성공을 거두었습니다 (Gemini-3 Flash의 경우 40.3%). 그러나 Verus의 시스템 수준 메모리 제약 조건 (24.7%)과 Lean에서 요구되는 명시적인 증명 구성 (7.8%)에서는 성능이 현저히 저하됩니다. 전체적인 지표 외에도, 테스트 시간 동안의 계산 동역학에서 뚜렷한 차이가 나타납니다. Gemini-3는 성능 향상을 위해 반복적인 수정 방법을 효과적으로 활용합니다 (예: Dafny에서 합격률 3배 증가), 반면 GPT-OSS는 초기 단계에서 성능이 정체됩니다. 또한, 오류 분석 결과 언어 설계가 개선 과정을 좌우한다는 것을 알 수 있습니다. Dafny는 모델이 논리적 정확성에 집중할 수 있도록 하는 반면, Verus와 Lean은 모델을 지속적인 구문 및 의미적 장벽에 가두어 버립니다. 모든 데이터 및 평가 코드는 https://github.com/haoyuzhao123/algoveri 에서 확인할 수 있습니다.

Original Abstract

Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.

4 Citations

0 Influential

49 Altmetric

249.0 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!