2601.18987v3 Jan 26, 2026 cs.CL

LLM과 정지 문제: 프로그램 종료 예측 재검토

LLMs versus the Halting Problem: Revisiting Program Termination Prediction

Oren Sultan

Citations: 106

h-index: 4

Jordi Armengol-Estapé

Citations: 167

h-index: 6

Pascal Kesseli

Citations: 17

h-index: 3

Julien Vanegue

Citations: 19

h-index: 3

Dafna Shahaf

Citations: 2,298

h-index: 26

Yossi Adi

Citations: 80

h-index: 5

Peter O'Hearn

Citations: 53

h-index: 3

프로그램이 종료되는지 여부를 결정하는 것은 컴퓨터 과학의 핵심적인 문제입니다. 튜링의 획기적인 연구는 정지 문제를 결정 불가능한 문제로 확립했으며, 이는 모든 프로그램과 입력에 대해 종료 여부를 보편적으로 결정할 수 있는 알고리즘은 존재하지 않음을 보여줍니다. 결과적으로, 자동 검증 도구는 종료 여부를 근사적으로 판단하지만, 때로는 증명하거나 반증하는 데 실패합니다. 이러한 도구는 문제에 특화된 아키텍처와 추상화를 사용하며, 일반적으로 특정 프로그래밍 언어에 종속됩니다. 최근 대규모 언어 모델(LLM)의 성공과 발전은 다음과 같은 질문을 제기합니다. LLM은 프로그램 종료 여부를 신뢰성 있게 예측할 수 있는가? 본 연구에서는 국제 소프트웨어 검증 대회(SV-Comp) 2025의 Termination 카테고리에 속하는 다양한 C 프로그램에 대해 LLM을 평가했습니다. 우리의 결과는 LLM이 프로그램 종료 여부를 예측하는 데 놀라운 성능을 보임을 보여주며, GPT-5 및 Claude Sonnet-4.5는 테스트 시간 스케일링을 사용한 최상위 도구 바로 뒤를 이으며, Code World Model (CWM)은 두 번째 순위의 도구 바로 뒤에 위치합니다. LLM은 프로그램 종료 여부 예측에 효과적이지만, 종종 유효한 증거를 제공하지 못하는 경우가 있습니다. 또한, 프로그램 길이가 증가함에 따라 LLM의 성능이 저하됩니다. 이러한 연구 결과가 프로그램 종료 문제에 대한 추가 연구를 촉진하고, LLM이 결정 불가능한 문제에 대한 추론 능력을 갖도록 하는 데 기여하기를 바랍니다.

Original Abstract

Determining whether a program terminates is a central problem in computer science. Turing's foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raises the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLMs performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.

3 Citations

0 Influential

13 Altmetric

68.0 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!