2604.02709v1 Apr 03, 2026 cs.CL

초름스키 계층 구조를 통한 대규모 언어 모델의 형식적 추론 능력 평가

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

Xue Jiang

Citations: 1,270

h-index: 11

Jiaru Qian

Citations: 101

h-index: 3

Kechi Zhang

Peking University

Citations: 1,011

h-index: 13

Jia Li

Citations: 356

h-index: 8

Zhi Jin

Citations: 3,489

h-index: 28

Yihong Dong

Peking University

Citations: 1,984

h-index: 20

Xiaoha Jian

Citations: 0

h-index: 0

Xuyuan Guo

Citations: 12

h-index: 2

Ge Li

Citations: 178

h-index: 4

Zhiyuan Fan

Citations: 40

h-index: 4

대규모 언어 모델(LLM)의 형식적 추론 능력은 자동화된 소프트웨어 엔지니어링 발전에 매우 중요합니다. 그러나 기존 LLM 평가 지표는 계산 및 복잡성을 기반으로 한 체계적인 평가가 부족하여, LLM의 형식적 추론 능력에 대한 이해에 중요한 격차가 존재합니다. 따라서, 현재 최고 성능을 보이는 LLM이 계산 이론에서 정의하는 형식 언어의 구조적이고 계층적인 복잡성을 제대로 이해하는지 여부는 아직 명확하지 않습니다. 이러한 문제를 해결하기 위해, 우리는 초름스키 계층 구조의 관점에서 LLM을 체계적으로 평가하는 벤치마크인 ChomskyBench를 소개합니다. 기존 연구에서 신경망에 대한 벡터화된 분류 방식을 사용한 것과는 달리, ChomskyBench는 완전한 초름스키 계층 구조 커버리지, 자연어 기반의 프로세스 추적 평가, 그리고 결정적인 기호 검증을 최초로 결합했습니다. ChomskyBench는 각 수준에서의 능력을 테스트하기 위해 설계된 광범위한 언어 인식 및 생성 작업 모음으로 구성되어 있습니다. 광범위한 실험 결과, 명확한 성능 계층 구조가 나타났으며, 이는 계층 구조의 복잡성 수준과 상관 관계를 보입니다. 분석 결과, 작업 난이도가 증가함에 따라 추론 길이와 성능에 상당한 영향을 미치는 직접적인 관계가 있음을 확인했습니다. 또한, 더 큰 모델과 고급 추론 방법이 상당한 상대적 이점을 제공하지만, 실용적인 신뢰성을 확보하려면 엄청난 계산 비용이 필요하며, 현재의 제한 사항은 절대적인 능력의 한계가 아니라 비효율성에서 비롯된다는 것을 발견했습니다. 시간 복잡성 분석 결과, LLM은 이러한 형식적인 작업에 대해 기존의 알고리즘 프로그램보다 훨씬 덜 효율적인 것으로 나타났습니다. 이러한 결과는 현재 LLM의 실질적인 한계를 명확히 하고, 기존 소프트웨어 도구의 중요성을 강조하며, 더욱 강력한 형식적 추론 능력을 갖춘 미래 LLM 개발을 위한 통찰력을 제공합니다.

Original Abstract

The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments indicate a clear performance stratification that correlates with the hierarchy's levels of complexity. Our analysis reveals a direct relationship where increasing task difficulty substantially impacts both inference length and performance. Furthermore, we find that while larger models and advanced inference methods offer notable relative gains, they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, revealing that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further indicates that LLMs are significantly less efficient than traditional algorithmic programs for these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with more powerful formal reasoning capabilities.

0 Citations

0 Influential

14 Altmetric

70.0 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!