2602.02188v1 Feb 02, 2026 cs.AI

조합 및 제약 환경에서의 추론: 자연어 조합 최적화에 대한 LLM 벤치마킹

Reasoning in a Combinatorial and Constrained World: Benchmarking LLMs on Natural-Language Combinatorial Optimization

Cong Zhang

Citations: 42

h-index: 2

Chenhao Zhang

Citations: 1

h-index: 1

Yaoxin Wu

Citations: 80

h-index: 4

Yingqian Zhang

Citations: 501

h-index: 11

Jing Chen

Citations: 2,771

h-index: 6

Xia Jiang

Citations: 39

h-index: 3

Jie Gao

Citations: 66

h-index: 5

Chengpeng Hu

Citations: 4

h-index: 1

대규모 언어 모델(LLM)이 수학 및 논리적 추론에서 강력한 성능을 보여주었지만, 엄격한 제약 조건 하에서 고차원 해 공간을 탐색해야 하는 조합 최적화(CO)를 처리하는 능력은 아직 충분히 연구되지 않았습니다. 이러한 격차를 해소하기 위해, 우리는 엔드투엔드 CO 추론 능력을 평가하는 자연어 조합 최적화(NLCO) 벤치마크를 소개합니다. 이 벤치마크에서 모델은 자연어로 기술된 의사 결정 시나리오가 주어졌을 때, 코드를 작성하거나 외부 솔버를 호출하지 않고 이산적인 해를 도출해야 합니다. NLCO는 43개의 CO 문제를 다루며, 변수 유형, 제약 조건군, 전역 패턴, 목적 함수 클래스로 구성된 4단계 분류 체계를 사용하여 세밀한 평가가 가능합니다. 우리는 솔버(solver)로 검증된 해를 제공하며, 실행 가능성, 해의 최적성, 추론 효율성을 기준으로 LLM을 포괄적으로 평가합니다. 다양한 최신 LLM을 대상으로 한 실험 결과, 고성능 모델들은 작은 인스턴스에서는 강력한 실행 가능성과 해 품질을 달성하지만, 추론에 더 많은 토큰을 사용하더라도 인스턴스 크기가 커짐에 따라 성능이 저하되는 것으로 나타났습니다. 또한 분류 체계 전반에 걸쳐 체계적인 효과가 관찰되었습니다. 집합 기반 작업은 비교적 쉬운 반면, 그래프 구조 문제와 병목(bottleneck) 목적 함수는 더 빈번한 실패를 초래했습니다.

Original Abstract

While large language models (LLMs) have shown strong performance in math and logic reasoning, their ability to handle combinatorial optimization (CO) -- searching high-dimensional solution spaces under hard constraints -- remains underexplored. To bridge the gap, we introduce NLCO, a \textbf{N}atural \textbf{L}anguage \textbf{C}ombinatorial \textbf{O}ptimization benchmark that evaluates LLMs on end-to-end CO reasoning: given a language-described decision-making scenario, the model must output a discrete solution without writing code or calling external solvers. NLCO covers 43 CO problems and is organized using a four-layer taxonomy of variable types, constraint families, global patterns, and objective classes, enabling fine-grained evaluation. We provide solver-annotated solutions and comprehensively evaluate LLMs by feasibility, solution optimality, and reasoning efficiency. Experiments across a wide range of modern LLMs show that high-performing models achieve strong feasibility and solution quality on small instances, but both degrade as instance size grows, even if more tokens are used for reasoning. We also observe systematic effects across the taxonomy: set-based tasks are relatively easy, whereas graph-structured problems and bottleneck objectives lead to more frequent failures.

0 Citations

0 Influential

5.5 Altmetric

27.5 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!