2603.07886v1 Mar 09, 2026 cs.CL

CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

Yiqiao Huang, Fanyu Meng, Jiachen Li, Chao Deng, Rui Liu, Xiaona Xue, Yuanhang Zheng, Hui Miao, Yunfei Ma, Xin Sun, Minglu Liu, Junlan Feng

Abstract

Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of content and format, logical workflow control, and real-world applications. This leads to a significant gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived entirely from real-world industrial scenarios. Extensive experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of real-world instruction understanding. We believe that CCR-Bench offers a more rigorous and realistic evaluation framework, advancing the development of LLMs toward the next generation of models capable of understanding and executing complex tasks in industrial applications.
