2604.18177v2 Apr 20, 2026 cs.CL

STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs

S. Kadhe (citations: 1,565; h-index: 22)
Chad DeLuca (citations: 34; h-index: 3)
Shailja Thakur (citations: 1,083; h-index: 11)
Hima Patel (citations: 146; h-index: 3)
Sungeun An (citations: 122; h-index: 7)

Abstract

Benchmarks are often used as a standard to understand LLM capabilities in different domains. However, aggregate benchmark scores provide limited insight into compositional skill gaps of LLMs and how to improve them. To make these weaknesses visible, we propose the Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark tasks based on the concept of scaffolding, which introduces structured, incremental support in a step-by-step manner. Rather than inspecting failures individually, this approach enables systematic and scalable probing of model behavior by identifying the specific reasoning skill compositions they lack. Treating the LLM as a black box, our experiments on six models of varying sizes reveal multiple failure points in three reasoning benchmarks and highlight each model's distinct skill gaps.
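
The scaffolding idea described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Task` structure, the hint wording, the exact-match scoring, and the names `scaffolded_variants` and `first_passing_level` are all hypothetical, and the model is just any function from a prompt string to an answer string (a black box).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Task:
    """Hypothetical benchmark item: a question, its ordered
    intermediate reasoning steps, and the reference answer."""
    question: str
    steps: List[str]
    answer: str


def scaffolded_variants(task: Task) -> List[str]:
    """Build one prompt per scaffold level k = 0..len(steps),
    where level k reveals the first k intermediate steps."""
    prompts = []
    for k in range(len(task.steps) + 1):
        hints = "".join(
            f"\nGiven step {i + 1}: {step}"
            for i, step in enumerate(task.steps[:k])
        )
        prompts.append(task.question + hints)
    return prompts


def first_passing_level(task: Task, model: Callable[[str], str]) -> Optional[int]:
    """Return the smallest scaffold level at which the black-box
    model answers correctly (exact match), or None if it never does."""
    for k, prompt in enumerate(scaffolded_variants(task)):
        if model(prompt).strip() == task.answer:
            return k
    return None


# Toy example (assumed data): a stand-in "model" that only succeeds
# once the final multiplication step has been handed to it.
task = Task(
    question="What is (2 + 3) * 4?",
    steps=["2 + 3 = 5", "5 * 4 = 20"],
    answer="20",
)


def toy_model(prompt: str) -> str:
    return "20" if "5 * 4 = 20" in prompt else "?"


print(first_passing_level(task, toy_model))  # → 2
```

In this sketch, the level at which a model first succeeds points to the step it could not supply on its own; aggregating those levels over many tasks is one way such a probe could surface the skill combinations a model lacks.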

