2601.07411v1 Jan 12, 2026 cs.LG

SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis

Zihao Fu

Xufeng Duan

Zhenguang G. Cai

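Large language models perform strongly across many domains, but limited understanding of their internal mechanisms restricts their use in areas such as healthcare, legal systems, and autonomous decision-making. As these models are integrated into critical systems, understanding how they encode capabilities has become central to interpretability research. Existing approaches identify important modules through gradient attribution or activation analysis, assuming that specific capabilities map to specific components. This assumption oversimplifies neural computation: a single module can contribute to several capabilities at once, and a single capability can be distributed across several modules. Such coarse-grained analyses fail to capture fine-grained, distributed capability encoding. This paper presents SCALPEL (Selective Capability Ablation via Low-rank Parameter Editing for Large language models), a framework that represents capabilities as low-rank parameter subspaces rather than as discrete modules. The key insight is that a capability can be characterized by low-rank modifications distributed across layers and modules, which makes it possible to remove a specific capability precisely without affecting others. SCALPEL trains LoRA adapters to reduce the model's ability to distinguish correct from incorrect answers while preserving general language modeling quality, thereby identifying the low-rank representation responsible for a particular capability and separating it from other capabilities. In experiments on diverse capability and linguistic tasks using the BLiMP dataset, SCALPEL removes the target capabilities while preserving general capabilities, giving fine-grained insight into how capabilities are distributed across parameter space. The results indicate that capabilities have low-rank structure and can be selectively ablated through targeted interventions in parameter space, offering a nuanced understanding of capability encoding in LLMs.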

Original Abstract

Large language models excel across diverse domains, yet their deployment in healthcare, legal systems, and autonomous decision-making remains limited by incomplete understanding of their internal mechanisms. As these models integrate into high-stakes systems, understanding how they encode capabilities has become fundamental to interpretability research. Traditional approaches identify important modules through gradient attribution or activation analysis, assuming specific capabilities map to specific components. However, this oversimplifies neural computation: modules may contribute to multiple capabilities simultaneously, while single capabilities may distribute across multiple modules. These coarse-grained analyses fail to capture fine-grained, distributed capability encoding. We present SCALPEL (Selective Capability Ablation via Low-rank Parameter Editing for Large language models), a framework representing capabilities as low-rank parameter subspaces rather than discrete modules. Our key insight is that capabilities can be characterized by low-rank modifications distributed across layers and modules, enabling precise capability removal without affecting others. By training LoRA adapters to reduce distinguishing correct from incorrect answers while preserving general language modeling quality, SCALPEL identifies low-rank representations responsible for particular capabilities while remaining disentangled from others. Experiments across diverse capability and linguistic tasks from BLiMP demonstrate that SCALPEL successfully removes target capabilities while preserving general capabilities, providing fine-grained insights into capability distribution across parameter space. Results reveal that capabilities exhibit low-rank structure and can be selectively ablated through targeted parameter-space interventions, offering nuanced understanding of capability encoding in LLMs.
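
The idea of a low-rank edit trained to erase a correct-versus-incorrect preference while keeping general language modeling intact can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the abstract does not give the actual objective, so the squared-margin term, the KL preservation term, and the names `lora_delta`, `ablation_objective`, `alpha`, and `lam` are all hypothetical and are not taken from the paper.

```python
import numpy as np

def lora_delta(A, B, alpha=1.0):
    """Low-rank weight update dW = (alpha / r) * B @ A.

    A has shape (r, d_in) and B has shape (d_out, r), so dW has rank at most r.
    """
    r = A.shape[0]
    return (alpha / r) * (B @ A)

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ablation_objective(logp_correct, logp_incorrect,
                       base_logits, edited_logits, lam=1.0):
    """Hypothetical objective (an assumption, not the paper's loss).

    ablate:   squared log-probability margin between the correct and the
              incorrect answer; it is zero when the edited model no longer
              prefers either one.
    preserve: KL(base || edited) over next-token distributions on general
              text, penalizing drift in ordinary language modeling.
    """
    margin = logp_correct - logp_incorrect
    ablate = np.mean(margin ** 2)
    base_lp = log_softmax(base_logits)
    edit_lp = log_softmax(edited_logits)
    preserve = np.mean(np.sum(np.exp(base_lp) * (base_lp - edit_lp), axis=-1))
    return ablate + lam * preserve

# Toy usage: a rank-2 edit to a 6x8 weight matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 8))
B = rng.normal(size=(6, 2))
dW = lora_delta(A, B)
```

In this sketch the minimum of the objective is reached when the edited model assigns equal probability to both answers of a minimal pair and matches the base model's predictions on general text, which mirrors the trade-off the abstract describes.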
