2601.02978v1 Jan 06, 2026 cs.CL

Mechanistic Understanding of LLMs: Retrieving and Steering High-Order Semantic Features with Sparse Autoencoders

Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders

Shuo Wang

Citations: 123

h-index: 6

Ruikang Zhang

Citations: 2

h-index: 1

Qi Su

Citations: 10

h-index: 2

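Recent research in mechanistic interpretability (MI) has made it possible to identify and intervene on features inside large language models (LLMs). A remaining challenge, however, is linking these internal features to reliable control of complex, behavior-level semantic attributes during language generation. In this paper, we propose a sparse-autoencoder-based framework for retrieving and steering semantically interpretable internal features associated with high-order linguistic behaviors. Our method uses a contrastive feature retrieval pipeline built on controlled semantic oppositions, combining statistical activation analysis with generation-based validation to extract monosemantic functional features from the sparse activation space. Using the Big Five personality traits as a case study, we show that our method steers model behavior precisely and bidirectionally while maintaining better stability and performance than existing activation steering methods such as Contrastive Activation Addition (CAA). We also identify an empirical effect that we call Functional Faithfulness: intervening on a specific internal feature induces coherent, predictable shifts across multiple linguistic dimensions consistent with the target semantic attribute. Our findings suggest that LLMs internalize deeply integrated representations of high-order concepts, and they point to a new, robust mechanistic path for regulating complex AI behavior.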

Original Abstract

Recent work in Mechanistic Interpretability (MI) has enabled the identification and intervention of internal features in Large Language Models (LLMs). However, a persistent challenge lies in linking such internal features to the reliable control of complex, behavior-level semantic attributes in language generation. In this paper, we propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features associated with high-level linguistic behaviors. Our method employs a contrastive feature retrieval pipeline based on controlled semantic oppositions, combining statistical activation analysis and generation-based validation to distill monosemantic functional features from sparse activation spaces. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior while maintaining superior stability and performance compared to existing activation steering methods like Contrastive Activation Addition (CAA). We further identify an empirical effect, which we term Functional Faithfulness, whereby intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute. Our findings suggest that LLMs internalize deeply integrated representations of high-order concepts, and provide a novel, robust mechanistic path for the regulation of complex AI behaviors.
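
The abstract describes the pipeline only at a high level, so the following is a minimal toy sketch of the general idea, not the authors' implementation. It assumes a ReLU sparse-autoencoder encoder, a standardized mean-difference score as a stand-in for the "statistical activation analysis" (the abstract does not name the statistic), and steering by adding a scaled decoder direction. All names (`sae_encode`, `retrieve_feature`, `steer`, `W_enc`, `W_dec`, `alpha`) are hypothetical, the data is synthetic, the dictionary is a small orthonormal one chosen for simplicity, and the generation-based validation step is not shown.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder of a toy sparse autoencoder: activations -> feature activations."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def retrieve_feature(pos_acts, neg_acts, W_enc, b_enc):
    """Score each feature by a standardized mean difference between two contrastive sets.

    The scoring statistic is an assumption made for this sketch.
    """
    f_pos = sae_encode(pos_acts, W_enc, b_enc)
    f_neg = sae_encode(neg_acts, W_enc, b_enc)
    diff = f_pos.mean(axis=0) - f_neg.mean(axis=0)
    pooled = np.sqrt(0.5 * (f_pos.var(axis=0) + f_neg.var(axis=0))) + 1e-8
    scores = diff / pooled
    return int(np.argmax(scores)), scores

def steer(x, W_dec, feature_idx, alpha):
    """Add alpha times the unit-norm decoder direction of one feature.

    A negative alpha pushes the activation in the opposite direction.
    """
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return x + alpha * direction

# Synthetic demo: plant one "trait" direction and check that retrieval finds it.
rng = np.random.default_rng(0)
d_model = 16
Q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
W_dec = Q            # rows are orthonormal feature directions (toy assumption)
W_enc = Q.T          # tied encoder, so x @ W_enc projects onto each direction
b_enc = np.zeros(d_model)

planted = 5
noise_pos = 0.1 * rng.normal(size=(200, d_model))
noise_neg = 0.1 * rng.normal(size=(200, d_model))
pos_acts = noise_pos + 3.0 * W_dec[planted]   # "high trait" prompts
neg_acts = noise_neg - 3.0 * W_dec[planted]   # "low trait" prompts

idx, scores = retrieve_feature(pos_acts, neg_acts, W_enc, b_enc)
steered_up = steer(np.zeros(d_model), W_dec, idx, alpha=4.0)
steered_down = steer(np.zeros(d_model), W_dec, idx, alpha=-4.0)
print("retrieved feature:", idx)
```

In a real setting the activations would come from a chosen layer of the model and the dictionary from a trained, overcomplete sparse autoencoder; the sign of `alpha` is what makes the intervention bidirectional in this sketch.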

0 Citations

0 Influential

3 Altmetric

15.0 Score
