2601.22594v1 Jan 30, 2026 cs.CL

Language Model Circuits Are Sparse in the Neuron Basis

Aryaman Arora

Stanford University

Zhengxuan Wu

Jacob Steinhardt

Sarah Schwettmann

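The high-level concepts that a neural network uses to perform computation need not correspond one-to-one to individual neurons (Smolensky, 1986). Language model interpretability research has therefore used techniques such as `sparse autoencoders (SAEs)` to decompose the neuron basis into more interpretable units of model computation, for tasks such as `circuit tracing`. However, not every representation in the neuron basis is uninterpretable. We show empirically, for the first time, that **MLP (multi-layer perceptron) neurons are as sparse a feature basis as SAEs**. Building on this finding, we develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which uses gradient-based attribution to locate causal circuits across a variety of tasks. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of roughly 10^2 MLP neurons is enough to control model behaviour. On the multi-hop city-to-state-to-capital reasoning task from Lindsey et al. (2025), small sets of neurons encode specific latent reasoning steps (e.g., `map city to its state`) and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.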

Original Abstract

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as *sparse autoencoders* (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as *circuit tracing*. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that **MLP neurons are as sparse a feature basis as SAEs**. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city $\to$ state $\to$ capital task from Lindsey et al. (2025), we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g., "map city to its state"), and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.
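
The abstract names gradient-based attribution over MLP neurons and steering of the resulting circuit, but gives no implementation details. The following is a minimal sketch of that general idea on a toy one-hidden-layer MLP, not the authors' pipeline. The weights, the `tanh` behaviour metric, the gradient-times-activation score, and every function name here are assumptions made for illustration. Gradients are taken by finite differences so the block needs only the standard library; a real model would use automatic differentiation.

```python
import math

# Toy one-hidden-layer MLP. All weights are invented for illustration.
W_IN = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5], [-1.0, 1.0]]  # 4 neurons x 2 inputs
W_OUT = [2.0, -0.1, 0.05, 1.5]                              # readout weight per neuron


def neuron_activations(x):
    """ReLU activations of the hidden neurons for input vector x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_IN]


def metric(acts):
    """Scalar behaviour metric computed from neuron activations
    (a hypothetical stand-in for something like a logit difference)."""
    return math.tanh(sum(w * a for w, a in zip(W_OUT, acts)))


def grad_wrt_activations(acts, eps=1e-6):
    """Central finite-difference gradient of the metric w.r.t. each activation."""
    grads = []
    for i in range(len(acts)):
        up, down = list(acts), list(acts)
        up[i] += eps
        down[i] -= eps
        grads.append((metric(up) - metric(down)) / (2 * eps))
    return grads


def attribute(x):
    """Gradient-times-activation score per neuron: a first-order estimate
    of how much the metric changes if that neuron is zeroed out."""
    acts = neuron_activations(x)
    return [a * g for a, g in zip(acts, grad_wrt_activations(acts))]


def top_k_circuit(scores, k):
    """Indices of the k neurons with the largest absolute attribution."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)[:k]


def steer(x, neurons, scale):
    """Rescale the chosen neurons' activations and recompute the metric.
    scale=0.0 is zero-ablation; scale=1.0 leaves the model unchanged."""
    acts = neuron_activations(x)
    for i in neurons:
        acts[i] *= scale
    return metric(acts)


x = [1.0, 0.2]
circuit = top_k_circuit(attribute(x), k=1)
baseline = metric(neuron_activations(x))
ablated = steer(x, circuit, scale=0.0)
print("circuit:", circuit)
print("baseline metric:", round(baseline, 3), "after ablation:", round(ablated, 3))
```

In this toy setting a single neuron carries almost all of the metric, so ablating the top-1 "circuit" collapses the output. Because the score is only a first-order estimate, it can understate the true ablation effect when the metric saturates, which is why the sketch checks the circuit with an actual intervention rather than trusting the scores alone.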
