2604.23829v1 Apr 26, 2026 cs.AI

Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features

John Winnicki

Citations: 8

h-index: 2

Abeynaya Gnanasekaran

Citations: 154

h-index: 7

Eric Darve

Citations: 9

h-index: 2

Original Abstract

Sparse autoencoders (SAEs) extract millions of interpretable features from a language model, but flat feature inventories aren't very useful on their own. Domain concepts get mixed with generic and weakly grounded features, while related ideas are scattered across many units, and there's no way to understand relationships between features. We address this by first constructing a strict domain-specific concept universe from a large SAE inventory using contrastive activations and a multi-stage filtering process. Next, we build two aligned graph views on the filtered set: a co-occurrence graph for corpus-level conceptual structure, organized at multiple levels of granularity, and a transcoder-based mechanism graph that links source-layer and target-layer features through sparse latent pathways. Automated edge labeling then turns these graph views into readable knowledge graphs rather than unlabeled layouts. In a case study on a biology textbook, these graphs recover coherent chapter and subchapter-level structure, reveal concepts that bridge neighboring topics, and transform messy sentence-level activity containing thousands of features into compact, readable views that illustrate the model's local activity. Taken together, this reframes a flat SAE inventory as an internal knowledge graph that converts feature-level interpretability into a global map of model knowledge and enables audits of reasoning faithfulness.
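The first two steps of the abstract (contrastive domain filtering, then a co-occurrence graph over the surviving features) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the abstract does not give the filtering criteria or edge weighting, so the mean-activation ratio test, the per-sentence co-occurrence counts, and the names `domain_features`, `cooccurrence_edges`, `min_ratio`, and `min_count` are hypothetical stand-ins, not the authors' method.

```python
from collections import Counter
from itertools import combinations


def domain_features(domain_acts, generic_acts, min_ratio=2.0, eps=1e-6):
    """Keep feature indices whose mean activation on domain text is at
    least `min_ratio` times their mean on generic text (assumed criterion)."""
    n_features = len(domain_acts[0])
    kept = []
    for j in range(n_features):
        dom_mean = sum(row[j] for row in domain_acts) / len(domain_acts)
        gen_mean = sum(row[j] for row in generic_acts) / len(generic_acts)
        if dom_mean / (gen_mean + eps) >= min_ratio:
            kept.append(j)
    return kept


def cooccurrence_edges(domain_acts, kept, threshold=0.0, min_count=2):
    """Count how often two kept features are active in the same sentence
    and keep pairs seen at least `min_count` times (assumed edge rule)."""
    counts = Counter()
    for row in domain_acts:
        active = [j for j in kept if row[j] > threshold]
        for a, b in combinations(active, 2):
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


# Toy data: rows are sentences, columns are SAE feature activations.
domain_acts = [
    [0.9, 0.8, 0.0, 0.5],
    [0.7, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.9, 0.6],
]
generic_acts = [
    [0.0, 0.1, 0.0, 0.5],
    [0.1, 0.0, 0.1, 0.6],
]

kept = domain_features(domain_acts, generic_acts)
edges = cooccurrence_edges(domain_acts, kept)
print(kept)   # feature 3 fires equally on generic text, so it is dropped
print(edges)  # features 0 and 1 co-occur in two sentences
```

In this toy setup the generic feature is filtered out before any edges are built, which is the point of doing the contrastive step first: generic features that fire everywhere would otherwise connect to almost every node in the graph.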
