2601.09445v1 Jan 14, 2026 cs.CL

Where Knowledge Collides: A Mechanistic Study of Intra-Memory Knowledge Conflict in Language Models

Minh V. T. Pham

Hsuvas Borkakoty

Yufang Hou

Original Abstract

In language models (LMs), intra-memory knowledge conflict largely arises when inconsistent information about the same event is encoded within the model's parametric knowledge. While prior work has primarily focused on resolving conflicts between a model's internal knowledge and external resources through approaches such as fine-tuning or knowledge editing, the problem of localizing conflicts that originate during pre-training within the model's internal representations remains unexplored. In this work, we design a framework based on mechanistic interpretability methods to identify where and how conflicting knowledge from the pre-training data is encoded within LMs. Our findings contribute to a growing body of evidence that specific internal components of a language model are responsible for encoding conflicting knowledge from pre-training, and we demonstrate how mechanistic interpretability methods can be leveraged to causally intervene in and control conflicting knowledge at inference time.
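The abstract does not say which mechanistic interpretability methods the framework uses. Activation patching is one widely used causal-intervention technique for localizing where a behavior lives: run the model on two inputs, copy one internal activation from the first run into the second, and measure how much of the output difference that single component restores. The sketch below shows the idea on an invented two-layer toy model. The weights, inputs, and the names `forward` and `patching_effects` are hypothetical and are not taken from the paper; in a real LM the "components" would be attention heads or MLP neurons and the two inputs would be prompts that elicit conflicting facts.

```python
"""Toy activation patching: a minimal, hypothetical sketch.

The two-layer "model", its weights, and the inputs are invented for
illustration; they are not the paper's models, data, or method.
"""
from typing import Dict, List, Optional, Tuple

# Hypothetical weights: 3 input features -> 4 hidden units -> 1 output logit.
W1 = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]
W2 = [2.0, -1.0, 0.0, 0.5]


def relu(x: float) -> float:
    return max(0.0, x)


def forward(x: List[float],
            patch: Optional[Dict[int, float]] = None) -> Tuple[float, List[float]]:
    """Run the toy model; `patch` overwrites selected hidden activations."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value  # causal intervention on one component
    logit = sum(w * h for w, h in zip(W2, hidden))
    return logit, hidden


def patching_effects(clean: List[float], corrupted: List[float]) -> List[float]:
    """For each hidden unit, copy its clean activation into the corrupted run
    and return the fraction of the clean-vs-corrupted logit gap it restores."""
    clean_logit, clean_hidden = forward(clean)
    corrupt_logit, _ = forward(corrupted)
    gap = clean_logit - corrupt_logit
    effects = []
    for i, value in enumerate(clean_hidden):
        patched_logit, _ = forward(corrupted, patch={i: value})
        effects.append((patched_logit - corrupt_logit) / gap if gap else 0.0)
    return effects


if __name__ == "__main__":
    for unit, effect in enumerate(patching_effects([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])):
        print(f"hidden unit {unit}: restores {effect:.2f} of the logit gap")
```

With these toy inputs, units 0 and 1 carry the whole difference between the two runs while units 2 and 3 carry none, which is the kind of localization signal patching is meant to surface. Overwriting an activation at inference time, as `forward(..., patch=...)` does, is also the basic mechanism behind steering a model's output once the responsible component is found.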
