2605.05686v1 May 07, 2026 cs.AI

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Qiyao Liang

Citations: 54

h-index: 4

I. Fiete

Citations: 6,168

h-index: 36

Risto Miikkulainen

Citations: 80

h-index: 4

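Language models draw on two knowledge sources: facts stored in the model weights (parametric memory, PM) and information supplied in context (working memory, WM). This study analyzes two mechanistically distinct failure modes, conflict and hallucination. Conflict arises when PM and WM disagree and interfere with each other; hallucination arises when the queried fact was never learned. Both produce confident output, so output-based monitoring is blind by design. The study shows that the two failure modes share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is competition between basins: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is the absence of a basin: when no memorized basin exists, the hidden state drifts freely. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and outputs a confident answer either way. The account is verified on a controlled synthetic task in which entity identifiers are mapped to unique codes and PM is installed through LoRA adapters. In this setting the ground truth is known exactly, and the role of each component can be causally isolated by choosing where the adapters are placed. Geometric margin, the distance from the hidden state to the nearest memorized basin, reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy. Entropy-based detection cannot avoid rejecting the vast majority of correct outputs, whereas geometric margin yields zero false refusals. The separation also holds for natural-language factual queries on the pretrained model with no additional tuning, which indicates that attractor geometry is a structural property rather than a by-product of fine-tuning. The fraction of confident hallucinations follows the scaling law $C = \exp(-c/\bar{\Delta})$ and grows with scale even as the overall error rate falls. Hidden states reliably encode epistemic state, but the frozen output head systematically erases it, and this erasure worsens with scale.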
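The contrast between geometric margin and output entropy can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: basins are represented by hypothetical centroid vectors, the margin is taken as Euclidean distance to the nearest centroid, and the "frozen head" is a made-up 2x2 linear map. The point is only that a hidden state far from every stored centroid can still yield a near-zero-entropy output.

```python
import numpy as np


def geometric_margin(hidden, centroids):
    """Distance from a hidden state to the nearest stored centroid.

    Assumption: each memorized basin is summarized by one centroid and
    distance is Euclidean; the paper may define the margin differently.
    """
    dists = np.linalg.norm(centroids - hidden, axis=1)
    return float(dists.min())


def output_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits)."""
    z = logits - logits.max()              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    p = np.clip(p, 1e-12, 1.0)             # avoid log(0)
    return float(-(p * np.log(p)).sum())


# Hypothetical toy setup: two memorized facts in a 2-D hidden space.
centroids = np.array([[4.0, 0.0], [0.0, 4.0]])
head = 3.0 * np.eye(2)                     # stand-in for a frozen LM head

recalled = np.array([3.9, 0.1])            # sits inside a stored basin
drifting = np.array([9.0, -6.0])           # far from every stored basin

for name, h in [("recalled", recalled), ("drifting", drifting)]:
    print(name,
          "margin=%.3f" % geometric_margin(h, centroids),
          "entropy=%.6f" % output_entropy(head @ h))
```

In this toy case both states give an entropy close to zero, so an entropy threshold cannot separate them, while the margin differs by more than an order of magnitude.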

Original Abstract

Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes--conflict, when PM and WM disagree and interfere; and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a controlled synthetic task--entity identifiers mapped to unique codes with PM installed via LoRA adapters--where ground truth is exact and component roles can be causally isolated through targeted adapter placement. Geometric margin--the hidden state's distance to the nearest memorized basin--reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals where entropy-based detection cannot avoid rejecting the vast majority of correct outputs. The separation holds on natural-language factual queries from the pretrained model with no adaptation, confirming attractor geometry is structural rather than a fine-tuning artifact. The fraction of confident hallucinations follows a scaling law $C = \exp(-c/\bar{\Delta})$, growing with scale even as overall error rates fall. Hidden states reliably encode epistemic state; the frozen output head systematically erases it--and this erasure worsens with scale.
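The shape of the stated scaling law can be checked with a short derivation. The abstract does not define $c$ or $\bar{\Delta}$, so the following assumes only that both are positive; under that assumption $C$ is strictly increasing in $\bar{\Delta}$ and stays between 0 and 1, as a fraction should.

```latex
C(\bar{\Delta}) = \exp\!\left(-\frac{c}{\bar{\Delta}}\right),
\qquad c > 0,\ \bar{\Delta} > 0 \quad \text{(assumed)}

\frac{dC}{d\bar{\Delta}}
  = \frac{c}{\bar{\Delta}^{2}} \exp\!\left(-\frac{c}{\bar{\Delta}}\right) > 0

\lim_{\bar{\Delta}\to 0^{+}} C = 0,
\qquad
\lim_{\bar{\Delta}\to\infty} C = 1
```

Read this way, the claim that confident hallucinations grow with scale would correspond to $\bar{\Delta}$ increasing with model scale, though the abstract alone does not confirm that interpretation.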

0 Citations

0 Influential

18 Altmetric

90.0 Score
