2604.03754v1 Apr 04, 2026 cs.CL

Exploring the Limits of Truth Directions in LLMs

Testing the Limits of Truth Directions in LLMs

Mark Crovella

Citations: 18

h-index: 2

Angelos Poulis

Citations: 10

h-index: 1

Evimaria Terzi

Citations: 5,807

h-index: 34

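Large language models (LLMs) have been shown to represent the truth of statements along a linear truth direction in their activation space. Earlier studies argued that these directions are universal in certain respects, but more recent work has questioned that conclusion, pointing to limited generalization in some settings. This work presents several limits on the universality of truth directions that had not previously been made clear. First, it shows that truth directions vary strongly across a model's layers, so that fully understanding universality requires probing many layers. Second, it shows that truth directions depend heavily on task type: they tend to emerge in earlier layers for factual tasks and in later layers for reasoning tasks, and probe performance varies with task complexity. Finally, it shows that the instructions given to the model strongly affect truth directions, and that even simple correctness-evaluation instructions substantially change how well truth probes generalize. The findings suggest that universality claims for truth directions are more limited than previously known, with substantial differences across model layers, task difficulties, task types, and prompt templates.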

Original Abstract

Large language models (LLMs) have been shown to encode truth of statements in their activation space along a linear truth direction. Previous studies have argued that these directions are universal in certain aspects, while more recent work has questioned this conclusion drawing on limited generalization across some settings. In this work, we identify a number of limits of truth-direction universality that have not been previously understood. We first show that truth directions are highly layer-dependent, and that a full understanding of universality requires probing at many layers in the model. We then show that truth directions depend heavily on task type, emerging in earlier layers for factual and later layers for reasoning tasks; they also vary in performance across levels of task complexity. Finally, we show that model instructions dramatically affect truth directions; simple correctness evaluation instructions significantly affect the generalization ability of truth probes. Our findings indicate that universality claims for truth directions are more limited than previously known, with significant differences observable for various model layers, task difficulties, task types, and prompt templates.
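To make the idea of layer-wise probing for a linear truth direction concrete, here is a minimal sketch. It is not the authors' method or data: it assumes a difference-of-means probe (one simple way to obtain a linear direction) and uses synthetic Gaussian "activations" whose label-dependent shift grows with a hypothetical per-layer signal strength. All function names, dimensions, and signal values are illustrative assumptions.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_truth_direction(acts, labels):
    """Difference-of-means probe (an assumed, simple probe choice).

    Returns a direction pointing from the mean of false examples to the
    mean of true examples, and a bias placing the decision boundary at
    the midpoint between the two class means.
    """
    true_mean = mean_vector([a for a, y in zip(acts, labels) if y == 1])
    false_mean = mean_vector([a for a, y in zip(acts, labels) if y == 0])
    direction = [t - f for t, f in zip(true_mean, false_mean)]
    midpoint = [(t + f) / 2 for t, f in zip(true_mean, false_mean)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias

def probe_accuracy(direction, bias, acts, labels):
    """Fraction of examples whose projected score falls on the correct side."""
    correct = 0
    for a, y in zip(acts, labels):
        score = sum(d * x for d, x in zip(direction, a)) + bias
        correct += int((score > 0) == (y == 1))
    return correct / len(labels)

def synthetic_layer(rng, n, dim, signal):
    """Toy stand-in for one layer's activations (not real model data):
    unit Gaussian noise plus a label-dependent shift along axis 0."""
    acts, labels = [], []
    for i in range(n):
        y = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if y == 1 else -signal
        acts.append(vec)
        labels.append(y)
    return acts, labels

rng = random.Random(0)
# Hypothetical signal strength per layer; real values would come from a model.
layer_signal = {0: 0.0, 1: 1.0, 2: 3.0}
accuracy_by_layer = {}
for layer, signal in layer_signal.items():
    train_x, train_y = synthetic_layer(rng, 200, 8, signal)
    test_x, test_y = synthetic_layer(rng, 200, 8, signal)
    direction, bias = fit_truth_direction(train_x, train_y)
    accuracy_by_layer[layer] = probe_accuracy(direction, bias, test_x, test_y)

print(accuracy_by_layer)
```

In a real experiment the synthetic layers would be replaced by hidden states extracted from a model at each layer, and a probe fitted on one task or prompt template would be evaluated on another to measure generalization.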

0 Citations

0 Influential

17 Altmetric

85.0 Score


