2604.13258v1 Apr 14, 2026 cs.CL

Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

Vishal Pramanik (citations: 14, h-index: 2)

Maisha Maliha (citations: 5, h-index: 2)

S. Jha (citations: 749, h-index: 15)

Nathaniel D. Bastian (citations: 2,395, h-index: 25)

Abstract

Attribution methods seek to explain language model predictions by quantifying the contribution of input tokens to generated outputs. However, most existing techniques are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models. To address these limitations, we propose Hessian-Enhanced Token Attribution (HETA), a novel attribution framework tailored for decoder-only language models. HETA combines three complementary components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that model second-order effects, and KL divergence to measure information loss when tokens are masked. This unified design produces context-aware, causally faithful, and semantically grounded attributions. Additionally, we introduce a curated benchmark dataset for systematically evaluating attribution quality in generative settings. Empirical evaluations across multiple models and datasets demonstrate that HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations, establishing a new standard for interpretability in autoregressive language models.
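To make two of the ingredients named in the abstract concrete, here is a minimal sketch on a toy model. It is not the paper's implementation: the tiny "decoder" (`toy_next_token_probs`), the zero-vector mask, the finite-difference curvature, and all names and numbers are assumptions chosen for illustration. The abstract does not say how HETA combines the components, so the two scores are shown separately, and the semantic transition vector is not sketched at all.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_next_token_probs(embeddings, weights):
    """Hypothetical stand-in for a decoder: tanh per token, mean-pool, linear, softmax."""
    dim = len(embeddings[0])
    n = len(embeddings)
    pooled = [sum(math.tanh(e[d]) for e in embeddings) / n for d in range(dim)]
    logits = [sum(w[d] * pooled[d] for d in range(dim)) for w in weights]
    return softmax(logits)

def kl_divergence(p, q):
    # KL(p || q); softmax outputs are strictly positive for these toy values.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def masking_kl_scores(embeddings, weights, mask_embedding):
    """Information lost when each token is replaced by a mask embedding (assumed masking scheme)."""
    full = toy_next_token_probs(embeddings, weights)
    scores = []
    for i in range(len(embeddings)):
        masked = embeddings[:i] + [mask_embedding] + embeddings[i + 1:]
        scores.append(kl_divergence(full, toy_next_token_probs(masked, weights)))
    return scores

def curvature_scores(embeddings, weights, target, eps=1e-3):
    """Second-order sensitivity: sum over dimensions of |d^2 log p(target) / d e[i][d]^2|,
    estimated with central finite differences (only the Hessian diagonal, by assumption)."""
    def logp(embs):
        return math.log(toy_next_token_probs(embs, weights)[target])
    base = logp(embeddings)
    scores = []
    for i, emb in enumerate(embeddings):
        total = 0.0
        for d in range(len(emb)):
            plus = [list(e) for e in embeddings]
            minus = [list(e) for e in embeddings]
            plus[i][d] += eps
            minus[i][d] -= eps
            total += abs((logp(plus) - 2.0 * base + logp(minus)) / (eps * eps))
        scores.append(total)
    return scores

# Illustrative inputs: three "tokens" with 2-d embeddings and a 3-word vocabulary.
embeddings = [[0.5, -0.2], [0.0, 0.0], [1.0, 0.8]]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
kl_scores = masking_kl_scores(embeddings, weights, mask_embedding=[0.0, 0.0])
curv_scores = curvature_scores(embeddings, weights, target=0)
```

In this toy setup the second token already equals the mask embedding, so masking it changes nothing and its KL score is zero, while the other two tokens receive positive scores. A real decoder-only model would need automatic differentiation rather than finite differences, and causal masking so that a token can only influence later positions.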

1 Citation

0 Influential

12.5 Altmetric

63.5 Score
