2605.05715v1 May 07, 2026 cs.AI

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

Abstract

Can linearly decodable failure signals in LLM hidden states be leveraged to correct those failures? We investigate this classification-correction gap via Overthinking (OT), a stable behavioral regime (Jaccard $\geq 0.81$, 94% inter-annotator agreement) in medical QA where models answer correctly under resampling yet fail in extended chain-of-thought. OT is linearly decodable at 71.6% balanced accuracy ($p < 10^{-16}$). Yet five families of fixed linear steering (29 configurations, $n = 1{,}273$) all yield $\Delta \approx 0$, with identical null results cross-architecture (Qwen2.5-7B) and cross-domain (MMLU-STEM). Three convergent lines of evidence suggest representational entanglement: the OT direction has 85-88% overlap with task-critical computation (specificity ratio $\leq 0.152$); non-targeted shared-direction steering damages accuracy ($-12.1$ pp); and LEACE concept erasure damages accuracy ($-3.6$ pp, $p = 0.01$), while 10 random erasures produce $\Delta = +0.3$ pp. The per-instance probe-steering correlation is $r = -0.002$ ($p = 0.97$). Positively, the same probe enables selective abstention (held-out AUROC $= 0.610$, exceeding all five uncertainty baselines, $p = 0.009$): decodable failure structure supports post-generation reliability estimation even when the fixed linear steering family cannot exploit it for correction.
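To make the abstract's vocabulary concrete, the following is a minimal sketch on synthetic data, not the paper's pipeline. It assumes a difference-of-means linear probe (the abstract does not say which probe was used), random Gaussian vectors standing in for hidden states, and hypothetical helper names (`fit_mean_diff_probe`, `steer`, and so on). It shows three things the abstract refers to: reading a failure label off a linear direction, adding a fixed vector to every hidden state, and abstaining on the examples the probe scores as most likely to fail.

```python
import numpy as np

def fit_mean_diff_probe(H, y):
    """Difference-of-means probe: unit direction v and threshold b (an assumed probe type)."""
    mu1, mu0 = H[y == 1].mean(axis=0), H[y == 0].mean(axis=0)
    v = (mu1 - mu0) / np.linalg.norm(mu1 - mu0)
    b = 0.5 * (mu1 + mu0) @ v
    return v, b

def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate."""
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    return 0.5 * (tpr + tnr)

def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    diff = scores[y_true == 1][:, None] - scores[y_true == 0][None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def steer(H, v, alpha):
    """Fixed linear steering: add the same vector alpha * v to every hidden state."""
    return H + alpha * v

# Synthetic stand-in for hidden states; y = 1 marks a "failure" example.
rng = np.random.default_rng(0)
d, n = 64, 2000
y = rng.integers(0, 2, size=n)
signal = rng.normal(size=d)
signal /= np.linalg.norm(signal)
H = rng.normal(size=(n, d)) + 1.5 * y[:, None] * signal

# Fit on the first half, evaluate on the held-out second half.
H_tr, y_tr, H_te, y_te = H[:1000], y[:1000], H[1000:], y[1000:]
v, b = fit_mean_diff_probe(H_tr, y_tr)
scores = H_te @ v - b
bacc = balanced_accuracy(y_te, (scores > 0).astype(int))
auc = auroc(y_te, scores)

# Steering along v shifts the probe readout by exactly alpha (v has unit norm).
steered_scores = steer(H_te, v, -2.0) @ v - b

# Selective abstention: drop the 20% of examples with the highest failure score.
keep = scores <= np.quantile(scores, 0.8)
acc_all = float(np.mean(y_te == 0))
acc_kept = float(np.mean(y_te[keep] == 0))

print(f"balanced accuracy={bacc:.3f}  AUROC={auc:.3f}")
print(f"accuracy without abstention={acc_all:.3f}  with abstention={acc_kept:.3f}")
```

In this toy setting the steered readout moves by construction, because the probe and the steering vector share the same direction. Whether the model's downstream behavior changes is a separate question, and it is the one the abstract reports a null result for.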
