2604.09839v1 Apr 10, 2026 cs.AI

Steered LLM Activations are Non-Surjective

Daniel Khashabi

Citations: 353

h-index: 9

Aayush Mishra

Citations: 64

h-index: 3

Anqi Liu

Citations: 67

h-index: 6

Original Abstract

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a pre-image under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
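
The surjectivity question in the abstract can be made concrete with a toy sketch. The code below is not the paper's method or model: it assumes a hypothetical tiny recurrent map (`forward`, with random `embed` and `mix` weights) as a stand-in for a forward pass, enumerates every state reachable from a short discrete prompt, and then checks whether a steered state (a reachable state plus `ALPHA` times a random unit direction) coincides with any of them. All names and sizes are illustrative assumptions.

```python
import itertools
import math
import random

random.seed(0)
VOCAB, DIM, MAX_LEN = 4, 8, 4  # toy sizes, chosen only for illustration

# Hypothetical toy "model": random token embeddings and one mixing matrix.
embed = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
mix = [[random.gauss(0, 1) / math.sqrt(DIM) for _ in range(DIM)]
       for _ in range(DIM)]

def forward(prompt):
    """Map a token sequence to a final hidden state (stand-in for a residual stream)."""
    h = [0.0] * DIM
    for tok in prompt:
        x = [h[i] + embed[tok][i] for i in range(DIM)]
        h = [math.tanh(sum(mix[i][j] * x[j] for j in range(DIM)))
             for i in range(DIM)]
    return h

def distance(a, b):
    """Euclidean distance between two states."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Every state reachable from a discrete prompt of length 1..MAX_LEN (a finite set).
reachable = [forward(p)
             for n in range(1, MAX_LEN + 1)
             for p in itertools.product(range(VOCAB), repeat=n)]

def nearest_prompt_distance(state):
    """Distance from a state to the closest prompt-reachable state."""
    return min(distance(state, r) for r in reachable)

# Steer one reachable state along a random unit direction.
base = forward((0, 1, 2))
direction = [random.gauss(0, 1) for _ in range(DIM)]
norm = math.sqrt(sum(d * d for d in direction))
ALPHA = 0.1  # steering strength (assumed value)
steered = [b + ALPHA * d / norm for b, d in zip(base, direction)]

unsteered_gap = nearest_prompt_distance(base)
steered_gap = nearest_prompt_distance(steered)
print(f"unsteered gap: {unsteered_gap:.6f}, steered gap: {steered_gap:.6f}")
```

In this toy setting the unsteered state has a pre-image by construction, while the steered state generically has none among the enumerated prompts. A finite enumeration like this only illustrates the intuition; the abstract's claim is a proof under practical assumptions about real LLMs, which this sketch does not reproduce.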

1 Citations

0 Influential

4.5 Altmetric

23.5 Score
