2604.11467v1 Apr 13, 2026 cs.AI

From Attribution to Action: A Human-Centered Application of Activation Steering

T. Labarta (Citations: 54; h-index: 3)
Maximilian Dreyer (Citations: 656; h-index: 12)
Wojciech Samek (Citations: 800; h-index: 13)
S. Lapuschkin (Citations: 10,873; h-index: 34)
Katharina Weitz (Citations: 951; h-index: 14)

Abstract

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.
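The workflow described in the abstract (attribute a prediction to SAE components, then steer one component and observe the model's response) can be illustrated with a toy sketch. This is not the authors' implementation: the random SAE weights, the linear readout `w_out` standing in for the rest of the model, the activation-times-gradient attribution rule, and the function names `encode`, `decode`, `attribute`, and `steer` are all assumptions made for illustration, and no CLIP model is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 16, 64

# Hypothetical toy SAE weights (random here; a real SAE would be trained).
W_enc = rng.normal(size=(d_model, n_latents)) / np.sqrt(d_model)
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model)) / np.sqrt(n_latents)
b_dec = np.zeros(d_model)

# Hypothetical linear readout standing in for the rest of the model.
w_out = rng.normal(size=d_model)


def encode(h):
    """ReLU SAE encoder: activation vector -> non-negative latent code."""
    return np.maximum(0.0, (h - b_dec) @ W_enc + b_enc)


def decode(z):
    """Linear SAE decoder: latent code -> reconstructed activation."""
    return z @ W_dec + b_dec


def attribute(h):
    """Activation-times-gradient score of each latent for the logit (assumed rule)."""
    z = encode(h)
    grad = W_dec @ w_out  # d(logit)/d(z_k) under the linear readout
    return z * grad


def steer(h, latent, scale):
    """Rescale one latent; the SAE reconstruction error is left unchanged."""
    z = encode(h)
    z_steered = z.copy()
    z_steered[latent] *= scale
    return h + decode(z_steered) - decode(z)


h = rng.normal(size=d_model)                 # one instance's activation
scores = attribute(h)
top = int(np.argmax(np.abs(scores)))         # most influential latent
h_suppressed = steer(h, top, scale=0.0)      # component suppression

logit_before = float(h @ w_out)
logit_after = float(h_suppressed @ w_out)
print(f"latent {top}: logit {logit_before:.3f} -> {logit_after:.3f}")
```

In this linear toy setting, suppressing a latent shifts the logit by exactly its attribution score, so the intervention confirms the explanation by construction. In a real nonlinear model the two can disagree, which is one reason observing the model's response to steering carries information beyond the attribution itself.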
