2602.02600v2 Feb 01, 2026 cs.LG

Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models

Amit Levi

Citations: 27

h-index: 3

Avi Mendelson

Citations: 47

h-index: 3

Eliron Rahimi

Citations: 4

h-index: 2

Elad Hirshel

Citations: 2

h-index: 1

Rom Himelstein

Citations: 23

h-index: 3

Chaim Baskin

Ben-Gurion University of the Negev

Citations: 1,165

h-index: 15

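Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering parallel decoding and controllable sampling dynamics while achieving competitive generation quality at scale. Despite this progress, how the sampling mechanism affects refusal behavior and jailbreak robustness is still not clearly understood. In this work, we present a foundational analytical framework for step-wise refusal dynamics that enables comparison between AR and diffusion sampling. Our analysis shows that the sampling strategy itself plays an important role in safety behavior, as a factor separate from the underlying learned representations. Building on this analysis, we introduce the Step-Wise Refusal Internal Dynamics (SRI) signal, which improves interpretability and the safety of both AR models and DLMs. We show that the geometric structure of SRI reflects internal recovery dynamics and identifies anomalous behavior in harmful generations as cases of "incomplete internal recovery" that cannot be observed at the text level. This structure enables lightweight inference-time detectors that generalize to unseen attacks and match or outperform existing defenses with more than 100 times lower inference overhead.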

Original Abstract

Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering parallel decoding and controllable sampling dynamics while achieving competitive generation quality at scale. Despite this progress, the role of sampling mechanisms in shaping refusal behavior and jailbreak robustness remains poorly understood. In this work, we present a fundamental analytical framework for step-wise refusal dynamics, enabling comparison between AR and diffusion sampling. Our analysis reveals that the sampling strategy itself plays a central role in safety behavior, as a factor distinct from the underlying learned representations. Motivated by this analysis, we introduce the Step-Wise Refusal Internal Dynamics (SRI) signal, which supports interpretability and improved safety for both AR and DLMs. We demonstrate that the geometric structure of SRI captures internal recovery dynamics, and identifies anomalous behavior in harmful generations as cases of "incomplete internal recovery" that are not observable at the text level. This structure enables lightweight inference-time detectors that generalize to unseen attacks while matching or outperforming existing defenses with over 100× lower inference overhead.

2 Citations

0 Influential

7.5 Altmetric

39.5 Score
