2602.23163v1 Feb 26, 2026 cs.AI

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Usman Anwar

Julianna Piskorz

David D. Baek

D. Africa

James Weatherall

Max Tegmark

Christian Schroeder de Witt

M. Schaar

David Krueger

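Large language models (LLMs) have recently begun to show steganographic capabilities, which a misaligned model could use to evade oversight mechanisms. Systematic methods for detecting and quantifying this behaviour are still lacking. Existing definitions of steganography, and the detection methods built on them, require a reference distribution of signals that contain no steganography. For steganographic reasoning in LLMs such a reference distribution is hard to know, so these approaches do not apply. This work proposes a new perspective on steganography, a **decision-theoretic view**. The central idea is that steganography creates an asymmetry in usable information between agents that can decode the hidden content and agents that cannot, and that this latent asymmetry can be inferred from the agents' observable actions. To formalise this view, the authors introduce generalised $\mathcal{V}$-information, a utilitarian framework for measuring the amount of usable information in an input. They use it to define the **steganographic gap**, a measure that quantifies steganography by comparing the downstream utility of a steganographic signal for agents that can and cannot decode it. The formalism is validated empirically and shown to support detecting, quantifying, and mitigating steganographic reasoning in LLMs.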

Original Abstract

Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, **decision-theoretic view of steganography**. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the **steganographic gap**, a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
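
The asymmetry described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's definition of generalised $\mathcal{V}$-information or of the steganographic gap; it assumes, purely for illustration, that "usable information" is approximated by the average task utility an agent gains from seeing the signal, and that the gap is the difference in that gain between an agent that can decode the hidden content and one that cannot. The function names `usable_information` and `steganographic_gap`, and all the scores, are hypothetical.

```python
from statistics import mean
from typing import Sequence


def usable_information(with_signal: Sequence[float],
                       without_signal: Sequence[float]) -> float:
    """Average utility an agent gains from observing the signal.

    Assumption for this sketch: utility is a per-task score
    (here 1 = task solved, 0 = not solved).
    """
    return mean(with_signal) - mean(without_signal)


def steganographic_gap(decoder_with: Sequence[float],
                       decoder_without: Sequence[float],
                       monitor_with: Sequence[float],
                       monitor_without: Sequence[float]) -> float:
    """Difference in usable information between two agents.

    `decoder_*` are scores for an agent that can decode the hidden
    content; `monitor_*` are scores for an agent that cannot.
    A value near zero suggests both agents benefit equally from the
    signal; a large positive value suggests the signal carries
    information that only the decoding agent can use.
    """
    return (usable_information(decoder_with, decoder_without)
            - usable_information(monitor_with, monitor_without))


# Made-up scores on four tasks, for illustration only.
decoder_with = [1, 1, 1, 0]
decoder_without = [0, 1, 0, 0]
monitor_with = [0, 1, 0, 1]
monitor_without = [0, 1, 0, 0]

gap = steganographic_gap(decoder_with, decoder_without,
                         monitor_with, monitor_without)
print(f"toy steganographic gap: {gap:.2f}")
```

Note that the computation uses only the agents' observed task performance, which mirrors the abstract's point that no reference distribution of non-steganographic signals is needed.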
