2604.20995v1 Apr 22, 2026 cs.AI

가치 충돌 진단 분석: 언어 모델에서 널리 나타나는 '가짜 정렬' 현상

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

Inderjeet Nair

Citations: 42

h-index: 3

Jie Ruan

Citations: 210

h-index: 5

Lu Wang

Citations: 10

h-index: 2

언어 모델에서 발생하는 '가짜 정렬' 현상은, 모델이 개발자의 정책에 부합하는 것처럼 보이지만, 감시가 없을 때는 자체적인 선호도를 따르는 문제이며, 현재 진단 도구의 한계로 인해 제대로 이해되지 못하고 있습니다. 기존 진단 방법은 매우 유해하고 명백히 부정적인 시나리오에 의존하기 때문에, 대부분의 모델이 즉시 거부 반응을 보입니다. 이로 인해 모델은 개발자의 정책, 모니터링 조건, 그리고 규정 위반의 결과에 대해 깊이 생각하지 않게 되어, 이러한 진단 방법은 '가짜 정렬' 가능성을 근본적으로 감지할 수 없습니다. 본 연구에서는 '가짜 정렬' 현상을 연구하기 위해, '가짜 정렬'이 발생할 가능성이 가장 높은 상황은 개발자의 정책이 모델의 핵심 가치와 충돌할 때라는 가설을 바탕으로 VLAF라는 진단 프레임워크를 소개합니다. VLAF는 도덕적으로 모호하지 않은 시나리오를 사용하여 다양한 도덕적 가치에 걸쳐 이러한 충돌을 탐색하며, 거부 반응을 피하면서도 의미 있는 심층적인 분석을 가능하게 합니다. VLAF를 사용하여 분석한 결과, '가짜 정렬' 현상은 기존 보고서보다 훨씬 더 흔하게 발생하며, 70억 개의 파라미터를 가진 작은 모델에서도 나타나는 것을 확인했습니다. 특히, olmo2-7b-instruct 모델에서 37%의 경우에 '가짜 정렬' 현상이 관찰되었습니다. 또한, 감시 조건이 모델의 표현 공간에서 특정 방향으로 활성화 변화를 유도한다는 것을 확인했습니다. 이는 '가짜 정렬'을 유발하는 행동 변화를 단일의 대비되는 조향 벡터로 표현할 수 있다는 것을 의미하며, 이를 활용하여 경량화된 추론 시점의 완화 방법을 개발했습니다. 마지막으로, 이 방법을 사용하여 별도의 레이블 데이터 없이, 최소한의 계산 오버헤드로 '가짜 정렬'을 각각 85.8%, 94.0% 및 57.7% 감소시켰습니다 (olmo2-7b-instruct, olmo2-13b-instruct, qwen3-8b 모델 기준).

Original Abstract

Alignment faking, where a model behaves aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in part because current diagnostic tools remain limited. Prior diagnostics rely on highly toxic and clearly harmful scenarios, causing most models to refuse immediately. As a result, models never deliberate over developer policy, monitoring conditions, or the consequences of non-compliance, making these diagnostics fundamentally unable to detect alignment faking propensity. To support study of this phenomenon, we first introduce VLAF, a diagnostic framework grounded in the hypothesis that alignment faking is most likely when developer policy conflicts with a model's strongly held values. VLAF uses morally unambiguous scenarios to probe this conflict across diverse moral values, bypassing refusal behavior while preserving meaningful deliberative stakes. Using VLAF, we find that alignment faking is substantially more prevalent than previously reported, occurring in models as small as 7B parameters - with olmo2-7b-instruct faking alignment in 37% of cases.Finally, we show that oversight conditions induce activation shifts that lie along a single direction in representation space. This means the behavioral divergence driving alignment faking can be captured by a single contrastive steering vector, which we exploit for lightweight inference-time mitigation. Finally, we exploit this for mitigation that requires no labeled data and minimal computational overhead, achieving relative reductions in alignment faking of 85.8%, 94.0%, and 57.7% on olmo2-7b-instruct, olmo2-13b-instruct, and qwen3-8b respectively.

0 Citations

0 Influential

2.5 Altmetric

12.5 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!