2601.23086v1 Jan 30, 2026 cs.AI

출력 감독을 통해 학습된 생각의 사슬(Chain-of-thought) 난독화는 보지 못한 작업으로 일반화될 수 있다

Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks

N. Hadida

Citations: 2

h-index: 1

Sassan Bhanji

Citations: 1

h-index: 1

Cameron Tice

Citations: 32

h-index: 2

Puria Radmard

Citations: 71

h-index: 4

생각의 사슬(Chain-of-thought, CoT) 추론은 거대언어모델(LLM)이 자신의 행동을 계획, 탐색 및 숙고할 수 있게 함으로써 상당한 성능 향상을 제공합니다. CoT는 또한 이러한 에이전트의 행동을 모니터링하는 강력한 도구이기도 합니다. CoT가 충실하게 수행될 경우, 이는 모델의 의사 결정 과정에 대한 해석을 제공하고 위험한 행동에 대한 조기 경고 신호 역할을 합니다. 그러나 CoT에 가해지는 최적화 압력은 모델이 추론 흔적을 난독화하게 하여 이러한 유익한 특성을 잃게 만들 수 있습니다. 우리는 난독화가 여러 작업에 걸쳐 일반화될 수 있음을 보여줍니다. 보상 해킹(예: 유출된 정보에 접근하여 활용하는 것)과 관련된 추론을 난독화하도록 학습한 모델은 보상 해킹 행동과 CoT 내에서의 난독화 모두를 이전에 보지 못한 보상 해킹 환경으로 일반화합니다. 가장 우려스러운 점은, 모델의 CoT가 종료된 후 최종 행동에 대해서만 페널티를 부과할 때도 CoT 추론의 난독화와 작업 간의 일반화가 발생한다는 것입니다. 우리의 연구 결과는 해로운 생성을 처벌하는 현재의 관행이 의도치 않게 예측할 수 없는 방식으로 LLM의 광범위한 모니터링 가능성을 감소시킬 수 있음을 시사합니다.

Original Abstract

Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs by enabling planning, exploration, and deliberation of their actions. CoT is also a powerful tool for monitoring the behaviours of these agents: when faithful, they offer interpretations of the model's decision making process, and an early warning sign for dangerous behaviours. However, optimisation pressures placed on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property. We show that obfuscation can generalise across tasks; models that learn to obfuscate reasoning involving reward hacking (e.g. accessing and utilising leaked information) generalise both the reward hacking behaviour and its obfuscation in CoT to unseen reward hacking settings. Most worryingly, we show that obfuscation of CoT reasoning, and its generalisation across tasks, also follows when we penalise only the model's final actions after closing its CoT. Our findings suggest that current practices of penalising harmful generations may inadvertently lead to a reduction in the broader monitorability of LLMs in unpredictable ways.

1 Citations

0 Influential

2 Altmetric

11.0 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!