2603.05494v1 Mar 05, 2026 cs.LG

검열된 LLM을 활용한 숨겨진 정보 추출을 위한 자연스러운 실험 환경

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Neel Nanda

Citations: 10,183

h-index: 35

Helena Casademunt

Citations: 25

h-index: 2

Bartosz Cywi'nski

Citations: 80

h-index: 5

K. Tran

Citations: 4

h-index: 1

Arya Jakkli

Citations: 1

h-index: 1

Samuel Marks

Citations: 47

h-index: 5

대규모 언어 모델(LLM)은 때때로 잘못되거나 오해를 불러일으키는 답변을 생성하는 경우가 있습니다. 이러한 문제에 대한 두 가지 접근 방식은 다음과 같습니다. 첫째, 모델이 진실을 말하도록 프롬프트나 가중치를 수정하는 '정직성 유도'이고, 둘째, 주어진 답변이 거짓인지 여부를 판단하는 '거짓말 탐지'입니다. 기존 연구에서는 특정 모델을 거짓말하거나 정보를 숨기도록 훈련시킨 모델을 사용하여 이러한 방법을 평가했지만, 이러한 인위적인 구성은 자연스러운 부정직과 유사하지 않을 수 있습니다. 본 연구에서는 중국 개발자가 개발한 오픈 웨이트 LLM을 사용하여 정치적으로 민감한 주제에 대한 검열을 수행하도록 훈련된 모델을 연구합니다. Qwen3 모델은 종종 팔룬궁이나 천안문 시위와 같은 주제에 대해 허위 정보를 생성하는 동시에 때때로 올바르게 답변하는데, 이는 모델이 억압하도록 훈련된 지식을 보유하고 있음을 나타냅니다. 이러한 모델을 실험 환경으로 사용하여 다양한 정보 추출 및 거짓말 탐지 기술을 평가합니다. 정직성 유도를 위해 채팅 템플릿 없이 샘플링, few-shot 프롬프팅 및 일반적인 정직성 데이터에 대한 미세 조정이 가장 효과적으로 진실성 있는 답변을 증가시키는 것으로 나타났습니다. 거짓말 탐지의 경우, 검열된 모델에 자신의 답변을 분류하도록 지시하면 검열되지 않은 모델의 상한선에 가까운 성능을 보이며, 관련 없는 데이터로 훈련된 선형 프로브는 더 저렴한 대안을 제공합니다. 가장 강력한 정직성 유도 기술은 DeepSeek R1을 포함한 최첨단 오픈 웨이트 모델에도 적용됩니다. 주목할 점은 어떤 기술도 거짓 답변을 완전히 제거하지 못합니다. 본 연구에서는 사용된 모든 프롬프트, 코드 및 트랜스크립트를 공개합니다.

Original Abstract

Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.

1 Citations

0 Influential

17.5 Altmetric

88.5 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!