2602.23971v1 Feb 27, 2026 cs.HC

묻고 답하라: 대규모 언어 모델의 아첨 현상 감소

Ask don't tell: Reducing sycophancy in large language models

Magda Dubois

Citations: 335

h-index: 6

Christopher Summerfield

Citations: 93

h-index: 6

Lennart Luettgau

Citations: 248

h-index: 9

C. Ududec

Citations: 532

h-index: 11

아첨, 즉 대규모 언어 모델이 비판적인 상호작용보다 사용자에게 긍정적인 답변을 선호하는 경향은 중요한 조언 및 사회적 맥락에서 발생하는 정렬 실패로 간주됩니다. 기존 연구에서는 아첨과 관련된 대화적 특징이 보고되었지만, 인공지능 아첨을 유발하거나 방지하는 요인에 대한 체계적인 이해는 부족합니다. 본 연구에서는 통제된 실험 연구를 통해 먼저 입력 프레임이 아첨에 미치는 영향을 분석하고, 두 번째로 이러한 결과를 활용하여 완화 전략을 개발합니다. 본 연구는 세 가지 직교적인 요인(인지적 확실성(진술, 신념, 확신), 관점(자기 관점 vs 사용자 관점), 긍정 vs 부정)을 변경한 다양한 질문이 아닌 질문들과의 비교를 통해 설계된 실험을 수행합니다. 연구 결과, (1) 질문이 아닌 질문에 대한 응답에서 아첨이 현저히 높게 나타나고, (2) 사용자가 전달하는 인지적 확실성이 증가함에 따라 아첨이 단조롭게 증가하며, (3) 자기 관점 프레임이 아첨을 증폭시키는 것을 확인했습니다. 이러한 결과를 바탕으로, 모델이 답변하기 전에 질문이 아닌 질문을 질문으로 변환하도록 요청하면 아첨이 크게 감소한다는 것을 보여줍니다. 주목할 점은 이러한 효과가 단순히 모델에게 "아첨하지 말라"고 지시하는 기본적인 프롬프트보다 더 강력합니다. 본 연구는 개발자와 사용자가 쉽게 적용할 수 있는 실용적이고 효과적인 입력 수준의 완화 방법을 제시합니다.

Original Abstract

Sycophancy, the tendency of large language models to favour user-affirming responses over critical engagement, has been identified as an alignment failure, particularly in high-stakes advisory and social contexts. While prior work has documented conversational features correlated with sycophancy, we lack a systematic understanding of what provokes or prevents AI sycophancy. Here, we present a set of controlled experimental studies where we first isolate how input framing influences sycophancy, and second, leverage these findings to develop mitigation strategies. In a nested factorial design, we compare questions to various non-questions where we vary three orthogonal factors: epistemic certainty (statement, belief, conviction), perspective (I- vs user-perspective), and affirmation vs negation. We show that (1) sycophancy is substantially higher in response to non-questions compared to questions. Additionally, we find that (2) sycophancy increases monotonically with epistemic certainty conveyed by the user, and (3) is amplified by I-perspective framing. Building on this, we show that asking a model to convert non-questions into questions before answering significantly reduces sycophancy. Importantly, this effect is stronger than a simple baseline prompt asking models "not to be sycophantic". Our work offers a practical and effective input-level mitigation that both developers and users can easily adopt.

2 Citations

0 Influential

5.5 Altmetric

29.5 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!