2604.27093v1 Apr 29, 2026 cs.CL

Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations

M. Sap (Citations: 275, h-index: 7)

Mingqian Zheng (Citations: 14, h-index: 1)

Malia Morgan (Citations: 15, h-index: 1)

Liwei Jiang (Citations: 157, h-index: 6)

Carolyn Rose (Citations: 1, h-index: 1)

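Current LLM safety alignment techniques improve model robustness against adversarial attacks, but they overlook whether and how an LLM recovers helpfulness when a benign user clarifies their intent. In this work we introduce CarryOnBench, the first interactive benchmark that measures whether LLMs revise their interpretation of user intent and recover utility over multi-turn conversations. Starting from 398 queries that look harmful but carry benign intent, we simulate 5,970 conversations by varying the user's follow-up sequences, and we evaluate 14 models on intent-aligned utility and on safety. CarryOnBench yields 1,866 distinct conversation flows of 4 to 12 turns, with 23,880 model responses in total. We design Ben-Util, a checklist-based metric that scores how well each model response meets the user's benign information need. At the first turn, models meet only 10.5% to 37.6% of that need. When the same query states the benign intent explicitly, models meet 25.1% to 72.1%, which confirms that models withhold information because they misread intent, not because they lack knowledge. When benign clarifications are added across a multi-turn conversation, 13 of the 14 models approach or exceed this single-turn baseline, but the cost of recovery differs by model. We identify three failure modes that single-turn evaluation cannot detect. The first is utility lock-in, where the model barely updates despite clarification. The second is unsafe recovery, where the model updates at a severe cost to safety. The third is repetitive recovery, where the model reuses its earlier responses. In addition, conversations tend to converge to similar levels of harmfulness no matter how conservatively the model starts. These results expose a gap that single-turn evaluation misses: whether a model is exercising appropriate caution or is simply failing to respond to clarified user intent.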

Original Abstract

Current LLM safety alignment techniques improve model robustness against adversarial attacks, but overlook whether and how LLMs can recover helpfulness when benign users clarify their intent. We introduce CarryOnBench, the first interactive benchmark that measures whether LLMs can revise their interpretation of user intent and recover utility, while remaining safe through multi-turn conversations. Starting from 398 seemingly harmful queries with benign underlying intents, we simulate 5,970 conversations by varying user follow-up sequences, evaluating 14 models on both intent-aligned utility and safety. CarryOnBench yields 1,866 different conversation flows of 4-12 turns, totaling 23,880 model responses. We design Ben-Util, a checklist-based metric that evaluates how well each model response fulfills the user's benign information need using atomic items. At turn one, models fulfill only 10.5-37.6% of the user's benign information need. When the same query includes the benign intent upfront, models fulfill 25.1-72.1%, confirming that models withhold information due to intent misinterpretation, not limited knowledge. With benign clarifications in multi-turn conversations, 13 of 14 models approach or exceed this single-turn baseline, yet recovery cost varies across models. We identify three failure modes invisible to single-turn evaluations: utility lock-in, where a model rarely updates despite clarification; unsafe recovery, where a model updates at disproportionate safety cost; and repetitive recovery, where a model recycles prior responses rather than providing new information. Moreover, conversations converge to similar harmfulness levels regardless of how conservative the model starts. These findings expose a gap that single-turn evaluations miss: whether a model is appropriately cautious or simply unresponsive to clarified user intent.
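
The abstract describes Ben-Util only as a checklist-based metric over atomic items and does not give its formula. As a rough illustration, the sketch below assumes the score is simply the fraction of checklist items a response fulfills, and compares a conversation's last-turn score against the single-turn baseline in which the intent is stated upfront. The names `ChecklistItem`, `checklist_utility`, and `recovery_gap` are hypothetical and do not come from the paper; the paper's actual scoring and judging procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """One atomic piece of information the benign user needs (hypothetical structure)."""
    description: str
    fulfilled: bool  # assumed to be decided by a human or model judge


def checklist_utility(items: list[ChecklistItem]) -> float:
    """Fraction of checklist items fulfilled by a response, in [0, 1] (assumed formula)."""
    if not items:
        raise ValueError("checklist must contain at least one item")
    return sum(item.fulfilled for item in items) / len(items)


def recovery_gap(per_turn_utility: list[float], upfront_baseline: float) -> float:
    """Last-turn utility minus the intent-upfront single-turn baseline.

    A value at or above zero means the conversation reached or exceeded
    the baseline; this comparison is an illustrative assumption.
    """
    if not per_turn_utility:
        raise ValueError("conversation must contain at least one scored turn")
    return per_turn_utility[-1] - upfront_baseline


if __name__ == "__main__":
    # Toy checklist with made-up judgments, for illustration only.
    items = [
        ChecklistItem("explains the general risk", True),
        ChecklistItem("names safe handling steps", False),
        ChecklistItem("points to an official resource", True),
        ChecklistItem("answers the specific question", False),
    ]
    print(checklist_utility(items))
    print(recovery_gap([0.25, 0.5, 0.75], upfront_baseline=0.5))
```

Under this reading, a model showing utility lock-in would have a per-turn score that stays nearly flat across turns, so its gap to the baseline would remain negative despite clarification.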

0 Citations

0 Influential

3.5 Altmetric

17.5 Score
