2603.11394v1 Mar 12, 2026 cs.CL

Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning

Juming Xiong (Citations: 128, h-index: 6)

Kevin H. Guo (Citations: 3, h-index: 1)

Avinash Baidya (Citations: 76, h-index: 5)

Katherine E. Brown (Citations: 53, h-index: 3)

Zhijun Yin (Citations: 239, h-index: 8)

Bradley Malin (Citations: 40, h-index: 3)

Xiang Gao (Citations: 88, h-index: 5)

Chao Yan (Citations: 74, h-index: 4)

Abstract

Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a "stick-or-switch" evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently degrade performance when compared to single-shot baselines. Notably, models frequently abandon initial correct diagnoses and safe abstentions to align with incorrect user suggestions. Additionally, several models exhibit blind switching, failing to distinguish between signal and incorrect suggestions.
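
The abstract names conviction and flexibility but does not give formulas, so the following is only an illustrative sketch of how such rates could be tallied from per-conversation outcomes. The `Trial` record, its field names, and the `switch_gap` quantity are assumptions made for this example, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One hypothetical stick-or-switch conversation outcome."""
    initial_correct: bool     # first answer was correct (or a safe abstention)
    suggestion_correct: bool  # the user's follow-up suggestion was correct
    switched: bool            # the model adopted the user's suggestion


def rate(hits, total):
    # Return NaN instead of raising when a category has no trials.
    return hits / total if total else float("nan")


def stick_or_switch_summary(trials):
    # Conviction (assumed): share of correct initial answers that the model
    # kept when the user pushed an incorrect suggestion.
    defend = [t for t in trials if t.initial_correct and not t.suggestion_correct]
    conviction = rate(sum(not t.switched for t in defend), len(defend))

    # Flexibility (assumed): share of incorrect initial answers that the model
    # revised when the user offered the correct suggestion.
    accept = [t for t in trials if not t.initial_correct and t.suggestion_correct]
    flexibility = rate(sum(t.switched for t in accept), len(accept))

    # Illustrative only: switch rate on correct suggestions minus switch rate
    # on incorrect ones. A value near zero would be consistent with switching
    # regardless of suggestion quality.
    switch_gap = flexibility - (1.0 - conviction)
    return {
        "conviction": conviction,
        "flexibility": flexibility,
        "switch_gap": switch_gap,
    }


if __name__ == "__main__":
    demo = [
        Trial(initial_correct=True, suggestion_correct=False, switched=False),
        Trial(initial_correct=True, suggestion_correct=False, switched=True),
        Trial(initial_correct=False, suggestion_correct=True, switched=True),
        Trial(initial_correct=False, suggestion_correct=True, switched=True),
    ]
    print(stick_or_switch_summary(demo))
```

Under these assumed definitions, a model that always defers to the user would score high on flexibility and low on conviction, which is why the two rates are reported separately here.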
