2601.01627v1 Jan 04, 2026 cs.CL

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

Yusuke Iwasawa (Citations: 10,187; h-index: 21)
Yutaka Matsuo (Citations: 1,364; h-index: 11)
Junyu Liu (Citations: 283; h-index: 9)
Qian Niu (Citations: 323; h-index: 10)
Zequn Zhang (Citations: 12; h-index: 1)
Y. Xun (Citations: 2; h-index: 1)
Wenlong Hou (Citations: 5; h-index: 1)
Shujun Wang (Citations: 2; h-index: 1)
Kan Hatakeyama-Sato (Citations: 105; h-index: 5)
Zirui Li (Citations: 404; h-index: 5)

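As large language models (LLMs) see growing use in healthcare, it is important to evaluate their medical safety carefully before clinical use. Existing safety benchmarks, however, are mostly English-centric and assess only answers to single questions, without accounting for multi-turn clinical consultations. To close this gap, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating the medical safety of LLMs in the Japanese healthcare setting. The benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated with seven automatically discovered jailbreak strategies. Evaluating 27 models with a dual-LLM scoring protocol shows that commercial models maintain a high level of safety, whereas medical-specialized models are more vulnerable. Safety scores also dropped substantially as the number of conversation turns increased (median: 9.5 to 5.0, p < 0.001). Cross-lingual evaluation on both the Japanese and English versions of the benchmark shows that the vulnerability of medical models is not confined to one language and instead reflects inherent alignment limitations. These results suggest that domain-specific fine-tuning can unintentionally weaken safety mechanisms, and that multi-turn interaction is a separate threat surface that calls for dedicated alignment strategies.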

Original Abstract

As Large Language Models (LLMs) are increasingly deployed in healthcare, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly English-centric and test only single-turn prompts, even though clinical consultations are multi-turn. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating the medical safety of LLMs for Japanese healthcare. Our benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, $p < 0.001$). Cross-lingual evaluation on both the Japanese and English versions of our benchmark reveals that medical model vulnerabilities persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may inadvertently weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.
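The per-turn analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two judge scores are combined, so averaging them is an assumption, as are the function names `combine_judges` and `median_by_turn`, the 0 to 10 score scale, and the toy scores, which are invented purely for illustration.

```python
from statistics import median


def combine_judges(score_a: float, score_b: float) -> float:
    """Combine two judge scores by averaging (an assumed rule, not from the paper)."""
    return (score_a + score_b) / 2.0


def median_by_turn(conversations):
    """Return {turn_index: median combined score} across conversations.

    `conversations` is a list of conversations; each conversation is a list
    of (judge_a_score, judge_b_score) tuples, one tuple per turn.
    Conversations may have different lengths.
    """
    per_turn = {}
    for conv in conversations:
        for turn_idx, (a, b) in enumerate(conv, start=1):
            per_turn.setdefault(turn_idx, []).append(combine_judges(a, b))
    return {turn: median(scores) for turn, scores in sorted(per_turn.items())}


# Hypothetical toy scores on an assumed 0-10 scale (not data from the paper).
toy = [
    [(8, 8), (6, 6), (3, 3)],
    [(9, 7), (5, 5), (4, 2)],
    [(10, 8), (7, 5)],
]

for turn, score in median_by_turn(toy).items():
    print(f"turn {turn}: median safety score {score}")
```

A falling median from the first turn to the last is the pattern the abstract reports. Testing whether such a decline is statistically significant would need a separate test that this sketch does not cover.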

1 Citation

1 Influential

10.5 Altmetric

55.5 Score
