2601.18496v1 Jan 26, 2026 cs.AI

DEEPMED: Building a Medical DeepResearch Agent via Multi-hop Med-Search Data and Turn-Controlled Agentic Training & Inference

Zihan Wang, Hao Wang, Shi Feng, Xiaocui Yang, Daling Wang, Yiqun Zhang, Jinghao Lin, Haihua Yang, Xiaozhong Ji

Original Abstract

Medical reasoning models remain constrained by parametric knowledge and are thus susceptible to forgetting and hallucinations. DeepResearch (DR) models ground outputs in verifiable evidence from tools and perform strongly in general domains, but their direct transfer to the medical field yields relatively limited gains. We attribute this to two gaps: task characteristics and tool-use scaling. Medical questions require evidence interpretation in a knowledge-intensive clinical context; while general DR models can retrieve information, they often lack clinical-context reasoning and thus "find it but fail to use it," leaving performance limited by medical abilities. Moreover, in medical scenarios, blindly scaling tool calls can inject noisy context, derailing sensitive medical reasoning and prompting repetitive evidence-seeking along incorrect paths. Therefore, we propose DeepMed. For data, we deploy a multi-hop med-search QA synthesis method that helps the model apply the DR paradigm in medical contexts. For training, we introduce a difficulty-aware turn-penalty to suppress excessive tool-call growth. For inference, we introduce a monitor that helps validate hypotheses within a controlled number of steps and avoid context rot. Overall, on seven medical benchmarks, DeepMed improves its base model by 9.79% on average and outperforms larger medical reasoning and DR models.
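The abstract does not specify the form of the difficulty-aware turn-penalty, so the following is only an illustrative sketch of how such a reward could be shaped, not the authors' formulation. It assumes that difficulty is estimated as one minus the pass rate over a group of rollouts for the same question, and that the per-turn penalty shrinks as difficulty grows. The names `Rollout`, `shaped_rewards`, `max_turns`, and `base_penalty` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    correct: bool  # whether the final answer matched the reference
    turns: int     # number of tool-call turns the agent used


def shaped_rewards(rollouts, max_turns=16, base_penalty=0.5):
    """Hypothetical reward shaping for one question's group of rollouts.

    Assumption: difficulty = 1 - pass rate within the group. Easy questions
    (high pass rate) receive a stronger turn penalty; hard questions receive
    a weaker one, so extra tool calls are tolerated where they may be needed.
    """
    if not rollouts:
        return []
    pass_rate = sum(r.correct for r in rollouts) / len(rollouts)
    difficulty = 1.0 - pass_rate
    penalty_weight = base_penalty * (1.0 - difficulty)
    rewards = []
    for r in rollouts:
        # Fraction of the turn budget consumed, capped at 1.0.
        turn_fraction = min(r.turns, max_turns) / max_turns
        accuracy = 1.0 if r.correct else 0.0
        rewards.append(accuracy - penalty_weight * turn_fraction)
    return rewards
```

Under these assumptions, a question that every rollout answers correctly penalizes long trajectories most strongly, while a question that no rollout solves applies no turn penalty at all, which matches the stated goal of suppressing excessive tool-call growth without discouraging search on hard cases.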

