2603.13168v1 Mar 13, 2026 cs.AI

Developing and evaluating a chatbot to support maternal health care

Smriti Jha (Citations: 192, h-index: 3)

Vidhi Jain (Citations: 1, h-index: 1)

Jianyu Xu (Citations: 29, h-index: 2)

Grace Liu (Citations: 114, h-index: 3)

Sowmya Ramesh (Citations: 1, h-index: 1)

Jitender Nagpal (Citations: 16, h-index: 2)

Gretchen B. Chapman (Citations: 610, h-index: 11)

Ben Bellows (Citations: 2, h-index: 1)

Siddhartha Goyal (Citations: 19, h-index: 3)

Aarti Singh (Citations: 3, h-index: 1)

Bryan Wilder (Citations: 2, h-index: 1)

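Delivering trustworthy maternal health information through phone-based chatbots can have a large impact, especially in settings where health literacy is low and access to care is limited. Building such systems is technically difficult, however: user questions are short, underspecified, and mix several languages; answers require an understanding of regional context; and incomplete or missing symptom information makes safe routing decisions hard. This paper presents a maternal health chatbot for India, developed jointly by academic researchers, a health technology company, a public health nonprofit, and a hospital. The system (1) uses stage-aware triage to route high-risk questions to expert templates, (2) applies hybrid retrieval over curated maternal and newborn guidelines, and (3) generates evidence-grounded answers with a large language model (LLM). The paper's core contribution is an evaluation process for deploying the system in a high-stakes setting under limited expert supervision. To assess both component-level and end-to-end performance, the authors (i) built a labeled triage benchmark of 150 items, on which the system achieved 86.7% emergency recall, and explicitly report the trade-off between missed emergencies and over-escalation; (ii) built a synthetic multi-evidence retrieval benchmark of 100 items with chunk-level evidence labels; (iii) ran an LLM-as-judge comparison on 781 real queries using criteria co-designed with clinicians; and (iv) carried out expert validation. The results show that building a trustworthy medical assistant for multilingual, noisy settings requires combining defense-in-depth design with multiple evaluation methods; no single model or evaluation method is sufficient on its own.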
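The summary does not say how the hybrid retrieval step combines its result lists. As a minimal sketch, assuming a lexical ranking and a dense (embedding) ranking are merged with reciprocal rank fusion, the merge could look like the following. The fusion method, the function name `reciprocal_rank_fusion`, the constant `k=60`, and the chunk ids are all illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one ranking.

    Each chunk gets 1 / (k + rank) from every list it appears in
    (rank starts at 1); chunks are returned by descending total score.
    Ties are broken by chunk id so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical chunk ids returned by two retrievers for one query.
lexical_hits = ["c3", "c1", "c7"]  # e.g. keyword-based ranking
dense_hits = ["c1", "c9", "c3"]    # e.g. embedding-based ranking

fused = reciprocal_rank_fusion([lexical_hits, dense_hits])
print(fused)
```

Chunks that both retrievers rank highly rise to the top, while a chunk found by only one retriever still stays in the candidate list. That property is useful when queries are short and code-mixed and either retriever alone may miss relevant guideline text.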

Original Abstract

The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than any single model and evaluation method choice.
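The trade-off between missed emergencies and over-escalation reported for the triage benchmark can be made concrete with two rates computed from binary labels. The sketch below is illustrative only: the function name, the boolean label encoding, and the toy data are assumptions, and the numbers it produces are unrelated to the paper's results.

```python
from dataclasses import dataclass


@dataclass
class TriageMetrics:
    emergency_recall: float      # share of true emergencies that were escalated
    over_escalation_rate: float  # share of non-emergencies that were escalated
    missed_emergencies: int      # true emergencies that were not escalated


def triage_metrics(gold, predicted):
    """Compare gold and predicted labels, where True means 'emergency'."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    fp = sum(1 for g, p in pairs if not g and p)
    tn = sum(1 for g, p in pairs if not g and not p)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    over_escalation = fp / (fp + tn) if (fp + tn) else 0.0
    return TriageMetrics(recall, over_escalation, fn)


# Toy labels for ten queries (not data from the paper).
gold = [True, True, True, True, False, False, False, False, False, False]
pred = [True, True, True, False, True, False, False, False, False, False]

metrics = triage_metrics(gold, pred)
print(metrics)
```

Reporting both rates together matters because a triage rule can raise emergency recall simply by escalating more queries, which shifts load onto the expert templates without the cost being visible in recall alone.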

