2602.21165v1 Feb 24, 2026 cs.CL

PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data

S. Fodeh (citations: 698, h-index: 17)
Linhai Ma (citations: 8, h-index: 2)
Srivani Talakokkul (citations: 7, h-index: 2)
Ganesh Puthiaraju (citations: 5, h-index: 1)
Afshan Khan (citations: 5, h-index: 1)
Sarah Lowe (citations: 3, h-index: 1)
A. Roundtree (citations: 403, h-index: 10)
Yan Wang (citations: 4, h-index: 1)
Ashley K. Hagaman (citations: 2,129, h-index: 26)

Abstract

Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor-intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer performs strongly across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, with F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.
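As a rough illustration of the setup the abstract describes, the sketch below shows one way a message could be augmented with author identity and topic terms before encoding, and how a multi-label F1 score could be computed. Everything here is an assumption for illustration: the marker format in `augment_input`, the label names, and the choice of micro-averaging are hypothetical, since the abstract does not specify the input template, the label inventory, or the averaging scheme. No BERT encoder is involved; only the input construction and the metric are sketched.

```python
from typing import Iterable, List, Set


def augment_input(text: str, author: str, topic_terms: Iterable[str]) -> str:
    """Prefix a message with author and topic markers.

    The marker format is hypothetical; the abstract only says that topic
    representations and author identity enrich the model input.
    """
    topics = ", ".join(topic_terms)
    return f"[AUTHOR] {author} [TOPIC] {topics} [TEXT] {text}"


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over multi-label predictions (averaging is assumed)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Pool true positives, false positives, and false negatives over all messages.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical label names; the real Code/Subcode inventory is not in the abstract.
gold = [{"InfoSeeking", "SDoH"}, {"Emotion"}]
pred = [{"InfoSeeking"}, {"Emotion", "SDoH"}]

example = augment_input(
    "Can I get a ride to my appointment?",
    "patient",
    ["transportation", "appointment"],
)
score = micro_f1(gold, pred)
```

In this toy case each message may carry several labels at once, which is why predictions are modeled as sets rather than single classes. The same metric function could be applied separately at the Code, Subcode, and Combo levels.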

Citations: 1 | Influential: 0 | Altmetric: 13 | Score: 66.0
