2602.15852v2 Jan 24, 2026 cs.CL

Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints

Kai Zheng (citations: 46, h-index: 3)

Ha Na Cho (citations: 2, h-index: 1)

Alexander Lopez (citations: 3, h-index: 1)

Hansen Bow (citations: 34, h-index: 3)

Sairam Sutari (citations: 1, h-index: 1)

Abstract

Clinical natural language processing (NLP) models have shown promise for supporting hospital discharge planning by leveraging narrative clinical documentation. However, note-based models are particularly vulnerable to temporal and lexical leakage, where documentation artifacts encode future clinical decisions and inflate apparent predictive performance. Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety. This study focuses on system-level design choices required to build safe and deployable clinical NLP under temporal leakage constraints. We present a lightweight auditing pipeline that integrates interpretability into the model development process to identify and suppress leakage-prone signals prior to final training. Using next-day discharge prediction after elective spine surgery as a case study, we evaluate how auditing affects predictive behavior, calibration, and safety-relevant trade-offs. Results show that audited models exhibit more conservative and better-calibrated probability estimates, with reduced reliance on discharge-related lexical cues. These findings emphasize that deployment-ready clinical NLP systems should prioritize temporal validity, calibration, and behavioral robustness over optimistic performance.
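The abstract describes the auditing pipeline only at a high level, so the following is a minimal sketch of the general idea and not the authors' implementation. It assumes a simple smoothed log-odds score in place of whatever interpretability method the paper uses, and the lexicon, threshold, function names, and toy notes are all hypothetical. It shows the two steps the abstract names: identify tokens that are strongly associated with the label and look like discharge-decision language, then suppress them in the notes before final training.

```python
import math
import re
from collections import Counter

# Hypothetical lexicon of terms assumed to reveal a discharge decision that
# has already been made. Illustrative only; not taken from the paper.
LEAKAGE_LEXICON = {"discharge", "discharged", "dc", "home"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def log_odds(notes, labels, smoothing=1.0):
    """Smoothed log-odds of a token occurring in positive vs. negative notes."""
    pos, neg = Counter(), Counter()
    for text, y in zip(notes, labels):
        # Count each token once per note (document frequency).
        (pos if y == 1 else neg).update(set(tokenize(text)))
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    scores = {}
    for tok in set(pos) | set(neg):
        p = (pos[tok] + smoothing) / (n_pos + 2 * smoothing)
        q = (neg[tok] + smoothing) / (n_neg + 2 * smoothing)
        scores[tok] = math.log(p / (1 - p)) - math.log(q / (1 - q))
    return scores


def flag_leakage(scores, lexicon=LEAKAGE_LEXICON, threshold=1.0):
    """Return lexicon tokens whose association with the label is strong."""
    return sorted(t for t, s in scores.items()
                  if t in lexicon and abs(s) >= threshold)


def mask(text, flagged, placeholder="[MASKED]"):
    """Replace flagged tokens so a downstream model cannot rely on them."""
    flagged = set(flagged)
    return " ".join(placeholder if t in flagged else t for t in tokenize(text))


# Toy notes (invented): label 1 means discharged the next day.
notes = [
    "plan discharge home tomorrow pain controlled",
    "ambulating well discharge planned",
    "pain uncontrolled continue iv analgesia",
    "drain in place monitor output",
]
labels = [1, 1, 0, 0]

scores = log_odds(notes, labels)
flagged = flag_leakage(scores)
audited_notes = [mask(n, flagged) for n in notes]
```

In a real audit the flagged list would be reviewed by a clinician before masking, and the model retrained on `audited_notes` would be compared with the unaudited one on calibration as well as discrimination, since the abstract reports that audited models are more conservative and better calibrated.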
