2602.20442v1 Feb 24, 2026 cs.LG

Imputation of Unknown Missingness in Sparse Electronic Health Records

S. Batra (Citations: 3,677; h-index: 13)

Robert E. Tillman (Citations: 19; h-index: 2)

Junying Han (Citations: 0; h-index: 0)

Josue Nassar (Citations: 191; h-index: 6)

A. Córdova-Palomera (Citations: 65; h-index: 3)

V. Nori (Citations: 993; h-index: 13)

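Machine learning has great potential to advance medicine, and electronic health records (EHRs) serve as a primary data source. However, EHRs are often sparse and contain missing values because of the many difficulties and limitations in collecting data and sharing it between healthcare providers. Existing imputation techniques focus mainly on known missingness, such as lab test results that are missing or unavailable, and do not explicitly handle cases where it is hard to tell what is missing. For example, a diagnosis code that is absent from an EHR may mean that the patient was never diagnosed with the condition, or that a diagnosis was made but the provider did not share it. Such cases fall into the category of unknown missingness. To address this problem, we developed a data denoising algorithm that recovers unknown missing values in binary EHRs. We designed a transformer-based denoising neural network whose output is adaptively thresholded to recover values where data are predicted to be missing. Our results show that, on a real EHR dataset, the method denoises medical codes more accurately than existing imputation approaches, and that using the denoised data improves performance on downstream tasks. In particular, when we applied the method to a real-world application, predicting hospital readmission from EHRs, it achieved a statistically significant improvement over all existing baselines.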

Original Abstract

Machine learning holds great promise for advancing the field of medicine, with electronic health records (EHRs) serving as a primary data source. However, EHRs are often sparse and contain missing data due to various challenges and limitations in data collection and sharing between healthcare providers. Existing techniques for imputing missing values predominantly focus on known unknowns, such as missing or unavailable values of lab test results; most do not explicitly address situations where it is difficult to distinguish what is missing. For instance, a missing diagnosis code in an EHR could signify either that the patient has not been diagnosed with the condition or that a diagnosis was made but not shared by a provider. Such situations fall into the paradigm of unknown unknowns. To address this challenge, we develop a general-purpose algorithm for denoising data to recover unknown missing values in binary EHRs. We design a transformer-based denoising neural network where the output is thresholded adaptively to recover values in cases where we predict data are missing. Our results demonstrate improved accuracy in denoising medical codes within a real EHR dataset compared to existing imputation approaches, and the denoised data lead to increased performance on downstream tasks. In particular, when applying our method to a real-world application, predicting hospital readmission from EHRs, our method achieves statistically significant improvement over all existing baselines.
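
The recovery step described in the abstract (threshold a denoiser's output to restore codes that are predicted to be missing) can be illustrated with a minimal sketch. This is not the authors' implementation: the transformer is replaced here by precomputed per-code scores, and both function names and the rule of choosing each threshold by F1 on held-out entries are assumptions made only for illustration, since the abstract does not say how the thresholds are adapted.

```python
from typing import List, Sequence


def pick_threshold(scores: Sequence[float],
                   labels: Sequence[int],
                   candidates: Sequence[float] = (0.1, 0.3, 0.5, 0.7, 0.9)) -> float:
    """Hypothetical helper: return the candidate threshold with the best F1.

    `scores` are denoiser outputs for one medical code on held-out entries,
    and `labels` are the true binary values for those entries. Selecting by
    F1 is an assumption, not something stated in the abstract.
    """
    best_threshold, best_f1 = candidates[0], -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict comparison keeps the first of any tied candidates
            best_threshold, best_f1 = t, f1
    return best_threshold


def recover_codes(observed: Sequence[Sequence[int]],
                  scores: Sequence[Sequence[float]],
                  thresholds: Sequence[float]) -> List[List[int]]:
    """Hypothetical helper: fill in codes that are predicted to be missing.

    An observed 1 is always kept. An observed 0 is ambiguous (truly absent,
    or present but not shared), so it becomes 1 only when the score for that
    code reaches the code's own threshold.
    """
    return [
        [1 if obs == 1 or s >= thresholds[j] else 0
         for j, (obs, s) in enumerate(zip(row_obs, row_scores))]
        for row_obs, row_scores in zip(observed, scores)
    ]
```

In this sketch, rows are patients and columns are medical codes. Because only zeros can be flipped, the step adds plausible codes without ever discarding a recorded one, which matches the setting where an absent code is the only ambiguous observation.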
