2603.16738v1 Mar 17, 2026 cs.AI

MedCL-Bench: Benchmarking stability-efficiency trade-offs and scaling in biomedical continual learning

Zaifu Zhan

Citations: 251

h-index: 8

Min Zeng

Citations: 24

h-index: 3

Shuang Zhou

Citations: 93

h-index: 6

Rui Zhang

Citations: 66

h-index: 5

Abstract

Medical language models must be updated as evidence and terminology evolve, yet sequential updating can trigger catastrophic forgetting. Although biomedical NLP has many static benchmarks, no unified, task-diverse benchmark exists for evaluating continual learning under standardized protocols, robustness to task order and compute-aware reporting. We introduce MedCL-Bench, which streams ten biomedical NLP datasets spanning five task families and evaluates eleven continual learning strategies across eight task orders, reporting retention, transfer, and GPU-hour cost. Across backbones and task orders, direct sequential fine-tuning on incoming tasks induces catastrophic forgetting, causing update-induced performance regressions on prior tasks. Continual learning methods occupy distinct retention-compute frontiers: parameter-isolation provides the best retention per GPU-hour, replay offers strong protection at higher cost, and regularization yields limited benefit. Forgetting is task-dependent, with multi-label topic classification most vulnerable and constrained-output tasks more robust. MedCL-Bench provides a reproducible framework for auditing model updates before deployment.
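
The abstract reports retention and transfer but does not give the formulas behind them. As a rough illustration only, the sketch below computes two metrics commonly used in the continual learning literature, backward transfer and average forgetting, from a matrix of per-task scores. The matrix layout, the function names, and the choice of these two definitions are assumptions made for illustration; MedCL-Bench may define its metrics differently.

```python
from typing import List

# Assumed layout (not taken from the paper): R[i][j] is the score on task j
# measured after the model has finished training on task i, with tasks
# indexed in the order they were streamed.
Matrix = List[List[float]]


def backward_transfer(R: Matrix) -> float:
    """Mean change on earlier tasks between when each was learned and the end.

    Negative values indicate forgetting; positive values indicate that later
    training helped earlier tasks.
    """
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def average_forgetting(R: Matrix) -> float:
    """Mean drop from the best score ever reached on a task to its final score.

    One common convention: the best score is taken over checkpoints from the
    point the task was learned up to, but excluding, the final checkpoint.
    """
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    total = 0.0
    for j in range(T - 1):
        best = max(R[i][j] for i in range(j, T - 1))
        total += best - R[T - 1][j]
    return total / (T - 1)


if __name__ == "__main__":
    # Made-up scores for three tasks, purely to exercise the functions.
    scores = [
        [0.90, 0.10, 0.05],
        [0.70, 0.85, 0.10],
        [0.60, 0.80, 0.88],
    ]
    print("backward transfer:", backward_transfer(scores))
    print("average forgetting:", average_forgetting(scores))
```

To compare task orders, as the benchmark does across eight orders, one such matrix would be built per order and the resulting metrics summarized across them.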

1 Citations

0 Influential

4 Altmetric

21.0 Score
