2604.20441v1 Apr 22, 2026 cs.AI

MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

Fei Sun (Citations: 2; h-index: 1)
Bo-Sheng Huang (Citations: 0; h-index: 0)
Yingyong Hou (Citations: 76; h-index: 4)
Xinyuan Lao (Citations: 117; h-index: 4)
Huimei Wang (Citations: 3; h-index: 1)
Qi Yao (Citations: 15; h-index: 2)
Yu Lv (Citations: 5; h-index: 1)
Weiqi Lei (Citations: 1,767; h-index: 10)
Pengfei Xia (Citations: 96; h-index: 2)
Zhujun Tan (Citations: 0; h-index: 0)
Shengyang Xie (Citations: 20; h-index: 2)
Wei Chen (Citations: 2; h-index: 1)
Xueqi Wen (Citations: 845; h-index: 9)

Abstract

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review.

Methods: We developed MedSkillAudit (skill-auditor@1.0), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline.

Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch.

Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.
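The two agreement statistics named in the Methods can be illustrated with a minimal sketch. This is not the authors' analysis code: it assumes the standard Shrout-Fleiss definition of ICC(2,1) (two-way random effects, absolute agreement, single rater) and the usual disagreement-weight form of linearly weighted Cohen's kappa. The function names, the integer coding of the release dispositions (0 = Reject through 3 = Production Ready), and the toy ratings are all hypothetical.

```python
from collections import Counter


def icc2_1(ratings):
    """ICC(2,1) from a subjects-by-raters table (list of equal-length rows)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def linear_weighted_kappa(a, b, n_categories):
    """Cohen's kappa with linear disagreement weights |i - j| / (K - 1).

    a, b: equal-length lists of integer category codes in 0..K-1.
    Undefined (division by zero) if expected disagreement is zero.
    """
    n = len(a)
    pairs, count_a, count_b = Counter(zip(a, b)), Counter(a), Counter(b)
    max_dist = n_categories - 1
    observed = sum(abs(i - j) / max_dist * c / n for (i, j), c in pairs.items())
    expected = sum(
        abs(i - j) / max_dist * count_a[i] * count_b[j] / n**2
        for i in range(n_categories)
        for j in range(n_categories)
    )
    return 1.0 - observed / expected


# Hypothetical toy data, NOT from the paper: (system score, expert consensus score).
toy_scores = [[82, 78], [65, 70], [90, 85], [55, 62], [73, 71]]
# Hypothetical ordinal coding: 0 = Reject, 1 = Beta Only,
# 2 = Limited Release, 3 = Production Ready.
toy_system = [3, 1, 3, 0, 2]
toy_expert = [2, 1, 3, 1, 2]

print("ICC(2,1):", icc2_1(toy_scores))
print("weighted kappa:", linear_weighted_kappa(toy_system, toy_expert, 4))
```

Linear weights penalize a disagreement in proportion to how many ordinal steps apart the two dispositions are, so a Production Ready versus Reject split counts three times as much as a Production Ready versus Limited Release split. Because ICC(2,1) measures absolute agreement, a consistent scoring offset between the system and the experts lowers it even when the rankings match.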
