2604.21345v1 Apr 23, 2026 cs.AI

Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

J. Zhang (Citations: 20, h-index: 1)
Philip Zhong (Citations: 0, h-index: 0)
Don Wang (Citations: 0, h-index: 0)
Kent Chen (Citations: 28, h-index: 1)

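This paper presents a reusable evaluation pipeline for generative AI applications, applies it to AI meeting summaries, and releases it with a public artifact package derived from a Dataset Pipeline. The system separates reusable orchestration from task-specific semantics across five stages: source intake, structured reference construction, candidate generation, structured scoring, and reporting. Unlike standalone claim scorers, it treats both ground truth and evaluator outputs as typed, persisted artifacts, which makes aggregation, issue analysis, and statistical testing possible. The study evaluates gpt-4.1-mini, gpt-5-mini, and gpt-5.1 on a dataset of 114 meetings covering city_council, private_data, and whitehouse_press_briefings. gpt-4.1-mini has the highest mean accuracy (0.583), while gpt-5.1 leads in completeness (0.886) and coverage (0.942). Paired sign tests with Holm correction find no statistically significant difference in accuracy, but gpt-5.1 shows significantly higher retention. A typed DeepEval-based contrastive baseline preserves the retention ordering but reports higher holistic accuracy, which suggests that reference-based scoring can miss unsupported-specifics errors that claim-grounded evaluation detects. Typed analysis identifies whitehouse_press_briefings as a domain where accuracy is hard to achieve and where unsupported specifics appear frequently. A deployment follow-up under the same protocol shows gpt-5.4 outperforming gpt-4.1 on every metric, with statistically significant gains on the retention metrics. The system benchmarks and documents the offline evaluation loop, but it does not quantitatively evaluate the online feedback-to-evaluation path.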
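As a rough illustration of what "typed, persisted artifacts" could look like in practice, the sketch below stores judge scores as typed records, writes them out as JSON lines, and aggregates them per model. The class names, field names, and helper functions are hypothetical and are not taken from the paper's artifact package.

```python
import json
from dataclasses import dataclass, asdict
from statistics import mean

# Hypothetical artifact types; the fields are illustrative assumptions.
@dataclass(frozen=True)
class ReferenceArtifact:
    meeting_id: str
    domain: str          # e.g. "city_council"
    claims: tuple        # structured reference claims for the meeting

@dataclass(frozen=True)
class ScoreArtifact:
    meeting_id: str
    model: str
    metric: str          # e.g. "accuracy", "completeness", "coverage"
    value: float

def to_jsonl(artifacts):
    """Serialize artifacts as one JSON object per line for persistence."""
    return "\n".join(json.dumps(asdict(a), sort_keys=True) for a in artifacts)

def scores_from_jsonl(text):
    """Load persisted score artifacts back into typed records."""
    return [ScoreArtifact(**json.loads(line)) for line in text.splitlines() if line]

def mean_by_model(scores, metric):
    """Aggregate one metric per model over all persisted scores."""
    by_model = {}
    for s in scores:
        if s.metric == metric:
            by_model.setdefault(s.model, []).append(s.value)
    return {model: mean(values) for model, values in by_model.items()}
```

Because the scores survive a round trip through storage, the same records could feed reporting, issue analysis, and significance tests without rerunning the judge.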

Original Abstract

We present a reusable evaluation pipeline for generative AI applications, instantiated for AI meeting summaries and released with a public artifact package derived from a Dataset Pipeline. The system separates reusable orchestration from task-specific semantics across five stages: source intake, structured reference construction, candidate generation, structured scoring, and reporting. Unlike standalone claim scorers, it treats both ground truth and evaluator outputs as typed, persisted artifacts, enabling aggregation, issue analysis, and statistical testing. We benchmark the offline loop on a typed dataset of 114 meetings spanning city_council, private_data, and whitehouse_press_briefings, producing 340 meeting-model pairs and 680 judge runs across gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this protocol, gpt-4.1-mini achieves the highest mean accuracy (0.583), while gpt-5.1 leads in completeness (0.886) and coverage (0.942). Paired sign tests with Holm correction show no significant accuracy winner but confirm significant retention gains for gpt-5.1. A typed DeepEval contrastive baseline preserves retention ordering but reports higher holistic accuracy, suggesting that reference-based scoring may overlook unsupported-specifics errors captured by claim-grounded evaluation. Typed analysis identifies whitehouse_press_briefings as an accuracy-challenging domain with frequent unsupported specifics. A deployment follow-up shows gpt-5.4 outperforming gpt-4.1 across all metrics, with statistically robust gains on retention metrics under the same protocol. The system benchmarks the offline loop and documents, but does not quantitatively evaluate, the online feedback-to-evaluation path.
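The statistical procedure named in the abstract, paired sign tests with Holm correction, can be sketched as follows. This is a minimal standard-library version under stated assumptions: ties are dropped, the test is two-sided and exact, and the function names are hypothetical. It is not the authors' implementation.

```python
from math import comb

def sign_test_p(a, b):
    """Two-sided exact paired sign test on per-meeting scores; ties are dropped."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return 1.0  # no informative pairs
    k = min(wins, losses)
    # Binomial(n, 0.5) probability of a split at least this lopsided, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply the rank-th smallest p-value by (m - rank), keep monotone.
        running = max(running, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running
    return adjusted
```

In a setting like the one described, each model pair and metric would contribute one sign-test p-value, and `holm_adjust` would be applied across that family before declaring any difference significant.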
