2602.06526v1 Feb 06, 2026 cs.CL

Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevant Assessment for IR Benchmarks

Minjeong Ban

Citations: 21

h-index: 2

Jeonghwan Choi

Citations: 14

h-index: 2

Hyangsuk Min

Citations: 59

h-index: 5

Nicole Hee-Yeon Kim

Citations: 17

h-index: 2

Minseok Kim

Citations: 8

h-index: 1

Jae-Gil Lee

Citations: 2,848

h-index: 15

Hwanjun Song

Citations: 11

h-index: 2

Abstract

Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement-based debate, it yields more accurate labeling for certain cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval-generation misalignment. The relevance assessment framework is available at https://github.com/DISL-Lab/DREAM-ICLR-26, and the BRIDGE dataset is available at https://github.com/DISL-Lab/BRIDGE-Benchmark.
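
The control flow the abstract describes (two agents start from opposing stances, critique each other over several rounds, and the case is escalated to a human if they never agree) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify prompts, the number of rounds, or the exact stopping rule, so `debate`, `Verdict`, `make_toy_agent`, and `max_rounds=3` are all hypothetical names and values chosen for illustration, and the toy agents use simple token overlap in place of real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical agent interface (an assumption, not taken from the paper):
# (query, chunk, own current stance, opponent's last argument) -> (stance, argument)
Agent = Callable[[str, str, bool, str], Tuple[bool, str]]


@dataclass
class Verdict:
    label: Optional[bool]  # True/False once the agents agree, None if they never do
    rounds: int            # number of debate rounds that were run
    escalate: bool         # True when the case should be handed to a human


def debate(query: str, chunk: str, agent_a: Agent, agent_b: Agent,
           max_rounds: int = 3) -> Verdict:
    """Run an agreement-based debate; escalate if no consensus is reached."""
    # Opposing initial stances: A starts at "relevant", B at "not relevant".
    stance_a, stance_b = True, False
    arg_a, arg_b = "initial stance: relevant", "initial stance: not relevant"
    for round_no in range(1, max_rounds + 1):
        # Reciprocal critique: each agent sees the other's previous argument.
        new_a, new_arg_a = agent_a(query, chunk, stance_a, arg_b)
        new_b, new_arg_b = agent_b(query, chunk, stance_b, arg_a)
        stance_a, arg_a, stance_b, arg_b = new_a, new_arg_a, new_b, new_arg_b
        if stance_a == stance_b:
            return Verdict(label=stance_a, rounds=round_no, escalate=False)
    return Verdict(label=None, rounds=max_rounds, escalate=True)


def make_toy_agent(threshold: float) -> Agent:
    """Stand-in for an LLM call: votes 'relevant' when token overlap >= threshold."""
    def agent(query: str, chunk: str, stance: bool, opponent_argument: str):
        q_tokens = set(query.lower().split())
        c_tokens = set(chunk.lower().split())
        overlap = len(q_tokens & c_tokens) / max(len(q_tokens), 1)
        return overlap >= threshold, f"overlap={overlap:.2f}, threshold={threshold}"
    return agent


lenient, strict = make_toy_agent(0.3), make_toy_agent(0.9)
clear_hit = debate("multi agent debate", "debate between multi agent systems", lenient, strict)
clear_miss = debate("multi agent debate", "cooking pasta", lenient, strict)
borderline = debate("multi agent debate", "multi agent planning", lenient, strict)
```

In this toy setup the first two cases end in agreement after one round, while the borderline chunk never produces consensus and is flagged for human review. A real system would replace the overlap heuristic with LLM calls that actually read the opponent's argument.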

1 Citations

0 Influential

35.547189562171 Altmetric

178.7 Score
