2601.06676v1 Jan 10, 2026 cs.CL

IDRBench: Interactive Deep Research Benchmark

Yingchaojie Feng

Citations: 435

h-index: 11

Zhaorui Yang

Zhejiang University

Citations: 127

h-index: 3

Jun Yu

Citations: 29

h-index: 4

Wei Chen

Citations: 3,368

h-index: 6

Anthony K. H. Tung

Citations: 105

h-index: 6

Qiang Huang

Harbin Institute of Technology (Shenzhen)

Citations: 792

h-index: 14

Xiaoyan Xie

Citations: 40

h-index: 3

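Deep research agents built on large language models (LLMs) can carry out tasks such as multi-step reasoning, web exploration, and long-form report generation. However, most existing systems operate autonomously, assuming fully specified user intent and evaluating only the final output. In practice, research goals are often ill-defined and shift during exploration, so sustained interaction is essential for reliable results. Despite this importance, existing deep research benchmarks rarely consider interaction: they neither model dynamic user feedback nor quantify interaction costs. This work introduces IDRBench, the first benchmark for systematically evaluating interaction in deep research. IDRBench combines a modular multi-agent research framework, on-demand interaction, a scalable reference-grounded user simulator, and an evaluation suite that jointly measures the benefits of interaction (quality and alignment) and its costs (number of turns and number of tokens). Experiments on seven state-of-the-art LLMs show that interaction consistently improves research quality and robustness, often with an effect that outweighs differences in model capacity. The experiments also reveal substantial trade-offs in interaction efficiency.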

Original Abstract

Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, most existing systems operate in an autonomous manner, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, making sustained interaction essential for robust alignment. Despite its importance, interaction remains largely invisible to existing deep research benchmarks, which neither model dynamic user feedback nor quantify its costs. We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research. IDRBench combines a modular multi-agent research framework with on-demand interaction, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures interaction benefits (quality and alignment) and costs (turns and tokens). Experiments across seven state-of-the-art LLMs show that interaction consistently improves research quality and robustness, often outweighing differences in model capacity, while revealing substantial trade-offs in interaction efficiency.

2 Citations

0 Influential

7 Altmetric

37.0 Score

