2604.10741v2 Apr 12, 2026 cs.CL

Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

Shuicheng Yan (Citations: 59, h-index: 5)
Yuxin Hu (Citations: 3, h-index: 1)
Jianzhu Bao (Citations: 6, h-index: 2)
Fangda Ye (Citations: 3, h-index: 1)
Zhifei Xie (Citations: 102, h-index: 2)
Yihang Yin (Citations: 445, h-index: 5)
Shurui Huang (Citations: 4, h-index: 1)
Shi Dong (Citations: 8, h-index: 2)

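Recent agent-based search frameworks enable deep research through iterative planning and retrieval, reducing hallucinations and improving factual accuracy. However, these frameworks remain text-centric and overlook the multimodal information found in real-world expert reports. This work proposes an important task: multimodal long-form generation. To that end, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. Deep-Reporter performs the following: (i) agent-based multimodal search and filtering, which retrieves and filters text passages and information-rich visuals; (ii) checklist-guided incremental synthesis, which ensures coherent image-text integration and optimal citation placement; and (iii) recurrent context management, which balances long-range coherence with local fluency. We developed a rigorous data curation pipeline that produces 8,000 high-quality agentic traces for model optimization. We also introduce M2LongBench, a comprehensive testbed that comprises 247 research tasks across 9 domains and provides a stable multimodal environment. Extensive experiments show that multimodal long-form generation is a challenging task, particularly in selecting and integrating multimodal information, and that effective post-training can narrow this gap.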

Original Abstract

Recent agentic search frameworks enable deep research via iterative planning and retrieval, reducing hallucinations and enhancing factual grounding. However, they remain text-centric, overlooking the multimodal evidence that characterizes real-world expert reports. We introduce a pressing task: multimodal long-form generation. Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It orchestrates: (i) Agentic Multimodal Search and Filtering to retrieve and filter textual passages and information-dense visuals; (ii) Checklist-Guided Incremental Synthesis to ensure coherent image-text integration and optimal citation placement; and (iii) Recurrent Context Management to balance long-range coherence with local fluency. We develop a rigorous curation pipeline producing 8K high-quality agentic traces for model optimization. We further introduce M2LongBench, a comprehensive testbed comprising 247 research tasks across 9 domains and a stable multimodal sandbox. Extensive experiments demonstrate that long-form multimodal generation is a challenging task, especially in multimodal selection and integration, and effective post-training can bridge the gap.
