2601.10108v1 Jan 15, 2026 cs.CL

SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

Authors

Yu Qiao (Citations: 6; h-index: 2)
Zihan Wang (Citations: 555; h-index: 8)
Ruihang Chu (Citations: 101; h-index: 6)
Yiming Ren (Citations: 1,664; h-index: 4)
Junjie Wang (Citations: 13; h-index: 2)
Yu Meng (Citations: 49; h-index: 4)
Yihan Shi (Citations: 17; h-index: 2)
Zhiqiang Lin (Citations: 5; h-index: 1)
Yiran Xu (Citations: 204; h-index: 4)
Yunfei Zhao (Citations: 735; h-index: 12)
Ruiming Tang (Citations: 2; h-index: 1)
Minghao Liu (Citations: 283; h-index: 3)
Yujiu Yang (Citations: 60; h-index: 5)
Ziming Li (Citations: 414; h-index: 11)

Abstract

Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce "No Evidence, No Score", which scores predictions only when they are grounded to verifiable anchors and diagnoses evidence quality via matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correctness and traceable support.
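
The "No Evidence, No Score" idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `Prediction` fields, the anchor identifiers, the use of a plain mean over matching, relevance, and logic, and the multiplication of answer score by evidence quality are all assumptions chosen only to show how an evidence gate might behave.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Prediction:
    """Hypothetical container for one model prediction (not from the paper)."""
    answer_score: float                 # answer correctness in [0, 1]
    anchors: list[str] = field(default_factory=list)  # cited evidence anchors
    matching: float = 0.0               # evidence diagnostics in [0, 1]
    relevance: float = 0.0
    logic: float = 0.0


def evidence_quality(pred: Prediction) -> float:
    # Assumption: the three diagnostics are averaged with equal weight.
    return mean([pred.matching, pred.relevance, pred.logic])


def gated_score(pred: Prediction, valid_anchors: set[str]) -> float:
    """Return 0 unless at least one cited anchor exists in the document."""
    verified = [a for a in pred.anchors if a in valid_anchors]
    if not verified:
        return 0.0  # no verifiable evidence, no score
    # Assumption: the final score scales correctness by evidence quality.
    return pred.answer_score * evidence_quality(pred)


if __name__ == "__main__":
    doc_anchors = {"fig:2", "sec:3.1", "tab:1"}
    ungrounded = Prediction(answer_score=1.0)
    grounded = Prediction(1.0, ["fig:2"], matching=0.5, relevance=0.5, logic=0.5)
    print(gated_score(ungrounded, doc_anchors))
    print(gated_score(grounded, doc_anchors))
```

Under these assumptions, a fully correct answer that cites no verifiable anchor receives nothing, which mirrors the gap the abstract reports between answer accuracy and evidence-aligned scores.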

Citations: 1

Influential citations: 0

Altmetric: 6

Score: 31.0
