2604.21277v1 Apr 23, 2026 cs.AI

Can MLLMs "Read" What is Missing?

Chaozheng Huang

Jin Guo

Xi Fang

Original Abstract

We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.
