2604.03660v1 Apr 04, 2026 cs.AI

TableVision: A Large-Scale Benchmark for Spatially Grounded Reasoning over Complex Hierarchical Tables

Lu Dai (Citations: 60, h-index: 5)
Hanqing Wang (Citations: 43, h-index: 3)
Xiaoyu Chen (Citations: 140, h-index: 4)
Zhuoyu Li (Citations: 0, h-index: 0)
W. Dai (Citations: 27, h-index: 2)
Yanzong Zheng (Citations: 3, h-index: 1)
Junyong Lin (Citations: 31, h-index: 3)
Hui Xiong (Citations: 31, h-index: 4)
Z. Xia (Citations: 59, h-index: 3)

Original Abstract

Structured tables are essential for conveying high-density information in professional domains such as finance, healthcare, and scientific research. Despite the progress in Multimodal Large Language Models (MLLMs), reasoning performance remains limited for complex tables with hierarchical layouts. In this paper, we identify a critical Perception Bottleneck through quantitative analysis. We find that as task complexity scales, the number of involved discrete visual regions increases disproportionately. This processing density leads to an internal "Perceptual Overload," where MLLMs struggle to maintain accurate spatial attention during implicit generation. To address this bottleneck, we introduce TableVision, a large-scale, trajectory-aware benchmark designed for spatially grounded reasoning. TableVision stratifies tabular tasks into three cognitive levels (Perception, Reasoning, and Analysis) across 13 sub-categories. By utilizing a rendering-based deterministic grounding pipeline, the dataset explicitly couples multi-step logical deductions with pixel-perfect spatial ground truths, comprising 6,799 high-fidelity reasoning trajectories. Our empirical results, supported by diagnostic probing, demonstrate that explicit spatial constraints significantly recover the reasoning potential of MLLMs. Furthermore, our two-stage decoupled framework achieves a robust 12.3% overall accuracy improvement on the test set. TableVision provides a rigorous testbed and a fresh perspective on the synergy between perception and logic in document understanding.
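
The abstract mentions a rendering-based deterministic grounding pipeline that pairs reasoning steps with pixel-level spatial ground truths, but gives no implementation details. As a rough illustration only, the sketch below assumes a table rendered as a simple grid with known column widths and row heights, and derives a cell's bounding box from them. The `Cell` class, the `cell_bbox` function, and the sample dimensions are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass(frozen=True)
class Cell:
    """Hypothetical table cell: grid position plus optional row/column span."""
    row: int
    col: int
    rowspan: int = 1
    colspan: int = 1


def cell_bbox(cell, col_widths, row_heights):
    """Return (x0, y0, x1, y1) in pixels for a cell of a rendered grid.

    Assumes the renderer lays cells out edge to edge with no borders or padding.
    """
    # Prefix sums give the pixel offset of every vertical and horizontal grid line.
    xs = [0, *accumulate(col_widths)]
    ys = [0, *accumulate(row_heights)]
    return (
        xs[cell.col],
        ys[cell.row],
        xs[cell.col + cell.colspan],
        ys[cell.row + cell.rowspan],
    )


# Assumed layout: three columns and three rows, sizes in pixels.
col_widths = [120, 80, 80]
row_heights = [30, 30, 24]

header = Cell(row=0, col=1, colspan=2)  # a header spanning two columns
leaf = Cell(row=2, col=2)               # an ordinary data cell

print(cell_bbox(header, col_widths, row_heights))  # (120, 0, 280, 30)
print(cell_bbox(leaf, col_widths, row_heights))    # (200, 60, 280, 84)
```

Because the boxes follow directly from the layout parameters rather than from a detector, the same table always yields the same coordinates, which is one plausible reading of "deterministic" here.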
