2603.20020v1 Mar 20, 2026 cs.CV

Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR

Ming Zhang (Citations: 8,110; h-index: 5)

Daxiang Dong (Citations: 20; h-index: 2)

Ziye Yuan (Citations: 9; h-index: 2)

Ruchang Yao (Citations: 3; h-index: 1)

Chengxin Zheng (Citations: 17; h-index: 3)

Yusheng Zhao (Citations: 1,039; h-index: 14)

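Summary: Multimodal large language models (MLLMs) are strong at high-level reasoning but struggle with OCR, where fine-grained visual detail is lost or misaligned. The paper points to an overlooked optimization problem in multi-layer feature fusion: skip pathways give high-level semantic objectives a direct back-propagation route into early visual layers, which overwrites low-level signals and destabilizes training. To reduce this gradient interference, the authors propose Detached Skip-Links, a minimal change that still reuses shallow features in the forward pass but blocks gradients from flowing through the skip branch during joint training. This asymmetric design adds no learnable parameters and improves stability and convergence. To diagnose whether fine-grained information is preserved in a form the LLM can use, they introduce R-Probe, which measures how well projected visual tokens can be reconstructed at the pixel level by a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, at scales of up to 7M training samples, the approach consistently improves OCR-centric benchmarks and also yields clear gains on general multimodal tasks.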

Original Abstract

Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
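
To make the forward/backward asymmetry concrete, here is a minimal toy sketch, not the authors' implementation. It uses a hypothetical scalar autodiff class (`Value`) and a two-weight "network" (`w_shallow`, `w_deep`), both assumptions made purely for illustration. In the real setting the shallow features would be ViT activations and the skip branch would feed a projector. The sketch only shows that detaching the skip branch leaves the fused output unchanged while removing the direct gradient path to the shallow weight.

```python
class Value:
    """Scalar with reverse-mode autodiff, just enough for this demo (hypothetical helper)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents      # inputs this value was computed from
        self._grad_fns = grad_fns    # one local-derivative function per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def detach(self):
        # Same forward value, but no parents: gradients stop here.
        return Value(self.data)

    def backward(self):
        # Visit nodes in reverse topological order, accumulating gradients.
        order, seen = [], set()

        def visit(v):
            if id(v) in seen:
                return
            seen.add(id(v))
            for p in v._parents:
                visit(p)
            order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, fn in zip(v._parents, v._grad_fns):
                p.grad += fn(v.grad)


def fuse(x, w_shallow, w_deep, detach_skip):
    """Toy multi-layer fusion: the output is deep feature + skip-linked shallow feature."""
    shallow = w_shallow * x                      # stands in for an early visual layer
    deep = w_deep * shallow                      # stands in for the later layers
    skip = shallow.detach() if detach_skip else shallow
    return deep + skip


def run(detach_skip):
    x, w_shallow, w_deep = Value(2.0), Value(3.0), Value(0.5)
    out = fuse(x, w_shallow, w_deep, detach_skip)
    out.backward()
    return out.data, w_shallow.grad


print("attached skip:", run(detach_skip=False))
print("detached skip:", run(detach_skip=True))
```

In this toy, both variants produce the same fused value. With the skip attached, the shallow weight receives gradient through both the deep path and the skip path. With the skip detached, it receives gradient only through the deep path. In a tensor framework the same effect is typically obtained with a stop-gradient operation on the skip branch.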
