2605.06708v1 May 06, 2026 cs.CV

Visual Text Compression as Measure Transport

Lv Tang

Tianyi Zheng

Yang Liu

Bo Li

Xing Li

Original Abstract

Visual text compression (VTC) promises efficient long-context processing by rendering text into an image and re-encoding it with a vision-language model, often producing $3$--$20\times$ fewer decoder tokens than subword tokenization. Yet token savings do not translate predictably into downstream utility: on some tasks the visual path matches or exceeds the text path, on others it collapses, and the compression ratio itself does not predict which regime will occur. The missing quantity is therefore not another summary of efficiency, but a principled measure of task-relevant information loss induced by visual encoding. We address this problem by formulating VTC in the language of measure transport. Treating text and visual tokens as empirical probability measures, we show that the ViT patch encoder induces a push-forward map whose transport cost decomposes into a precision cost from within-patch aggregation and a coverage cost from cross-patch fragmentation. Both terms are estimable from downstream-label-free probes. This formulation yields two operational consequences: a downstream-label-free routing criterion that selects whether to use the visual path for a given input or benchmark instance, and a transport-informed foveation mechanism that re-encodes high-cost regions at higher resolution. Across $24$ NLP datasets at Qwen3-4B, our label-free rule matches the per-dataset oracle on $17/24$ datasets ($70.8\%$), and improves the average task score by $+3.3\%$ with $-10.3\%$ average tokens relative to a pure-LLM.
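
To make the vocabulary of the abstract concrete, the following is a minimal sketch and not the authors' method. The abstract gives no formulas, so everything here is an assumption made for illustration: the layout encoding `word_patches` (for each word, the patch index each of its characters is rendered into), the two toy proxies, the additive combination, and the `threshold` value are all hypothetical. The only standard fact used is the definition of a push-forward, $T_{\#}\mu(B) = \mu(T^{-1}(B))$: mass on characters is moved onto the patches they land in. The precision proxy counts how many distinct words are merged into one patch (within-patch aggregation), the coverage proxy counts how many words are split across patches (cross-patch fragmentation), and neither needs task labels.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical layout encoding: word_patches[i][j] is the patch index that
# character j of word i is rendered into. Inputs are assumed non-empty.
Layout = List[List[int]]


def push_forward(word_patches: Layout) -> Dict[int, float]:
    """Push the uniform empirical measure on characters onto patch indices."""
    n_chars = sum(len(chars) for chars in word_patches)
    mass: Dict[int, float] = defaultdict(float)
    for chars in word_patches:
        for patch in chars:
            mass[patch] += 1.0 / n_chars
    return dict(mass)


def words_per_patch(word_patches: Layout) -> Dict[int, Set[int]]:
    """Map each occupied patch to the set of word ids that touch it."""
    members: Dict[int, Set[int]] = defaultdict(set)
    for word_id, chars in enumerate(word_patches):
        for patch in chars:
            members[patch].add(word_id)
    return dict(members)


def precision_proxy(word_patches: Layout) -> float:
    """Toy proxy: mean number of extra words merged into an occupied patch."""
    members = words_per_patch(word_patches)
    return sum(len(ids) - 1 for ids in members.values()) / len(members)


def coverage_proxy(word_patches: Layout) -> float:
    """Toy proxy: fraction of words split across more than one patch."""
    split = sum(1 for chars in word_patches if len(set(chars)) > 1)
    return split / len(word_patches)


def route(word_patches: Layout, threshold: float = 1.0) -> str:
    """Label-free routing: pick the visual path only when the toy cost is low.

    The additive cost and the default threshold are assumptions.
    """
    cost = precision_proxy(word_patches) + coverage_proxy(word_patches)
    return "visual" if cost <= threshold else "text"


def patches_to_foveate(word_patches: Layout, k: int) -> List[int]:
    """Return the k most crowded patches as candidates for re-encoding."""
    members = words_per_patch(word_patches)
    ranked = sorted(members, key=lambda p: (-len(members[p]), p))
    return ranked[:k]
```

With a layout such as `[[0, 0], [1, 1]]`, where every word sits in its own patch, both proxies are zero and `route` chooses the visual path. With `[[0, 0, 0], [0, 1, 1], [1, 1]]`, the middle word straddles two patches and both patches hold two words, so the toy cost exceeds the default threshold and `route` falls back to the text path. A real estimator would presumably work on encoder features rather than character layouts, which this sketch does not attempt.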
