2602.06973v1 Jan 12, 2026 cs.CL

Does Visual Rendering Bypass Tokenization? Investigating Script-Tokenizer Misalignment in Pixel-Based Language Models

Lucky Susanto (Citations: 127, h-index: 5)
M. Wijanarko (Citations: 13, h-index: 3)
Khumaisa Nur'aini, Monash University Indonesia (Citations: 58, h-index: 2)
Farid Adilazuarda (Citations: 0, h-index: 0)
Alham Fikri Aji, MBZUAI (Citations: 8,673, h-index: 37)
Derry Wijaya (Citations: 6, h-index: 2)

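Pixel-based language modeling renders text as images to overcome the limitations of subword tokenization, yet recent multimodal models such as DualGPT reintroduce a text tokenizer to improve autoregressive performance. This study examines a fundamental question: does visual rendering fully free a model from tokenization constraints? Working with four low-resource regional languages of Indonesia that have their own non-Latin scripts (Javanese, Balinese, Sundanese, and Lampungnese), it evaluates the effect of script-tokenizer alignment within the DualGPT architecture. The results show that, despite visual rendering, reintegrating a text tokenizer into the architecture brings back the tokenizer misalignment problem that pixel-based language modeling was meant to solve. Although the Llama 2 tokenizer has lower OOV (out-of-vocabulary) and fertility rates, it performs markedly worse than a custom tokenizer, which scores up to 30.15 chrF++ points higher. The findings serve as a warning for future multimodal model development and suggest that text tokenizers remain a significant obstacle to building equitable and effective models.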

Original Abstract

While pixel-based language modeling aims to bypass the sub-word tokenization bottleneck by rendering text as images, recent multimodal variants such as DualGPT reintroduce text tokenizers to improve autoregressive performance. We investigate a fundamental question: does visual rendering truly decouple a model from tokenization constraints? Focusing on four Indonesian low-resource local languages that have their own non-Latin scripts (i.e., Javanese, Balinese, Sundanese, and Lampungnese), we evaluate the impact of script-tokenizer alignment within the DualGPT architecture. Our results show that, despite visual rendering, reintegrating a text tokenizer into the architecture reintroduces the very issue that pixel-based language modeling aims to resolve: tokenizer misalignment. We show that the Llama 2 tokenizer, despite having lower OOV and fertility rates, performs significantly worse than a custom tokenizer, which yields improvements of up to 30.15 chrF++. Our findings serve as a warning for future multimodal variants, as text tokenizers remain a significant barrier to equitable models.
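
The abstract compares tokenizers by OOV rate and fertility without defining them. The Python sketch below shows one common way such statistics are computed; it is an illustration under stated assumptions, not the paper's evaluation code. Fertility is taken here as the average number of tokens per whitespace-separated word, and OOV rate as the share of produced tokens that map to an unknown symbol. The names `fertility`, `oov_rate`, `make_toy_tokenizer`, and the `<unk>` symbol are hypothetical, and whitespace word splitting is itself an assumption that may not suit every script.

```python
from typing import Callable, Iterable, List, Set

UNK = "<unk>"  # hypothetical unknown-token symbol, assumed for this sketch

Tokenizer = Callable[[str], List[str]]


def fertility(texts: Iterable[str], tokenize: Tokenizer) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    return n_tokens / n_words if n_words else 0.0


def oov_rate(texts: Iterable[str], tokenize: Tokenizer, unk: str = UNK) -> float:
    """Fraction of produced tokens that equal the unknown-token symbol."""
    n_tokens = 0
    n_unk = 0
    for text in texts:
        tokens = tokenize(text)
        n_tokens += len(tokens)
        n_unk += sum(1 for tok in tokens if tok == unk)
    return n_unk / n_tokens if n_tokens else 0.0


def make_toy_tokenizer(vocab: Set[str]) -> Tokenizer:
    """Toy stand-in for a real tokenizer: emit whole words found in the
    vocabulary, otherwise fall back to characters, using UNK for any
    character that is not in the vocabulary."""

    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        for word in text.split():
            if word in vocab:
                tokens.append(word)
            else:
                tokens.extend(ch if ch in vocab else UNK for ch in word)
        return tokens

    return tokenize


if __name__ == "__main__":
    tok = make_toy_tokenizer({"the", "cat", "s", "a", "t"})
    corpus = ["the cat sat"]
    print("fertility:", fertility(corpus, tok))
    print("OOV rate:", oov_rate(corpus, tok))
```

To compare two real tokenizers, one would pass each tokenizer's own encode function as `tokenize` over the same corpus. As the abstract reports, lower values on both statistics did not translate into better downstream chrF++ for the Llama 2 tokenizer, so these numbers are best read as diagnostics rather than as predictors of model quality.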
