2601.13260v2 Jan 19, 2026 cs.CL

Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

Md Tahmid Rahman Laskar

York University

Sawsan Alqahtani

Mir Tafseer Nayeem

Tasnim Mohiuddin

M Saiful Bari

Nanyang Technological University

Abstract

Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic structure, amplify bias, and waste capacity across languages and domains. This paper reframes tokenization as a core modeling decision rather than a preprocessing step. We argue for a context-aware framework that integrates tokenizer and model co-design, guided by linguistic, domain, and deployment considerations. Standardized evaluation and transparent reporting are essential to make tokenization choices accountable and comparable. Treating tokenization as a core design problem, not a technical afterthought, can yield language technologies that are fairer, more efficient, and more adaptable.
