2601.16210v2 Jan 22, 2026 cs.CV

PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

Ismini Lourentzou

Citations: 1,280

h-index: 18

Onkar Susladkar

Citations: 206

h-index: 8

Tushar Prakash

Citations: 0

h-index: 0

Adheesh Juvekar

Citations: 40

h-index: 3

Kiet A. Nguyen

Citations: 23

h-index: 3

Dong-Hwan Jang

Citations: 93

h-index: 2

I. Dhillon

Citations: 39,247

h-index: 91

Abstract

Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, while scaling robustly up to 4K/8K resolutions.
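
The abstract does not say how the shared binary codebook is implemented, so the following is only a minimal sketch of one common way to realize binary quantization: each feature vector is reduced to the signs of its channels, and the resulting bit pattern is read as a token id in an implicit codebook of size 2^d. The same mapping is applied at every pyramid level, which is what "shared" is assumed to mean here. The function names (`binary_quantize`, `decode_token`, `tokenize_pyramid`) and the toy feature values are hypothetical, and the sketch omits the text guidance and the autoregressive objective entirely; it is not the authors' LaPQ module.

```python
from typing import List, Sequence


def binary_quantize(vec: Sequence[float]) -> int:
    """Map a feature vector to a token id from the sign of each channel.

    Channel i contributes bit i: 1 if the value is positive, else 0.
    (Assumed scheme for illustration; not taken from the paper.)
    """
    token = 0
    for i, x in enumerate(vec):
        if x > 0:
            token |= 1 << i
    return token


def decode_token(token: int, dim: int) -> List[float]:
    """Recover the binary code (+1/-1 per channel) that a token id stands for."""
    return [1.0 if (token >> i) & 1 else -1.0 for i in range(dim)]


def tokenize_pyramid(levels: Sequence[Sequence[Sequence[float]]]) -> List[List[int]]:
    """Quantize every level of a feature pyramid with the same binary mapping.

    `levels` is ordered coarse to fine; each level is a list of d-dimensional
    feature vectors. All levels share one implicit codebook of size 2**d.
    """
    return [[binary_quantize(v) for v in feats] for feats in levels]


# Toy pyramid with d = 4: one coarse vector, two finer vectors.
coarse = [[0.3, -1.2, 0.8, -0.1]]
fine = [[-0.5, 0.9, 0.1, 0.4], [1.0, 1.0, -1.0, -1.0]]
tokens = tokenize_pyramid([coarse, fine])
```

With d channels the implicit vocabulary has 2^d entries without storing any embedding table, which is one reason binary codebooks can be made very large; whether PyraTok uses exactly this construction is not stated in the abstract.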
