2602.18882v1 Feb 21, 2026 cs.CV

SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

J. E. Lenssen

Citations: 10,374

h-index: 25

Mohammad Asim

Citations: 96

h-index: 2

Christopher Wewer

Citations: 342

h-index: 8

Original Abstract

We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation-invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a light-weight rectified flow decoder. We show that the compression is 1-3 orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including ones deviating from the input trajectory, and we show that the decoder gracefully handles uncertainty. Finally, the highly-compressed set of unstructured latent scene tokens enables simple and efficient scene generation in 5 seconds, achieving a much better quality-speed trade-off than previous paradigms.
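The abstract describes the scene tokens as a permutation-invariant set that is disentangled from the spatial grid. The following is a minimal sketch of what that property means for a consumer of the tokens, not the SceneTok decoder itself. It assumes a hypothetical single-head dot-product attention readout (`cross_attend`), random stand-in tokens, and arbitrary sizes (`DIM`, `NUM_TOKENS`); none of these names or values come from the paper. Because the readout is a softmax-weighted sum over the set, shuffling the tokens should leave the result unchanged up to floating-point rounding.

```python
import math
import random


def cross_attend(query, tokens):
    """Softmax-weighted average of `tokens`, weighted by similarity to `query`.

    Hypothetical readout for illustration only; the paper's decoder may differ.
    """
    dim = len(query)
    # Scaled dot-product score between the query and every token.
    scores = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim) for tok in tokens
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    # The weighted sum runs over the whole set, so token order does not matter.
    return [
        sum(w * tok[i] for w, tok in zip(weights, tokens)) / total
        for i in range(dim)
    ]


random.seed(0)
DIM, NUM_TOKENS = 8, 16  # assumed sizes, chosen only for the demo

# Random stand-ins for latent scene tokens and a single decoder query.
scene_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
query = [random.gauss(0, 1) for _ in range(DIM)]

# Same set of tokens, different order.
shuffled = scene_tokens[:]
random.shuffle(shuffled)

out_original = cross_attend(query, scene_tokens)
out_shuffled = cross_attend(query, shuffled)
max_diff = max(abs(a - b) for a, b in zip(out_original, out_shuffled))
print(f"max difference after shuffling: {max_diff:.2e}")
```

A grid-aligned representation would not pass this check in general, since the position of each entry carries meaning there. In a set representation, any spatial information has to live inside the token contents.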

1 Citations

0 Influential

12.5 Altmetric

63.5 Score
