2603.29387v1 Mar 31, 2026 cs.CV

Extend3D: Town-Scale 3D Generation

Seungwoo Yoon

Jinmo Kim

Jaesik Park

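This paper proposes Extend3D, a training-free pipeline that generates a 3D scene from a single image. Extend3D is built on an object-centric 3D generative model. To overcome the fixed-size latent space that limits object-centric models in representing wide scenes, we extend the latent space along the x and y axes. We divide the extended latent space into overlapping patches, apply the object-centric 3D generative model to each patch, and couple the patches at every time step. Because image-conditioned, patch-wise 3D generation requires strict spatial alignment between the image and the latent patches, we initialize the scene with a point cloud prior obtained from a monocular depth estimator and iteratively refine occluded regions with SDEdit. Treating the incompleteness of the 3D structure as noise during 3D refinement makes 3D completion possible, an idea we call "under-noising." In addition, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising so that the denoising trajectories stay consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives for improved geometric structure and texture fidelity. Human preference studies and quantitative experiments show that our method produces better results than prior methods.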

Original Abstract

In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces in object-centric models for representing wide scenes, we extend the latent space in the $x$ and $y$ directions. Then, by dividing the extended latent space into overlapping patches, we apply the object-centric 3D generative model to each patch and couple them at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We discovered that treating the incompleteness of the 3D structure as noise during 3D refinement enables 3D completion, a concept we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives for improved geometric structure and texture fidelity. We demonstrate that our method yields better results than prior methods, as evidenced by human preference and quantitative experiments.
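
The patch-wise step described in the abstract (split an x/y-extended latent into overlapping patches, run the model on each, couple them at every time step) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how patches are coupled, so plain averaging of overlapping regions is assumed here, and `patch_origins`, `coupled_step`, `step_fn`, the latent layout `(X, Y, Z, C)`, and the patch and stride sizes are all hypothetical names and values chosen for illustration.

```python
import numpy as np


def patch_origins(extent, patch, stride):
    """Start indices of patches of length `patch` that cover [0, extent).

    The last patch is placed flush with the far edge so no cell is left out.
    """
    if patch > extent:
        raise ValueError("patch must not exceed the latent extent")
    if stride > patch:
        raise ValueError("stride larger than patch would leave gaps")
    starts = list(range(0, extent - patch + 1, stride))
    if starts[-1] != extent - patch:
        starts.append(extent - patch)
    return starts


def coupled_step(latent, step_fn, patch=4, stride=2):
    """One coupled update over an extended latent of shape (X, Y, Z, C).

    `step_fn` stands in for one step of an object-centric model applied to a
    single fixed-size patch (hypothetical placeholder). Overlapping outputs
    are averaged, which is an assumed coupling rule, not the paper's.
    """
    X, Y = latent.shape[:2]
    acc = np.zeros_like(latent, dtype=np.float64)
    # Per-cell count of how many patches touched it; broadcast over Z and C.
    weight = np.zeros((X, Y) + (1,) * (latent.ndim - 2))
    for x0 in patch_origins(X, patch, stride):
        for y0 in patch_origins(Y, patch, stride):
            sl = (slice(x0, x0 + patch), slice(y0, y0 + patch))
            acc[sl] += step_fn(latent[sl])
            weight[sl] += 1.0
    return acc / weight
```

In a full sampler, `coupled_step` would be called once per time step with the real per-patch model in place of `step_fn`, so that neighbouring patches are pulled toward agreement on their shared cells throughout sampling rather than being stitched together only at the end.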
