2603.03143v1 Mar 03, 2026 cs.CV

Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Guosheng Lin (Citations: 2, h-index: 1)
Zhenlong Yuan (Citations: 121, h-index: 7)
Jiyuan Wang (Citations: 60, h-index: 4)
Lei Sun (Citations: 291, h-index: 9)
Zhiqun Cao (Citations: 0, h-index: 0)
Yuyang Yin (Citations: 329, h-index: 5)
Lang Nie (Citations: 1,192, h-index: 16)
Xiangxiang Chu (Citations: 155, h-index: 8)
K. Liao (Citations: 1,592, h-index: 18)
Chunyu Lin (Citations: 1,645, h-index: 18)
Yunchao Wei (Citations: 5, h-index: 2)

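Performing 3D editing by leveraging the priors of 2D diffusion models has emerged as a promising paradigm. However, maintaining multi-view consistency in the edited results remains difficult, and the extreme scarcity of 3D-consistent editing data makes supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that although generating multi-view consistent 3D content is very difficult, verifying 3D consistency is comparatively easy. Based on this observation, we present reinforcement learning (RL) as a feasible solution. Accordingly, we propose RL3DEdit, a single-pass framework based on RL optimization that uses novel rewards derived from VGGT, a 3D foundation model. Specifically, we exploit the strong priors that VGGT has learned from large-scale real-world data: the edited images are fed to VGGT, and the confidence maps and pose estimation errors it outputs serve as reward signals, so that RL effectively anchors the 2D editing priors onto a 3D-consistent manifold. Extensive experiments show that RL3DEdit achieves stable multi-view consistency and, with high efficiency, delivers better editing quality than state-of-the-art methods. To promote the development of 3D editing, we will release the code and model.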

Original Abstract

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent paired editing data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data, feed it the edited images, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
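
The abstract describes the reward only at a high level: confidence maps and pose estimation errors from a 3D model serve as the signal. As a rough illustration of how two such quantities could be combined into one scalar, here is a minimal sketch. It is not the paper's formulation. The function names, the weights `w_conf` and `w_pose`, the use of rotation-only error, and the choice of the original scene's camera rotations as the reference are all assumptions made for this example, and the inputs are plain arrays rather than real VGGT outputs.

```python
import numpy as np


def rotation_angle_deg(r_a, r_b):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    r_rel = r_a.T @ r_b
    cos_angle = (np.trace(r_rel) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def consistency_reward(conf_maps, est_rotations, ref_rotations,
                       w_conf=1.0, w_pose=0.1):
    """Hypothetical scalar reward for a set of edited views.

    conf_maps:     (V, H, W) per-pixel confidence, higher is better.
    est_rotations: (V, 3, 3) camera rotations estimated from the edited views.
    ref_rotations: (V, 3, 3) reference rotations (assumed here to be the
                   cameras of the unedited scene).
    w_conf, w_pose: illustrative weights, not values from the paper.
    """
    conf_term = float(np.mean(conf_maps))
    pose_err = float(np.mean([
        rotation_angle_deg(r_est, r_ref)
        for r_est, r_ref in zip(est_rotations, ref_rotations)
    ]))
    # Reward high confidence, penalize mean rotation error in degrees.
    return w_conf * conf_term - w_pose * pose_err


if __name__ == "__main__":
    conf = np.full((2, 4, 4), 0.5)
    identity = np.stack([np.eye(3), np.eye(3)])
    print(consistency_reward(conf, identity, identity))
```

In this sketch, views whose estimated poses drift from the reference lower the reward, while views the 3D model is confident about raise it, which mirrors the "verification is easier than generation" observation in the abstract.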

0 Citations

0 Influential

9 Altmetric

45.0 Score
