2603.22796v1 Mar 24, 2026 cs.CV

PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding

Zhe Gan (Citations: 566, h-index: 7)
Lirong Che (Citations: 0, h-index: 0)
Junbo Tan (Citations: 546, h-index: 15)
Yanbo Chen (Citations: 66, h-index: 3)
Xueqian Wang (Citations: 7, h-index: 1)

Original Abstract

Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Model (LMM) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM-driven chain-of-thought (CoT) reasoning, allowing an analytical solver to compute a high-quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This "mental simulation" replaces costly and slow physical trial-and-error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image quality.
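
The two-stage idea in the abstract (solve for an initial viewpoint analytically, then refine it against simulated feedback) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the single constraint (the subject fills a given fraction of the frame height under a pinhole camera model), the function names `initial_camera_position` and `refine`, and the greedy local search. The `score_fn` argument is a hypothetical stand-in for rendering a candidate view in the 3DGS world model and scoring it.

```python
import math


def initial_camera_position(subject_xyz, subject_height, frame_fraction,
                            vertical_fov_deg, azimuth_deg):
    """Place a pinhole camera so the subject spans `frame_fraction` of the image height.

    Illustrative constraint only; the paper's actual solver is not described here.
    """
    # At distance d the visible height is 2 * d * tan(fov / 2); solve for d.
    half_fov = math.radians(vertical_fov_deg) / 2.0
    distance = subject_height / (2.0 * frame_fraction * math.tan(half_fov))
    azimuth = math.radians(azimuth_deg)
    sx, sy, sz = subject_xyz
    # The camera sits on a horizontal circle around the subject, at the same height.
    return (sx + distance * math.cos(azimuth),
            sy + distance * math.sin(azimuth),
            sz)


def refine(position, score_fn, step=0.25, iterations=20):
    """Greedy local search: accept an axis-aligned move only if the score improves.

    `score_fn` is a hypothetical placeholder for "render the view in the
    internal world model and rate it".
    """
    best, best_score = position, score_fn(position)
    offsets = [(step, 0, 0), (-step, 0, 0),
               (0, step, 0), (0, -step, 0),
               (0, 0, step), (0, 0, -step)]
    for _ in range(iterations):
        improved = False
        for dx, dy, dz in offsets:
            candidate = (best[0] + dx, best[1] + dy, best[2] + dz)
            score = score_fn(candidate)
            if score > best_score:
                best, best_score = candidate, score
                improved = True
        if not improved:
            break  # no neighbouring pose scores higher; stop early
    return best, best_score
```

Because every candidate pose is scored by calling `score_fn` rather than by moving a physical robot, the refinement loop is cheap to run, which is the role the abstract assigns to the internal world model.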

0 Citations

0 Influential

7.5 Altmetric

37.5 Score
