2601.15664v1 Jan 22, 2026 cs.CV

Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling

Yahui Zhou (Citations: 1,330; h-index: 16)
Hongbo Liu (Citations: 43; h-index: 3)
Size Wu (Citations: 777; h-index: 11)
Hongyang Wei (Citations: 173; h-index: 7)
Zidong Wang (Citations: 55; h-index: 3)
Yi Peng (Citations: 393; h-index: 10)
Baixin Xu (Citations: 179; h-index: 5)
Xuying Zhang (Citations: 9; h-index: 2)
Xianglong He (Citations: 350; h-index: 7)
Zexiang Liu (Citations: 154; h-index: 4)
Peiyu Wang (Citations: 193; h-index: 7)
Xuchen Song (Citations: 347; h-index: 10)
Yangguang Li (Citations: 9; h-index: 2)
Yang Liu (Citations: 207; h-index: 7)

Abstract

The recent surge in popularity of Nano-Banana and Seedream 4.0 underscores the community's strong interest in multi-image composition tasks. Compared to single-image editing, multi-image composition presents significantly greater challenges in terms of consistency and quality, yet existing models have not disclosed specific methodological details for achieving high-quality fusion. Through statistical analysis, we identify Human-Object Interaction (HOI) as the category most sought after by the community. We therefore systematically analyze and implement a state-of-the-art solution for multi-image composition with a primary focus on HOI-centric tasks.

We present Skywork UniPic 3.0, a unified multimodal framework that integrates single-image editing and multi-image composition. Our model supports an arbitrary number (1 to 6) and resolution of input images, as well as arbitrary output resolutions (within a total pixel budget of 1024x1024). To address the challenges of multi-image composition, we design a comprehensive data collection, filtering, and synthesis pipeline, achieving strong performance with only 700K high-quality training samples. Furthermore, we introduce a novel training paradigm that formulates multi-image composition as a sequence-modeling problem, transforming conditional generation into unified sequence synthesis. To accelerate inference, we integrate trajectory mapping and distribution matching into the post-training stage, enabling the model to produce high-fidelity samples in just 8 steps and achieve a 12.5x speedup over standard synthesis sampling.

Skywork UniPic 3.0 achieves state-of-the-art performance on the single-image editing benchmark and surpasses both Nano-Banana and Seedream 4.0 on the multi-image composition benchmark, thereby validating the effectiveness of our data pipeline and training paradigm. Code, models, and dataset are publicly available.
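The abstract states that output resolutions are arbitrary within a total pixel budget of 1024x1024. One way to read that constraint is sketched below: given a desired aspect ratio, pick the largest width and height whose product stays within the budget. This is only an illustration, not the authors' implementation. The function name `fit_to_budget` is hypothetical, and rounding each side down to a multiple of 16 is an assumption borrowed from common latent-diffusion practice that the abstract does not mention.

```python
import math

# Total output pixel budget, as stated in the abstract (1024 x 1024).
PIXEL_BUDGET = 1024 * 1024


def fit_to_budget(aspect_w: int, aspect_h: int,
                  budget: int = PIXEL_BUDGET,
                  multiple: int = 16) -> tuple[int, int]:
    """Hypothetical helper: largest (width, height) with the requested
    aspect ratio whose area does not exceed `budget`.

    Rounding each side down to `multiple` is an assumption, not something
    the paper's abstract specifies.
    """
    if aspect_w <= 0 or aspect_h <= 0:
        raise ValueError("aspect ratio terms must be positive")
    # Scale factor so that (aspect_w * s) * (aspect_h * s) == budget.
    scale = math.sqrt(budget / (aspect_w * aspect_h))
    # Rounding down on both sides keeps the area within the budget.
    width = int(aspect_w * scale) // multiple * multiple
    height = int(aspect_h * scale) // multiple * multiple
    return width, height


if __name__ == "__main__":
    for ratio in [(1, 1), (16, 9), (3, 4)]:
        w, h = fit_to_budget(*ratio)
        print(ratio, "->", (w, h), "pixels:", w * h)
```

Because both sides are rounded down, the resulting area never exceeds the budget, at the cost of a slight drift from the exact aspect ratio for some inputs.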
