2603.08652v1 Mar 09, 2026 cs.AI

CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

Haodong Li (Citations: 7, h-index: 2)
Juanxi Tian (Citations: 35, h-index: 3)
Hong Peng (Citations: 19, h-index: 2)
Yuhong Dai (Citations: 19, h-index: 2)
Chunmei Qing (Citations: 7, h-index: 2)
Huan Zhang (Citations: 14, h-index: 3)
Dongzhi Jiang (Citations: 1,360, h-index: 13)
Yihang Zou (Citations: 6, h-index: 1)
Dingming Li (Citations: 67, h-index: 2)
Zepeng Lin (Citations: 24, h-index: 3)
Yi Zhou (Citations: 2, h-index: 1)
Siqi Dai (Citations: 6, h-index: 1)
Jingwei Wu (Citations: 3, h-index: 1)

Original Abstract

Recent advancements in Unified Multimodal Models (UMMs) have significantly advanced text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods largely rely on abstract natural-language planning, which lacks the precision required for complex spatial layouts, structured visual elements, and dense textual content. In this work, we propose CoCo (Code-as-CoT), a code-driven reasoning framework that represents the reasoning process as executable code, enabling explicit and verifiable intermediate planning for image generation. Given a text prompt, CoCo first generates executable code that specifies the structural layout of the scene, which is then executed in a sandboxed environment to render a deterministic draft image. The model subsequently refines this draft through fine-grained image editing to produce the final high-fidelity result. To support this training paradigm, we construct CoCo-10K, a curated dataset containing structured draft-final image pairs designed to teach both structured draft construction and corrective visual refinement. Empirical evaluations on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo achieves improvements of +68.83%, +54.8%, and +41.23% over direct generation, while also outperforming other generation methods empowered by CoT. These results demonstrate that executable code is an effective and reliable reasoning paradigm for precise, controllable, and structured text-to-image generation. The code is available at: https://github.com/micky-li-hd/CoCo
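The draft stage described in the abstract (model-written code executed to render a deterministic draft image) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the CoCo repository: the `Canvas` helper, the `render_draft` function, the example layout code, and the PPM output format are all hypothetical. The restricted `exec` namespace is not a real sandbox, and the learned refinement step that turns the draft into the final image is not sketched.

```python
import hashlib


class Canvas:
    """Tiny RGB raster; a hypothetical stand-in for a real drawing library."""

    def __init__(self, width, height, background=(255, 255, 255)):
        self.width, self.height = width, height
        self.pixels = [[background] * width for _ in range(height)]

    def rect(self, x, y, w, h, color):
        # Fill an axis-aligned rectangle, clipped to the canvas bounds.
        for row in range(max(0, y), min(self.height, y + h)):
            for col in range(max(0, x), min(self.width, x + w)):
                self.pixels[row][col] = color

    def to_ppm(self):
        # Serialize as binary PPM (P6) so the draft can be hashed or saved.
        header = f"P6 {self.width} {self.height} 255\n".encode("ascii")
        body = bytes(ch for row in self.pixels for px in row for ch in px)
        return header + body


def render_draft(layout_code, width=64, height=48):
    """Execute layout code against a fresh canvas and return the canvas."""
    canvas = Canvas(width, height)
    # Restricted namespace for illustration only: this is NOT a security
    # boundary. A real system would isolate execution at the process level.
    namespace = {"__builtins__": {}, "canvas": canvas}
    exec(layout_code, namespace)
    return canvas


# Hypothetical code a model might emit for "a red sign above a blue table".
LAYOUT_CODE = """
canvas.rect(20, 5, 24, 10, (200, 30, 30))   # sign
canvas.rect(8, 30, 48, 12, (30, 60, 200))   # table
"""

draft_a = render_draft(LAYOUT_CODE)
draft_b = render_draft(LAYOUT_CODE)

# The same code yields byte-identical drafts, which is what makes the
# intermediate plan checkable before any refinement happens.
same_bytes = draft_a.to_ppm() == draft_b.to_ppm()
print("deterministic:", same_bytes)
print("draft sha256 prefix:", hashlib.sha256(draft_a.to_ppm()).hexdigest()[:12])
```

Because the draft is produced by running code rather than by sampling, the layout can be inspected or tested (for example, checking that a region has the expected color) before the refinement model sees it.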
