2605.14876v1 May 14, 2026 cs.CV

New Possibilities for Complex Visual Content Generation Through Verified Reasoning

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Ruowang Zhang

Citations: 28

h-index: 3

Hanbo Cheng

Citations: 15

h-index: 2

Limin Lin

Citations: 30

h-index: 4

Yicheng Pan

Citations: 27

h-index: 2

Jun Du

Citations: 139

h-index: 7

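Text-to-image (T2I) models have advanced rapidly, but most still rely on a single-step generation paradigm that struggles with complex semantics and yields diminishing returns as parameters scale. Recent multi-step reasoning approaches are promising, yet they suffer from hallucinations caused by unverified plans, monolithic post-hoc reflection, unstable long-context optimization, and prohibitive inference latency. To address these problems, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that tightly couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to produce reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL), which converts interleaved multimodal histories into explicit reward signals to stabilize long-context optimization. To mitigate the severe latency caused by iterative denoising, we further propose $Δ$-Space Weight Merge (DSWM), a theoretically grounded method that combines alignment weights with existing distillation priors, cutting the per-step inference cost to just 4 function evaluations (NFEs) without costly re-distillation. Extensive experiments show that CLVR outperforms existing open-source models and approaches the performance of proprietary commercial models, providing general test-time scaling for complex visual generation.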

Original Abstract

Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $Δ$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.
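The abstract does not give the DSWM formula, so the following is only a minimal sketch of one common way to merge weights in delta space (task-vector arithmetic), not the authors' method. It assumes that both the alignment-tuned model and the distilled few-step model were derived from the same base checkpoint, and that their differences from the base can simply be added. The names `delta`, `merge_in_delta_space`, `alpha`, and `beta` are hypothetical, and plain Python lists stand in for weight tensors.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def delta(theta: Weights, base: Weights) -> Weights:
    """Return the element-wise difference theta - base for every parameter."""
    return {
        name: [t - b for t, b in zip(theta[name], base[name])]
        for name in base
    }


def merge_in_delta_space(
    base: Weights,
    aligned: Weights,
    distilled: Weights,
    alpha: float = 1.0,  # hypothetical scale for the alignment delta
    beta: float = 1.0,   # hypothetical scale for the distillation delta
) -> Weights:
    """Assumed merge rule: base + alpha * (aligned - base) + beta * (distilled - base)."""
    d_align = delta(aligned, base)
    d_distill = delta(distilled, base)
    return {
        name: [
            b + alpha * da + beta * dd
            for b, da, dd in zip(base[name], d_align[name], d_distill[name])
        ]
        for name in base
    }


# Tiny worked example with a single two-element parameter.
base = {"w": [1.0, 2.0]}
aligned = {"w": [1.5, 2.0]}    # alignment training moved the first element
distilled = {"w": [1.0, 1.0]}  # distillation moved the second element
merged = merge_in_delta_space(base, aligned, distilled)
print(merged)
```

In this toy case the two deltas touch different elements, so the merged weights carry both changes. Real checkpoints would overlap, which is where the scaling factors (and whatever theory the paper provides) would matter.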

