2602.02051v1 Feb 02, 2026 cs.AI

SIDiffAgent: Self-Improving Diffusion Agent

Shivank Garg

Citations: 37

h-index: 4

Ayush Singh

Citations: 28

h-index: 3

Gaurav Nayak

Citations: 0

h-index: 0

Abstract

Text-to-image diffusion models have revolutionized generative AI, enabling high-quality, photorealistic image synthesis. However, their practical deployment remains hindered by several limitations: sensitivity to prompt phrasing, ambiguity in semantic interpretation (e.g., "mouse" as an animal vs. a computer peripheral), artifacts such as distorted anatomy, and the need for carefully engineered input prompts. Existing methods often require additional training and offer limited controllability, restricting their adaptability in real-world applications. We introduce Self-Improving Diffusion Agent (SIDiffAgent), a training-free agentic framework that leverages the Qwen family of models (Qwen-VL, Qwen-Image, Qwen-Edit, Qwen-Embedding) to address these challenges. SIDiffAgent autonomously manages prompt engineering, detects and corrects poor generations, and performs fine-grained artifact removal, yielding more reliable and consistent outputs. It further incorporates iterative self-improvement by storing a memory of previous experiences in a database. This database of past experiences is then used to inject prompt-based guidance at each stage of the agentic pipeline. SIDiffAgent achieved an average VQA score of 0.884 on GenAIBench, significantly outperforming open-source models, proprietary models, and agentic methods. We will publicly release our code upon acceptance.
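
The abstract describes a generate, evaluate, and correct loop backed by a database of past experiences that supplies prompt-based guidance. A minimal sketch of that idea follows. It is not the authors' implementation: every name here (`ExperienceMemory`, `run_agent`, `embed`, the `threshold` and `max_rounds` parameters) is hypothetical, the bag-of-words `embed` is a toy stand-in for an embedding model such as Qwen-Embedding, and `generate` and `evaluate` are caller-supplied placeholders for the image generator and the vision-language judge.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class ExperienceMemory:
    """Stores (prompt, lesson) pairs and retrieves lessons from similar prompts."""

    entries: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, prompt: str, lesson: str) -> None:
        self.entries.append((prompt, lesson))

    def retrieve(self, prompt: str, k: int = 2) -> List[str]:
        query = embed(prompt)
        scored = [(cosine(query, embed(p)), lesson) for p, lesson in self.entries]
        scored = [s for s in scored if s[0] > 0]  # drop unrelated experiences
        scored.sort(key=lambda s: s[0], reverse=True)
        return [lesson for _, lesson in scored[:k]]


def run_agent(
    prompt: str,
    generate: Callable[[str], Any],
    evaluate: Callable[[str, Any], Tuple[float, str]],
    memory: ExperienceMemory,
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Tuple[Any, float]:
    """Generate, score, and retry, injecting past lessons as prompt guidance."""
    guidance = memory.retrieve(prompt)
    refined = prompt if not guidance else prompt + " | guidance: " + "; ".join(guidance)
    best_image, best_score = None, -1.0
    for _ in range(max_rounds):
        image = generate(refined)
        score, critique = evaluate(prompt, image)
        if score > best_score:
            best_image, best_score = image, score
        if score >= threshold:
            break
        # Record the failure so later runs on similar prompts start with this hint.
        memory.add(prompt, critique)
        refined = refined + " | fix: " + critique
    return best_image, best_score
```

In this sketch, a first run on an ambiguous prompt such as "a mouse on a desk" may fail the evaluator once and store its critique. A later run on a similar prompt then retrieves that critique and includes it in the very first generation attempt, which is the self-improvement effect the abstract attributes to the experience database.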

0 Citations

0 Influential

2 Altmetric

10.0 Score
