2604.08213v1 Apr 09, 2026 cs.CV

EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization

Yao Hu (Citations: 22, h-index: 3)

Honghao Cai (Citations: 10, h-index: 2)

Xiangyuan Wang (Citations: 17, h-index: 2)

Tianze Zhou (Citations: 36, h-index: 3)

Yibo Chen (Citations: 13, h-index: 2)

Hao Chen (Citations: 144, h-index: 6)

Xu Tang (Citations: 65, h-index: 5)

Wei Zhu (Citations: 58, h-index: 4)

Yunhao Bai (Citations: 18, h-index: 2)

Abstract

High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors that make them unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) to reach a level of alignment that SFT alone does not achieve. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). In human evaluation, the critical error rate falls from 47.75% to 23% and correctness rises from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.
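The abstract names direct preference optimization as the Stage 2 objective but gives no hyperparameters or implementation details. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair (a preferred and a rejected instruction for the same image pair). The function name `dpo_loss`, the value `beta=0.1`, and the log-probability inputs are assumptions made for this example and are not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    instruction under the trained policy and under the frozen reference
    model. beta=0.1 is an illustrative default, not a value from the paper.
    """
    # How much more the policy favours each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)) = log(1 + exp(-logits)), computed stably.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Hypothetical log-probabilities: the policy already prefers the chosen
    # instruction more strongly than the reference does, so the loss is
    # below log(2), the value obtained when policy and reference agree.
    print(dpo_loss(-12.0, -20.0, -14.0, -15.0))
    print(dpo_loss(-14.0, -15.0, -14.0, -15.0))
```

The loss shrinks as the policy raises the preferred instruction's likelihood relative to the reference model faster than it raises the rejected one's, which is how preference pairs targeting specific failure modes can shift the model beyond what SFT provides.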
