2603.28474v1 Mar 30, 2026 cs.CV

CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains

Zhongtian Ma

Citations: 69

h-index: 3

Pengfei Liu

Citations: 137

h-index: 2

Wenhan Wang

Citations: 56

h-index: 3

Zhixiang Zhou

Citations: 205

h-index: 3

Yanzhu Chen

Citations: 0

h-index: 0

Ziyu Lin

Citations: 0

h-index: 0

Hao Sheng

Citations: 4

h-index: 1

Hongli Ma

Citations: 7

h-index: 1

Wenqi Shao

Citations: 5,554

h-index: 22

Qiaosheng Zhang

Citations: 217

h-index: 4

Yu Qiao

Citations: 1,645

h-index: 20

Abstract

The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent, a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset, CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question-answering pairs, and further establish a comprehensive benchmark, CiQi-Bench, aligned with the same six attributes. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.
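The abstract reports accuracy per attribute and an average across the six attributes. As a rough illustration of how such a score could be computed, the sketch below tallies accuracy per attribute and then takes an unweighted mean. The record keys (`attribute`, `prediction`, `answer`), the case-insensitive exact-match rule, and the function names are assumptions made for this example; the abstract does not specify the benchmark's actual data schema or scoring rule.

```python
from collections import defaultdict

# The six attributes named in the abstract.
ATTRIBUTES = [
    "dynasty", "reign period", "kiln site",
    "glaze color", "decorative motif", "vessel shape",
]

def per_attribute_accuracy(records):
    """Return {attribute: accuracy} for attributes that have at least one record.

    `records` is an iterable of dicts with hypothetical keys
    'attribute', 'prediction', and 'answer' (not the dataset's real schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        attr = record["attribute"]
        total[attr] += 1
        # Assumed scoring rule: case-insensitive exact match.
        if record["prediction"].strip().lower() == record["answer"].strip().lower():
            correct[attr] += 1
    return {a: correct[a] / total[a] for a in ATTRIBUTES if total[a] > 0}

def macro_average(accuracy_by_attribute):
    """Unweighted mean over attributes, so each attribute counts equally."""
    if not accuracy_by_attribute:
        return 0.0
    return sum(accuracy_by_attribute.values()) / len(accuracy_by_attribute)

# Toy records invented for illustration only; they are not benchmark data.
toy_records = [
    {"attribute": "dynasty", "prediction": "Ming", "answer": "Ming"},
    {"attribute": "dynasty", "prediction": "Qing", "answer": "Ming"},
    {"attribute": "glaze color", "prediction": "celadon", "answer": "Celadon"},
]
toy_accuracy = per_attribute_accuracy(toy_records)
toy_average = macro_average(toy_accuracy)
```

Averaging per attribute rather than per question keeps an attribute with many questions from dominating the overall score; whether CiQi-Bench averages this way is not stated in the abstract.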

0 Citations

0 Influential

31 Altmetric

155.0 Score

