2601.21342v1 Jan 29, 2026 cs.AI

Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores

Guandong Kou, Chuanlei Dong, Bingkun Wei, Wenguo Duan, Gongpeng Zhao, Jun Zhou, Li Yu, Jichen Li, Shicheng Hu, Kai Li, Wei Xia, Zhiyong Shen, Zun Li

Original Abstract

Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained and standardized benchmark spanning single-image, multi-image, and video inputs, making it challenging to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. Leveraging a multi-stage training strategy, Ostrakon-VL achieves an average score of 60.1 on ShopBench, establishing a new state of the art among open-source MLLMs with comparable parameter scales and diverse architectures. Notably, it surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7, and exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating significantly improved parameter efficiency. These results indicate that Ostrakon-VL delivers more robust and reliable FSRS-centric perception and decision-making capabilities. To facilitate reproducible research, we will publicly release Ostrakon-VL and the ShopBench benchmark.
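
The margins quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the three average ShopBench scores reported above; the dictionary and the helper name `margin_over` are illustrative assumptions, not part of any released code.

```python
# Average ShopBench scores exactly as reported in the abstract.
reported_scores = {
    "Ostrakon-VL": 60.1,
    "Qwen3-VL-235B-A22B": 59.4,
    "Qwen3-VL-8B": 55.3,
}


def margin_over(baseline: str, model: str = "Ostrakon-VL") -> float:
    """Return the model's score minus the baseline's, rounded to one decimal."""
    return round(reported_scores[model] - reported_scores[baseline], 1)


# Margin of Ostrakon-VL over each of the other listed models (hypothetical helper usage).
margins = {
    name: margin_over(name)
    for name in reported_scores
    if name != "Ostrakon-VL"
}

for name, delta in margins.items():
    print(f"Ostrakon-VL vs {name}: {delta:+.1f}")
```

Rounding to one decimal matches the precision of the reported scores and avoids floating-point noise in the differences.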
