2604.16785v1 Apr 18, 2026 cs.CV

Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games

Hanling Yi

Feng Lin

Yifan Yang

Xiaotian Yu

Rong Xiao

Mao-Lin Luo

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled open-ended object recognition, yet they struggle with fine-grained tasks. In contrast, CLIP-style models excel at fine-grained recognition but lack broad coverage of general object categories. To bridge this gap, we propose HyMOR, a Hybrid Multi-granularity open-ended Object Recognition framework that integrates an MLLM with a CLIP model. In HyMOR, the MLLM performs open-ended and coarse-grained object recognition, while the CLIP model specializes in fine-grained identification of domain-specific objects such as animals and plants. This hybrid design enables accurate object understanding across multiple semantic granularities, serving as a robust perceptual foundation for downstream multi-modal content generation and interactive gameplay. To support evaluation in content-rich and educational scenarios, we introduce TBO (TextBook Objects), a dataset containing 20,942 images annotated with 8,816 object categories extracted from textbooks. Extensive experiments demonstrate that HyMOR narrows the fine-grained recognition gap with CLIP to 0.2% while improving general object recognition by 2.5% over a baseline MLLM, measured by average Sentence-BERT (SBert) similarity. Overall, HyMOR achieves a 23.2% improvement in average SBert across all evaluated datasets, highlighting its effectiveness in enabling accurate perception for multi-modal game content generation and interactive learning applications.
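
The abstract does not say how HyMOR decides when an image is handed from the MLLM to the CLIP model, so the following is only a minimal sketch of one plausible arrangement, not the authors' implementation. It assumes the coarse label from the MLLM stage is mapped to a domain (here "animal" or "plant"), and that a domain-specific fine-grained classifier refines the label only when such a domain is found. Every name in it (`recognize`, `Recognition`, `stub_mllm`, `stub_domain`, `stub_fine`) is hypothetical, and the models are replaced by stubs so the routing logic can be read on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Recognition:
    coarse: str   # open-ended label from the (assumed) MLLM stage
    label: str    # final label: refined if a fine-grained model applied
    source: str   # which stage produced the final label


def recognize(image: Any,
              coarse_fn: Callable[[Any], str],
              domain_of: Callable[[str], Optional[str]],
              fine_fns: Dict[str, Callable[[Any], str]]) -> Recognition:
    """Route an image through a coarse stage, then optionally a fine stage.

    Assumption: routing is keyed on the domain of the coarse label.
    The paper's actual hand-off criterion is not given in the abstract.
    """
    coarse = coarse_fn(image)
    domain = domain_of(coarse)
    refine = fine_fns.get(domain) if domain is not None else None
    if refine is None:
        # No specialised model for this domain: keep the coarse label.
        return Recognition(coarse, coarse, "mllm")
    return Recognition(coarse, refine(image), "clip")


# Stub models for illustration only; real MLLM/CLIP calls would go here.
def stub_mllm(image):
    return image["coarse"]


def stub_domain(label):
    return {"bird": "animal", "flower": "plant"}.get(label)


stub_fine = {
    "animal": lambda image: image["species"],
    "plant": lambda image: image["species"],
}

bird = recognize({"coarse": "bird", "species": "barn swallow"},
                 stub_mllm, stub_domain, stub_fine)
desk = recognize({"coarse": "desk"}, stub_mllm, stub_domain, stub_fine)
print(bird)
print(desk)
```

In this sketch the bird image is refined by the fine-grained stub, while the desk image keeps its coarse label because no domain-specific model is registered for it. Keeping the fine-grained models in a dictionary keyed by domain is a design choice of the sketch, made so that further domains could be added without touching the routing function.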
