2603.07294v1 Mar 07, 2026 cs.CV

MAviS: A Multimodal Conversational Assistant For Avian Species

Jinxing Zhou (Citations: 14, h-index: 2)
R. Anwer (Citations: 6,356, h-index: 38)
Hisham Cholakkal (Citations: 5,393, h-index: 35)
Yevheniia Kryklyvets (Citations: 3, h-index: 1)
Mohammed Irfan Kurpath (Citations: 56, h-index: 4)
Sahal Shaji Mullappilly (Citations: 853, h-index: 9)
F. Khan (Citations: 308, h-index: 10)
Salman H. Khan (Citations: 715, h-index: 12)

Abstract

Fine-grained understanding and species-specific multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the MAviS-Dataset, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question-answer pairs. Building on the MAviS-Dataset, we introduce MAviS-Chat, a multimodal LLM that supports audio, vision, and text and is designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present MAviS-Bench, a benchmark of over 25,000 QA pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive multimodal LLMs for ecological applications.

3 Citations

0 Influential

19 Altmetric

98.0 Score
