2602.18540v1 Feb 20, 2026 cs.CV

Rodent-Bench

Laurence Aitchison

Citations: 138

h-index: 5

Thomas Heap

Citations: 29

h-index: 2

Emma N Cahill

Citations: 585

h-index: 12

Adriana Casado Rodriguez

Citations: 0

h-index: 0

Original Abstract

We present Rodent-Bench, a novel benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to annotate rodent behaviour footage. We evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash and Qwen-VL-Max, using this benchmark and find that none of these models performs strongly enough to be used as an assistant for this task. Our benchmark encompasses diverse datasets spanning multiple behavioral paradigms including social interactions, grooming, scratching, and freezing behaviors, with videos ranging from 10 to 35 minutes in length. We provide two benchmark versions to accommodate varying model capabilities and establish standardized evaluation metrics including second-wise accuracy, macro F1, mean average precision, mutual information, and the Matthews correlation coefficient. While some models show modest performance on certain datasets (notably grooming detection), overall results reveal significant challenges in temporal segmentation, handling extended video sequences, and distinguishing subtle behavioral states. Our analysis identifies key limitations in current MLLMs for scientific video annotation and provides insights for future model development. Rodent-Bench serves as a foundation for tracking progress toward reliable automated behavioral annotation in neuroscience research.
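As a rough illustration of how per-second metrics such as those named in the abstract can be computed, the following sketch expands segment annotations into one label per second and scores a prediction against the ground truth. The abstract does not give the benchmark's exact definitions, so everything here is an assumption: the function names (`expand_segments`, `secondwise_accuracy`, `macro_f1`, `binary_mcc`), the `(start_s, end_s, label)` segment format, the `"other"` background label, and the use of a binary one-vs-rest Matthews correlation coefficient are all hypothetical choices for illustration, not the authors' implementation.

```python
from math import sqrt


def expand_segments(segments, duration, default="other"):
    """Turn (start_s, end_s, label) segments into one label per second.

    Assumed format: integer seconds, end-exclusive; unlabeled seconds
    receive the hypothetical background label `default`.
    """
    labels = [default] * duration
    for start, end, label in segments:
        for t in range(max(0, start), min(duration, end)):
            labels[t] = label
    return labels


def secondwise_accuracy(truth, pred):
    """Fraction of seconds on which prediction and ground truth agree."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def macro_f1(truth, pred):
    """Unweighted mean of per-class F1 over all classes seen in either sequence."""
    classes = sorted(set(truth) | set(pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(truth, pred))
        fp = sum(t != c and p == c for t, p in zip(truth, pred))
        fn = sum(t == c and p != c for t, p in zip(truth, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def binary_mcc(truth, pred, positive):
    """Matthews correlation coefficient, treating `positive` vs. everything else."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    tn = sum(t != positive and p != positive for t, p in zip(truth, pred))
    fp = sum(t != positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy 10-second clip: the predicted grooming bout is shifted by two seconds.
truth = expand_segments([(2, 6, "grooming")], duration=10)
pred = expand_segments([(4, 8, "grooming")], duration=10)

accuracy = secondwise_accuracy(truth, pred)
f1 = macro_f1(truth, pred)
mcc = binary_mcc(truth, pred, positive="grooming")
print(f"accuracy={accuracy:.3f} macro_f1={f1:.3f} mcc={mcc:.3f}")
```

In this toy case a two-second shift in a four-second bout still leaves 60% of seconds correct, while the correlation-based score is much lower, which is one reason a benchmark might report several metrics side by side.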
