2602.00982v1 Feb 01, 2026 cs.CV


Navigating Simply, Aligning Deeply: Winning Solutions for Mouse vs. AI 2025

Nguyen Lam Phu Quy

Chi-Nguyen Tran

Phu-Hoa Pham

Dao Sy Duy Minh

Huynh Trung Kiet

Abstract

Visual robustness and neural alignment remain critical challenges in developing artificial agents that can match biological vision systems. We present the winning approaches from Team HCMUS_TheFangs for both tracks of the NeurIPS 2025 Mouse vs. AI: Robust Visual Foraging Competition. For Track 1 (Visual Robustness), we demonstrate that architectural simplicity combined with targeted components yields superior generalization, achieving a 95.4% final score with a lightweight two-layer CNN enhanced by Gated Linear Units and observation normalization. For Track 2 (Neural Alignment), we develop a deep ResNet-like architecture with 16 convolutional layers and GLU-based gating that achieves top-1 neural prediction performance with 17.8 million parameters. Our systematic analysis of ten model checkpoints trained for between 60K and 1.14M steps reveals that training duration exhibits a non-monotonic relationship with performance, with optimal results achieved around 200K steps. Through comprehensive ablation studies and failure case analysis, we provide insights into why simpler architectures excel at visual robustness while deeper models with increased capacity achieve better neural alignment. Our results challenge conventional assumptions about model complexity in visuomotor learning and offer practical guidance for developing robust, biologically inspired visual agents.
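
The abstract names two components for the Track 1 model, Gated Linear Units and observation normalization, without giving their exact form. As a rough illustration only, the sketch below shows the standard GLU formulation (the first half of a vector gated by the sigmoid of the second half) and a running mean/variance normalizer of the kind commonly used in reinforcement learning. The names `glu` and `RunningNormalizer`, the half-split layout, and the choice of Welford's algorithm are assumptions for this example and are not taken from the paper.

```python
import math


def glu(x):
    """Standard GLU: gate the first half of x by the sigmoid of the second half."""
    if len(x) % 2 != 0:
        raise ValueError("GLU input length must be even")
    half = len(x) // 2
    values, gates = x[:half], x[half:]
    # v * sigmoid(g) written as v / (1 + exp(-g))
    return [v / (1.0 + math.exp(-g)) for v, g in zip(values, gates)]


class RunningNormalizer:
    """Hypothetical per-feature observation normalizer (Welford's algorithm)."""

    def __init__(self, size, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * size
        self.m2 = [0.0] * size  # running sum of squared deviations
        self.eps = eps

    def update(self, obs):
        """Fold one observation into the running statistics."""
        self.count += 1
        for i, value in enumerate(obs):
            delta = value - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (value - self.mean[i])

    def normalize(self, obs):
        """Shift and scale obs by the running mean and population variance."""
        n = max(self.count, 1)
        # eps keeps the denominator nonzero when the variance is zero
        return [
            (v - m) / math.sqrt(s / n + self.eps)
            for v, m, s in zip(obs, self.mean, self.m2)
        ]
```

In a setup like this, each incoming observation would be passed through `update` during training and through `normalize` before reaching the network, so the policy sees inputs on a roughly constant scale even when brightness or contrast shifts.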
