2602.22897v1 Feb 26, 2026 cs.AI

OmniGAIA: Towards Native Omni-Modal AI Agents

Guanting Dong

Citations: 972

h-index: 11

Zhicheng Dou

Citations: 2,131

h-index: 24

Wenxiang Jiao

Citations: 61

h-index: 3

Jiajie Jin

Citations: 1,433

h-index: 13

Xiaoxi Li

Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China

Citations: 1,150

h-index: 11

Shijian Wang

Citations: 59

h-index: 3

Hao Wang

Citations: 742

h-index: 4

Yinuo Wang

Citations: 11

h-index: 2

Ji-Rong Wen

Citations: 1,867

h-index: 15

Jiarui Jin

Citations: 30

h-index: 4

Yuan Lu

Citations: 0

h-index: 0

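Human intelligence integrates diverse sensory information such as vision, audio, and language, and interacts with the world through complex reasoning and tool use. However, current multi-modal LLMs are largely confined to bi-modal interactions (e.g., vision-language) and lack the unified cognitive capabilities that a general AI assistant requires. To close this gap, we introduce OmniGAIA, a comprehensive benchmark for evaluating omni-modal agents on tasks that require deep reasoning and multi-turn tool execution across video, audio, and image modalities. OmniGAIA uses a novel omni-modal event graph approach to synthesize complex, multi-hop queries from real-world data; these queries demand cross-modal reasoning and external tool integration. We also propose OmniAtlas, a native omni-modal foundation agent built on a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized through a hindsight-guided tree exploration strategy, with OmniDPO for fine-grained error correction, OmniAtlas effectively improves the tool-use capabilities of existing open-source models. This work is a step towards next-generation native omni-modal AI assistants for real-world scenarios.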

Original Abstract

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.

5 Citations

1 Influential

12 Altmetric

67.0 Score
