2603.09827v1 Mar 10, 2026 cs.CV


MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

S. Hwang

Citations: 92

h-index: 3

Kangsan Kim

KAIST

Citations: 123

h-index: 5

Yanlai Yang

Citations: 107

h-index: 5

Suji Kim

Citations: 63

h-index: 3

Woongyeong Yeo

KAIST

Citations: 407

h-index: 5

Youngwan Lee

Citations: 37

h-index: 4

Mengye Ren

Citations: 50

h-index: 3

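As AI models advance, humans will collaborate with multiple autonomous agents at work and at home. For smooth communication between human users and a multi-agent system, it is important to interpret the information arriving from each agent in parallel and to refer to the context appropriate to each question. Existing challenges are efficiently compressing and transmitting the large volume of individual sensory data provided as video, and accurately integrating multiple egocentric videos to build system-level memory. This work defines a new problem: understanding long-horizon egocentric videos collected simultaneously from multiple autonomous agents. To promote research on it, the work proposes a benchmark called MultiAgent-EgoQA (MA-EgoQA). MA-EgoQA contains 1.7k questions specific to multiple egocentric video streams, across five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. The work also proposes EgoMAS, a simple baseline model that uses shared memory and agent-wise dynamic retrieval. A comprehensive evaluation of diverse baselines and EgoMAS on MA-EgoQA shows that current approaches struggle to process multiple egocentric video streams effectively, which underlines the need for further progress in system-level understanding across agents. The code and benchmark are available at https://ma-egoqa.github.io.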

Original Abstract

As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systematically evaluate existing models in our scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent-wise dynamic retrieval. Through comprehensive evaluation across diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across the agents. The code and benchmark are available at https://ma-egoqa.github.io.
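
The abstract describes EgoMAS only as combining shared memory across agents with agent-wise dynamic retrieval. As a rough illustration of that general idea, and not the authors' implementation, the toy sketch below stores text captions from several agents in one shared structure, picks the agents most relevant to a query, and then retrieves from each chosen agent separately. Every name here (`Event`, `SharedMemory`, `retrieve`, the word-overlap score, the `top_agents` and `per_agent` parameters) is a hypothetical stand-in; a real system would presumably work on video features or learned embeddings rather than word overlap.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    agent: str    # hypothetical agent identifier
    time: int     # hypothetical timestamp, in seconds
    caption: str  # text description standing in for a video clip


def overlap(query: str, text: str) -> int:
    """Count distinct lowercase words shared by the query and the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


class SharedMemory:
    """Toy shared memory: one store holding every agent's events."""

    def __init__(self) -> None:
        self.by_agent: dict[str, list[Event]] = defaultdict(list)

    def add(self, event: Event) -> None:
        self.by_agent[event.agent].append(event)

    def retrieve(self, query: str, top_agents: int = 2,
                 per_agent: int = 1) -> list[Event]:
        # Step 1 (system level): score each agent by its best-matching event
        # and keep the highest-scoring agents (ties broken by agent name).
        scores = {
            agent: max(overlap(query, e.caption) for e in events)
            for agent, events in self.by_agent.items()
        }
        chosen = sorted(scores, key=lambda a: (-scores[a], a))[:top_agents]

        # Step 2 (agent level): retrieve separately from each chosen agent.
        hits: list[Event] = []
        for agent in chosen:
            ranked = sorted(
                self.by_agent[agent],
                key=lambda e: (-overlap(query, e.caption), e.time),
            )
            hits.extend(
                e for e in ranked[:per_agent] if overlap(query, e.caption) > 0
            )
        # Return the evidence in chronological order across agents.
        return sorted(hits, key=lambda e: e.time)


memory = SharedMemory()
memory.add(Event("A", 10, "puts the red cup on the table"))
memory.add(Event("A", 50, "opens the fridge"))
memory.add(Event("B", 20, "moves the red cup to the sink"))
memory.add(Event("C", 30, "waters the plants"))

hits = memory.retrieve("where is the red cup")
for event in hits:
    print(event.agent, event.time, event.caption)
```

The two-step split is the only design point the sketch tries to convey: selecting agents first keeps the per-query search from scanning every stream, and retrieving per agent keeps one agent's many matching clips from crowding out another's.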

5 Citations

2 Influential

2.5 Altmetric

21.5 Score
