2602.22683v1 Feb 26, 2026 cs.CV

SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

Zhuohang Jiang (Citations: 124, h-index: 3)

Xu Yuan (Citations: 92, h-index: 5)

Haohao Qu (Citations: 383, h-index: 7)

Shanru Lin (Citations: 126, h-index: 3)

Kanglong Liu (Citations: 20, h-index: 2)

Wenqi Fan (Citations: 71, h-index: 3)

Qing Li (Citations: 66, h-index: 3)

Original Abstract

The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.
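The abstract names three stages for the agent (object detection, query decoupling, and web search) and stresses that identifying the object of interest must come before retrieval. A minimal Python sketch of that ordering follows. It is not the authors' implementation: every name here (`Query`, `Evidence`, `answer_with_retrieval`, and the `toy_*` stand-ins) is hypothetical, the stages are passed in as plain callables, and the search step is reduced to text-only queries for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Query:
    image_id: str   # hypothetical identifier of an egocentric frame
    question: str   # the wearer's question about that frame


@dataclass
class Evidence:
    source: str     # where the snippet came from
    text: str       # retrieved snippet


def answer_with_retrieval(
    query: Query,
    detect_object: Callable[[str, str], str],
    decouple_query: Callable[[str, str], List[str]],
    web_search: Callable[[str], List[Evidence]],
    generate: Callable[[str, List[Evidence]], str],
) -> str:
    """Run the stages in the order the abstract describes (assumed interfaces)."""
    # 1. Identify the object of interest before any retrieval happens.
    obj = detect_object(query.image_id, query.question)
    # 2. Rewrite the question into sub-queries grounded in that object.
    sub_queries = decouple_query(query.question, obj)
    # 3. Collect evidence for every sub-query.
    evidence: List[Evidence] = []
    for sub_query in sub_queries:
        evidence.extend(web_search(sub_query))
    # 4. Generate the answer conditioned on the retrieved evidence.
    return generate(query.question, evidence)


# Toy stand-ins so the sketch runs without any model or network access.
def toy_detect(image_id: str, question: str) -> str:
    return "espresso machine"


def toy_decouple(question: str, obj: str) -> List[str]:
    return [f"{obj} descaling instructions"]


def toy_search(sub_query: str) -> List[Evidence]:
    return [Evidence(source="stub", text=f"result for: {sub_query}")]


def toy_generate(question: str, evidence: List[Evidence]) -> str:
    return f"{len(evidence)} evidence item(s) used"


if __name__ == "__main__":
    q = Query(image_id="frame_001", question="How do I descale this?")
    print(answer_with_retrieval(q, toy_detect, toy_decouple, toy_search, toy_generate))
```

Passing the stages as callables keeps the sketch independent of any particular detector, search backend, or VLM; real components could be swapped in without changing the control flow.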

2 Citations

0 Influential

3.5 Altmetric

19.5 Score
