2603.01104v1 Mar 01, 2026 cs.HC

Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI

Yukai Huang (citations: 3, h-index: 1)

Weitong Cai (citations: 92, h-index: 3)

Fengyi Fang (citations: 15, h-index: 2)

Youquan He (citations: 1, h-index: 1)

Jiankang Deng (citations: 31, h-index: 2)

Hang Zhang (citations: 165, h-index: 5)

Jifei Song (citations: 43, h-index: 2)

Yi Xie (citations: 17, h-index: 2)

Sicheng Yang (citations: 3, h-index: 1)

Shitong Sun (citations: 3, h-index: 1)

Zhensong Zhang (citations: 7, h-index: 2)

Abstract

What if accessing the web did not require a screen, a stable desk, or even free hands? For people navigating crowded cities, living with low vision, or experiencing cognitive overload, smart glasses coupled with AI agents could turn the web into an always-on assistive layer over daily life. We present Egocentric Co-Pilot, a web-native neuro-symbolic framework that runs on smart glasses and uses a Large Language Model (LLM) to orchestrate a toolbox of perception, reasoning, and web tools. An egocentric reasoning core combines Temporal Chain-of-Thought with Hierarchical Context Compression to support long-horizon question answering and decision support over continuous first-person video, far beyond a single model's context window. Additionally, a lightweight multimodal intent layer maps noisy speech and gaze into structured commands. We further implement and evaluate a cloud-native WebRTC pipeline integrating streaming speech, video, and control messages into a unified channel for smart glasses and browsers. In parallel, we deploy an on-premise WebSocket baseline, exposing concrete trade-offs between local inference and cloud offloading in terms of latency, mobility, and resource use. Experiments on Egolife and HD-EPIC demonstrate competitive or state-of-the-art egocentric QA performance, and a human-in-the-loop study on smart glasses shows higher task completion and user satisfaction than leading commercial baselines. Taken together, these results indicate that web-connected egocentric co-pilots can be a practical path toward more accessible, context-aware assistance in everyday life. By grounding operation in web-native communication primitives and modular, auditable tool use, Egocentric Co-Pilot offers a concrete blueprint for assistive, always-on web agents that support education, accessibility, and social inclusion for people who may benefit most from contextual, egocentric AI.
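The abstract does not spell out how Hierarchical Context Compression works, so the following is only a rough sketch of one way such a scheme could keep a long first-person video log within a fixed budget: recent segments stay verbatim while older ones are repeatedly merged into coarser summaries. Every name here (`Segment`, `compress`, `truncate_summarizer`, and the `budget`, `keep_recent`, and `group` parameters) is hypothetical and not taken from the paper, and the summarizer is a truncation placeholder standing in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    start: float    # seconds since recording began
    end: float
    text: str       # caption or transcript for this span of video
    level: int = 0  # 0 = raw caption; higher = more heavily summarized


def truncate_summarizer(texts: List[str], max_chars: int = 80) -> str:
    """Hypothetical stand-in for an LLM summarization call."""
    joined = " | ".join(texts)
    if len(joined) <= max_chars:
        return joined
    return joined[: max_chars - 3] + "..."


def compress(
    segments: List[Segment],
    budget: int,
    keep_recent: int = 2,
    group: int = 2,
    summarize: Callable[[List[str]], str] = truncate_summarizer,
) -> List[Segment]:
    """Merge the oldest segments into coarser summaries until at most
    `budget` segments remain, never touching the `keep_recent` newest."""
    if group < 2:
        raise ValueError("group must be at least 2 so each merge shrinks the log")
    segs = list(segments)
    while len(segs) > budget and len(segs) - keep_recent >= group:
        oldest = segs[:group]
        merged = Segment(
            start=oldest[0].start,
            end=oldest[-1].end,
            text=summarize([s.text for s in oldest]),
            level=max(s.level for s in oldest) + 1,
        )
        segs = [merged] + segs[group:]
    return segs
```

Calling `compress(log, budget=3)` on six one-second segments would leave one heavily summarized segment covering the oldest four seconds plus the two newest segments untouched. In this sketch older material is compressed more aggressively, which is one plausible reading of "hierarchical"; the paper's actual design may differ.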

0 Citations

0 Influential

2.5 Altmetric

12.5 Score
