2601.13142v1 Jan 19, 2026 cs.CV

TVWorld: Foundations for Remote-Control TV Agents

Zhantao Ma (Citations: 2, h-index: 1)

Quanfeng Lu (Citations: 1,121, h-index: 10)

Shuai Zhong (Citations: 64, h-index: 2)

Dahai Yu (Citations: 3,240, h-index: 2)

Ping Luo (Citations: 411, h-index: 4)

Michael K. Ng (Citations: 0, h-index: 0)

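Recent large vision-language models (LVLMs) have shown strong potential for device control. However, existing work has focused mainly on point-and-click (PnC) interaction, while remote-control (RC) interaction, which is common in everyday TV use, has received comparatively little study. To close this gap, we introduce TVWorld, an offline, graph-based abstraction of real-world TV navigation that enables reproducible, deployment-free evaluation. Building on it, we develop two complementary benchmarks that comprehensively assess TV-use capabilities: TVWorld-N for topology-aware navigation and TVWorld-G for focus-aware grounding. These benchmarks reveal a key limitation of existing agents: they lack sufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a Topology-Aware Training framework that injects topology awareness into LVLMs. Using this framework, we develop TVTheseus, a foundation model specialized for TV navigation. TVTheseus achieves a 68.3% success rate on TVWorld-N, outperforming strong closed-source baselines such as Gemini 3 Flash and reaching state-of-the-art (SOTA) performance. Further analyses offer valuable insights into building effective TV-use agents.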

Original Abstract

Recent large vision-language models (LVLMs) have demonstrated strong potential for device control. However, existing research has primarily focused on point-and-click (PnC) interaction, while remote-control (RC) interaction commonly encountered in everyday TV usage remains largely underexplored. To fill this gap, we introduce TVWorld, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: TVWorld-N for topology-aware navigation and TVWorld-G for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a Topology-Aware Training framework that injects topology awareness into LVLMs. Using this framework, we develop TVTheseus, a foundation model specialized for TV navigation. TVTheseus achieves a success rate of 68.3% on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash and establishing state-of-the-art (SOTA) performance. Additional analyses further provide valuable insights into the development of effective TV-use agents.
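
The abstract describes TV navigation as an offline graph in which an agent moves focus with remote-control key presses rather than clicking a point. The sketch below illustrates that general idea only. The graph contents, state names, key names, and the functions `step`, `rollout`, `shortest_key_sequence`, and `success_rate` are all hypothetical and are not taken from the paper or from TVWorld's actual data format.

```python
from collections import deque

# Hypothetical toy graph (an assumption, not TVWorld data): each node is a
# focus state "screen:focused_item"; each edge is one remote-control key press.
TOY_GRAPH = {
    "home:apps": {"RIGHT": "home:settings", "OK": "apps:grid"},
    "home:settings": {"LEFT": "home:apps", "OK": "settings:menu"},
    "apps:grid": {"BACK": "home:apps", "DOWN": "apps:video"},
    "apps:video": {"UP": "apps:grid", "BACK": "home:apps"},
    "settings:menu": {"BACK": "home:settings"},
}


def step(graph, state, key):
    """Apply one key press; a key with no outgoing edge leaves focus unchanged."""
    return graph[state].get(key, state)


def rollout(graph, start, keys):
    """Replay a key sequence offline and return the final focus state."""
    state = start
    for key in keys:
        state = step(graph, state, key)
    return state


def shortest_key_sequence(graph, start, goal):
    """Breadth-first search over focus states; returns a shortest key list or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for key, nxt in graph[state].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [key]))
    return None


def success_rate(graph, episodes):
    """Fraction of (start, goal, keys) episodes whose rollout ends at the goal."""
    hits = sum(rollout(graph, s, keys) == g for s, g, keys in episodes)
    return hits / len(episodes)
```

Because every transition is a lookup in a stored graph, a key sequence can be replayed without a physical TV, which is one way to read "reproducible and deployment-free evaluation". The breadth-first search gives a reference path against which an agent's key presses could be compared, under the assumptions stated above.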
