2604.10397v1 Apr 12, 2026 cs.CV

Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation

Kunyu Peng

Citations: 1,640

h-index: 20

Di Wen

Citations: 94

h-index: 5

Yufan Chen

Citations: 167

h-index: 7

Junwei Zheng

Karlsruhe Institute of Technology

Citations: 364

h-index: 11

Ruiping Liu

Citations: 188

h-index: 7

Jiale Wei

Citations: 26

h-index: 3

Yu Luo

Citations: 1

h-index: 1

Rainer Stiefelhagen

Citations: 0

h-index: 0

Abstract

Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can temporally misalign nominal future labels with actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. Our results highlight that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. Benchmark and code will be publicly available.
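As a rough illustration of the phrase "residual transitions from current pair states", the sketch below scores future interactions for a single human-object pair as the present interaction logits plus a horizon-specific residual. Everything in it is an assumption made for illustration: the names `linear` and `anticipate`, the per-horizon weight matrices, the toy dimensions, and the sigmoid scoring are hypothetical and are not taken from HOI-DA, whose architecture the abstract does not specify.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights: Matrix, features: List[float]) -> List[float]:
    """Matrix-vector product: one output per row of `weights`."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def anticipate(
    pair_state: List[float],
    detect_w: Matrix,
    residual_w: Dict[int, Matrix],
) -> Dict[int, List[float]]:
    """Return per-class interaction scores keyed by horizon (0 = present).

    Hypothetical formulation: future logits = present logits + residual,
    where the residual is computed from the same pair state.
    """
    present_logits = linear(detect_w, pair_state)
    scores = {0: [sigmoid(z) for z in present_logits]}
    for horizon, weights in residual_w.items():
        delta = linear(weights, pair_state)  # change relative to the present
        scores[horizon] = [sigmoid(p + d) for p, d in zip(present_logits, delta)]
    return scores


# Toy example: a 2-d pair state and two interaction classes.
pair_state = [1.0, -0.5]
detect_w = [[2.0, 0.0], [0.0, 2.0]]
residual_w = {
    1: [[0.0, 0.0], [0.0, 0.0]],   # zero residual: future equals present
    3: [[-1.0, 0.0], [1.0, 0.0]],  # class 0 fades, class 1 becomes more likely
}
scores = anticipate(pair_state, detect_w, residual_w)
```

In this toy formulation a zero residual reproduces the present prediction exactly, so the anticipation branch only has to model how the interaction changes. That is one plausible reading of why the abstract frames anticipation as a constraint learned jointly with detection, not a claim about the authors' implementation.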
