2606.09646v1 Jun 08, 2026 cs.CV

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

Mohammadreza Salehi
Samuele Punzo
N. Caselli
Ippokratis Pantelidis
Francesco Massafra
Salvatore Sardo

We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.
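As a rough illustration of the probing setup summarized above, the following toy sketch fits a linear probe on frozen per-frame features at two depths and applies a frame-shuffling control. Everything in it is an assumption made for illustration: the features are synthetic NumPy arrays rather than V-JEPA, VideoMAE, or LTX-Video activations, the ridge-regression probe and the helper names (`toy_layer_features`, `temporal_readout`, `pooled_readout`, `shuffle_frames`) are hypothetical, and the resulting accuracies say nothing about the paper's reported numbers.

```python
import numpy as np


def toy_layer_features(labels, num_frames, dim, rng, signal_strength):
    """Synthetic stand-in for one layer's frozen features: (videos, frames, dim)."""
    feats = 0.1 * rng.standard_normal((labels.size, num_frames, dim))
    direction = np.where(labels == 1, 1.0, -1.0)
    # Zero-mean ramp over time: class 1 ramps up, class 0 ramps down, so the
    # time-average is the same for both classes and only frame order differs.
    ramp = np.arange(num_frames) - (num_frames - 1) / 2
    feats[:, :, 0] += signal_strength * direction[:, None] * ramp[None, :]
    return feats


def temporal_readout(feats):
    """Concatenate frames so the probe can use frame order."""
    return feats.reshape(feats.shape[0], -1)


def pooled_readout(feats):
    """Average over frames; this readout is invariant to frame order."""
    return feats.mean(axis=1)


def shuffle_frames(feats, rng):
    """Temporal control: independently permute the frame order of each video."""
    perms = np.argsort(rng.random(feats.shape[:2]), axis=1)
    return np.take_along_axis(feats, perms[:, :, None], axis=1)


def fit_ridge_probe(x, y, l2=1e-3):
    """Closed-form ridge regression onto +/-1 targets, with a bias column."""
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    gram = xb.T @ xb + l2 * np.eye(xb.shape[1])
    return np.linalg.solve(gram, xb.T @ (2.0 * y - 1.0))


def probe_accuracy(w, x, y):
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    return float(np.mean((xb @ w > 0).astype(int) == y))


def run_probe(train_feats, y_train, test_feats, y_test, readout):
    w = fit_ridge_probe(readout(train_feats), y_train)
    return probe_accuracy(w, readout(test_feats), y_test)


rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=200)
y_test = rng.integers(0, 2, size=200)

# Two hypothetical depths: "early" carries no class signal, "late" does.
results = {}
for name, strength in {"early": 0.0, "late": 1.0}.items():
    train = toy_layer_features(y_train, 8, 4, rng, strength)
    test = toy_layer_features(y_test, 8, 4, rng, strength)
    results[name] = {
        "temporal": run_probe(train, y_train, test, y_test, temporal_readout),
        "pooled": run_probe(train, y_train, test, y_test, pooled_readout),
        "temporal_shuffled": run_probe(
            train, y_train, shuffle_frames(test, rng), y_test, temporal_readout
        ),
    }

for name, accs in results.items():
    print(name, accs)
```

In this toy construction the class signal lives only in the ordering of frames, so a mean-pooled readout cannot recover it, while a readout over concatenated frames can and loses it again once test frames are shuffled. Real frozen features would replace `toy_layer_features`, and a stronger probe (for example an attentive probe) could replace the ridge fit.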


