2606.09646v1 Jun 08, 2026 cs.CV

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

Mohammadreza Salehi

Samuele Punzo

N. Caselli

Ippokratis Pantelidis

Francesco Massafra

Salvatore Sardo

We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.
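
To make the probing setup concrete, the following is a minimal sketch of frozen-feature probing with a frame-order control. It is not the authors' pipeline: the features are synthetic stand-ins for frozen per-frame embeddings, the probe is a simple nearest-centroid classifier, and every name in it (make_clip, run_probe_demo, and so on) is hypothetical. The toy data is built so that the class label depends only on frame order, which lets an order-sensitive probe be compared against a mean-pooled probe and against the same probe evaluated on shuffled frames.

```python
import random


def make_clip(label, rng, n_frames=8, noise=0.1):
    """Synthetic stand-in for frozen per-frame features (2 dims per frame).

    Assumption for illustration: label 1 ramps up over time, label 0 ramps
    down, so the class signal lives only in the frame order.
    """
    ramp = [t / (n_frames - 1) for t in range(n_frames)]
    if label == 0:
        ramp = ramp[::-1]
    return [[v + rng.gauss(0, noise), rng.gauss(0, noise)] for v in ramp]


def mean_pool(clip):
    """Order-invariant readout: average each feature dimension over time."""
    return [sum(frame[d] for frame in clip) / len(clip)
            for d in range(len(clip[0]))]


def flatten(clip):
    """Order-sensitive readout: concatenate frames in temporal order."""
    return [x for frame in clip for x in frame]


def fit_centroids(vectors, labels):
    """Fit a nearest-centroid probe; the feature extractor stays frozen."""
    centroids = {}
    for y in set(labels):
        rows = [v for v, l in zip(vectors, labels) if l == y]
        centroids[y] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, vector):
    """Return the label whose centroid is closest in squared distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(vector, c))
    return min(centroids, key=lambda y: sqdist(centroids[y]))


def accuracy(centroids, vectors, labels):
    hits = sum(predict(centroids, v) == y for v, y in zip(vectors, labels))
    return hits / len(labels)


def shuffle_frames(clip, rng):
    """Temporal control: same frames, random order."""
    frames = list(clip)
    rng.shuffle(frames)
    return frames


def run_probe_demo(seed=0, n_train=200, n_test=200):
    rng = random.Random(seed)
    train_y = [i % 2 for i in range(n_train)]
    test_y = [i % 2 for i in range(n_test)]
    train = [make_clip(y, rng) for y in train_y]
    test = [make_clip(y, rng) for y in test_y]
    shuffled = [shuffle_frames(c, rng) for c in test]

    pooled = fit_centroids([mean_pool(c) for c in train], train_y)
    temporal = fit_centroids([flatten(c) for c in train], train_y)
    return {
        "mean_pooled": accuracy(pooled, [mean_pool(c) for c in test], test_y),
        "temporal": accuracy(temporal, [flatten(c) for c in test], test_y),
        "temporal_shuffled": accuracy(
            temporal, [flatten(c) for c in shuffled], test_y),
    }


if __name__ == "__main__":
    print(run_probe_demo())
```

On this toy data the order-sensitive probe should separate the classes almost perfectly, while the mean-pooled probe and the shuffled-frame evaluation should stay near chance. In a layerwise analysis, the same fit-and-evaluate loop would be repeated on features taken from each layer of the frozen model.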
