2604.01985v1 Apr 02, 2026 cs.LG

World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry

Yuejiang Liu (Citations: 74, h-index: 4)
Chelsea Finn (Citations: 77, h-index: 5)
Yilun Du (Citations: 100, h-index: 3)
Jinzhou Tang (Citations: 2, h-index: 1)
Fan Feng (Citations: 12, h-index: 3)
Lingjing Kong (Citations: 41, h-index: 4)
Kun Zhang (Citations: 27, h-index: 2)
Kevin P. Murphy (Citations: 28, h-index: 3)
Weifeng Lu (Citations: 26, h-index: 2)

Abstract

General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors -- state plausibility and action reachability -- and verify each separately. We show that these verification problems can be substantially easier than predicting future states due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.
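The cycle-consistency check described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 2-D grid dynamics, the deliberately flawed forward model, and every function name below (`flawed_world_model`, `subgoal_generator`, `sparse_inverse_model`, `flag_unreliable_actions`) are hypothetical stand-ins chosen for illustration. The sketch assumes that only the first two state features (position) are action-relevant, and that a subgoal generator trained on action-free data can propose plausible next states.

```python
import numpy as np

# Assumed toy setup: 4 discrete moves; only the first two state features depend on the action.
ACTIONS = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]])  # up, down, right, left
POS = slice(0, 2)  # action-relevant subset of the state (an assumption of this sketch)

def true_step(state, action_idx):
    """Ground-truth dynamics: the position moves, the other features stay fixed."""
    nxt = state.copy()
    nxt[POS] += ACTIONS[action_idx]
    return nxt

def flawed_world_model(state, action_idx):
    """Hypothetical learned forward model that is wrong for the under-covered 'left' action."""
    if action_idx == 3:
        return state.copy()  # incorrectly predicts no movement
    return true_step(state, action_idx)

def subgoal_generator(state, rng):
    """Stand-in for a generator learned from action-free video: proposes a plausible next state."""
    return true_step(state, rng.integers(len(ACTIONS)))

def sparse_inverse_model(state, subgoal):
    """Infers the action from the action-relevant features only (position change)."""
    delta = subgoal[POS] - state[POS]
    return int(np.argmin(np.linalg.norm(ACTIONS - delta, axis=1)))

def cycle_error(state, subgoal, world_model):
    """Subgoal -> inferred action -> forward rollout; returns the action and its mismatch."""
    action_idx = sparse_inverse_model(state, subgoal)
    predicted = world_model(state, action_idx)
    return action_idx, float(np.linalg.norm(predicted[POS] - subgoal[POS]))

def flag_unreliable_actions(world_model, n_samples=200, tol=1e-6, seed=0):
    """Collects the actions whose forward rollout disagrees with the generated subgoal."""
    rng = np.random.default_rng(seed)
    flagged = set()
    for _ in range(n_samples):
        state = rng.normal(size=6)  # 2 position features + 4 distractor features
        subgoal = subgoal_generator(state, rng)
        action_idx, err = cycle_error(state, subgoal, world_model)
        if err > tol:
            flagged.add(action_idx)
    return flagged
```

In this toy setting, calling `flag_unreliable_actions(flawed_world_model)` singles out the action the forward model mispredicts, without needing any action-labeled data for that action. In a self-improvement loop, the flagged transitions would be the natural candidates for additional data collection or retraining, though how WAV does this in practice is specified in the paper rather than here.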

0 Citations

0 Influential

2.5 Altmetric

12.5 Score
