2602.09856v1 Feb 10, 2026 cs.CV

Code2World: A GUI World Model via Renderable Code Generation

Yuhao Zheng

Liangjun Zhong

Yi Wang

Rui Dai

Kaikui Liu

Xiangxiang Chu

Linyuan Lv

Philip Torr

Kevin Qinghong Lin

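Autonomous GUI agents interact with their environment by perceiving interfaces and executing actions. A GUI world model acts as a virtual sandbox: by predicting the outcome of an action before it is taken, it gives agents human-like foresight. Existing text-based and pixel-based approaches, however, struggle to achieve high visual fidelity and fine-grained structural control at the same time. This work proposes Code2World, a vision-language coder that simulates the next visual state by generating renderable code. To address data scarcity, the authors build AndroidCode, a dataset of more than 80,000 high-quality screen-action pairs, by converting GUI trajectories into high-fidelity HTML and refining the generated code through a visual-feedback revision mechanism. To adapt existing vision-language models (VLMs) to code prediction, they first apply supervised fine-tuning (SFT) as a cold start for layout-format following, then apply Render-Aware Reinforcement Learning, which uses the rendered outcome as the reward signal to enforce visual semantic fidelity and action consistency. In extensive experiments, Code2World-8B achieves the best next-UI prediction performance, rivaling GPT-5 and Gemini-3-Pro-Image. Code2World also improves downstream navigation success in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. Code and related materials are available at https://github.com/AMAP-ML/Code2World.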

Original Abstract

Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render-Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at https://github.com/AMAP-ML/Code2World.
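
The abstract describes a reward computed from the rendered output of generated code, combining visual semantic fidelity with action consistency, but it does not give a formula. The following is a minimal Python sketch of one way such a reward could be structured, not the authors' implementation. Every name here (`RenderAwareReward`, `toy_render`, `toy_visual`, `toy_action`), the equal weights, and the rule that unrenderable code earns zero reward are assumptions made for illustration. The toy scorers are simple string heuristics standing in for whatever renderer and judges a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical interfaces; the abstract does not specify any of these.
Renderer = Callable[[str], Optional[str]]  # HTML -> rendered screen, None on failure
Scorer = Callable[[str, str], float]       # (rendered, reference) -> score in [0, 1]


@dataclass
class RenderAwareReward:
    render: Renderer
    visual_fidelity: Scorer     # rendered prediction vs. ground-truth next screen
    action_consistency: Scorer  # rendered prediction vs. the executed action
    w_visual: float = 0.5       # assumed weight, not taken from the paper
    w_action: float = 0.5       # assumed weight, not taken from the paper

    def __call__(self, html: str, target_screen: str, action: str) -> float:
        rendered = self.render(html)
        if rendered is None:
            # Assumption: code that fails to render earns no reward.
            return 0.0
        visual = self.visual_fidelity(rendered, target_screen)
        consistent = self.action_consistency(rendered, action)
        return self.w_visual * visual + self.w_action * consistent


def toy_render(html: str) -> Optional[str]:
    # Stand-in for a real renderer: reject unbalanced <div> tags, else strip them.
    if html.count("<div>") != html.count("</div>"):
        return None
    return html.replace("<div>", "").replace("</div>", "").strip()


def toy_visual(rendered: str, target: str) -> float:
    # Jaccard overlap of whitespace-separated tokens as a crude fidelity proxy.
    a, b = set(rendered.split()), set(target.split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


def toy_action(rendered: str, action: str) -> float:
    # Actions look like "tap:Settings"; check that the tapped label is on screen.
    return 1.0 if action.split(":")[-1] in rendered else 0.0


reward = RenderAwareReward(toy_render, toy_visual, toy_action)
print(reward("<div>Settings Wi-Fi</div>", "Settings Wi-Fi", "tap:Settings"))
print(reward("<div>Settings", "Settings Wi-Fi", "tap:Settings"))
```

In this sketch the renderer sits inside the reward, so a policy is scored on what its code looks like once rendered rather than on the code text itself. A realistic setup would replace the toy functions with a headless browser and learned or model-based judges.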
