2604.27419v1 Apr 30, 2026 cs.AI

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

H. Alinejad-Rokny (Citations: 185, h-index: 8)
Qiyao Wang (Citations: 207, h-index: 3)
Yuan Lin (Citations: 20, h-index: 1)
Longze Chen (Citations: 347, h-index: 10)
Haoran Hu (Citations: 29, h-index: 2)
Hongbo Wang (Citations: 41, h-index: 4)
Min Yang (Citations: 24, h-index: 2)

Original Abstract

With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based, project-level code synthesis. Existing benchmarks rely on idealized assumptions, notably well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and the model's understanding of them, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert, low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations, grounded in requirements engineering defect taxonomies, to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction. We develop an interactive execution environment for agents that features a unified action space comprising Clarify, Implement, Verify, and Submit, enabling iterative intent refinement, code synthesis, and visual feedback-based validation. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.
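
To make the four-action loop named in the abstract concrete, here is a minimal Python sketch. It is not the benchmark's implementation: the `Episode` class, the slot names, the defaults, and the string-matching check are all hypothetical stand-ins chosen for illustration, and the real environment validates against visual feedback rather than substring matches. The `blind` flag imitates an agent that skips Clarify and builds from its own guesses.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    # The four action names come from the abstract; everything else is assumed.
    CLARIFY = "clarify"
    IMPLEMENT = "implement"
    VERIFY = "verify"
    SUBMIT = "submit"


# Hypothetical requirement slots and the guesses a blind agent falls back on.
REQUIRED_SLOTS = ("page_type", "primary_color")
DEFAULTS = {"page_type": "landing", "primary_color": "black"}


@dataclass
class Episode:
    instruction: str                 # vague text from the simulated user
    hidden_intent: dict              # what the user actually wants
    known: dict = field(default_factory=dict)
    code: str = ""
    verified: bool = False
    trace: list = field(default_factory=list)


def choose_action(ep, blind=False):
    """Toy policy: clarify missing slots (unless blind), then build, check, submit."""
    missing = [s for s in REQUIRED_SLOTS if s not in ep.known]
    if missing and not blind:
        return Action.CLARIFY, missing[0]
    if not ep.code:
        return Action.IMPLEMENT, None
    if not ep.verified:
        return Action.VERIFY, None
    return Action.SUBMIT, None


def step(ep, action, arg):
    ep.trace.append(action)
    if action is Action.CLARIFY:
        # The simulated user answers one question from the hidden intent.
        ep.known[arg] = ep.hidden_intent[arg]
    elif action is Action.IMPLEMENT:
        page = ep.known.get("page_type", DEFAULTS["page_type"])
        color = ep.known.get("primary_color", DEFAULTS["primary_color"])
        ep.code = f"<main class='{page}' style='color:{color}'></main>"
    elif action is Action.VERIFY:
        # Stand-in for visual validation: check only what the agent knows.
        ep.verified = all(v in ep.code for v in ep.known.values())


def run(ep, blind=False, max_steps=10):
    for _ in range(max_steps):
        action, arg = choose_action(ep, blind)
        step(ep, action, arg)
        if action is Action.SUBMIT:
            break
    return ep


def satisfies_intent(ep):
    """Ground-truth check the agent never sees directly."""
    return all(v in ep.code for v in ep.hidden_intent.values())


if __name__ == "__main__":
    intent = {"page_type": "portfolio", "primary_color": "teal"}
    for blind in (False, True):
        ep = run(Episode("make me a nice site", dict(intent)), blind=blind)
        print(blind, [a.value for a in ep.trace], satisfies_intent(ep))
```

In this toy setup the blind agent's Verify step passes trivially because it has nothing to check against, which is one way to picture how an agent can submit confidently while missing the user's intent.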
