2604.06934v1 Apr 08, 2026 cs.CV

Multi-modal user interface control detection using cross-attention

Ke Yan (citations: 37, h-index: 3)
M. Moradi (citations: 1, h-index: 1)
David Colwell (citations: 39, h-index: 3)
Rhona Asgari (citations: 121, h-index: 6)
M. Samwald (citations: 70, h-index: 5)

Abstract

Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies, i.e. element-wise addition, weighted sum, and convolutional fusion, demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieved the strongest performance, with significant gains in detecting semantically complex or visually ambiguous classes. These results establish that combining visual and textual modalities can substantially enhance UI element detection, particularly in edge cases where visual information alone is insufficient. Our findings open promising opportunities for more reliable and intelligent tools in software testing, accessibility support, and UI analytics, setting the stage for future research on efficient, robust, and generalizable multi-modal detection systems.
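The abstract names the mechanism (cross-attention between visual features and text embeddings) and three fusion strategies, but gives no formulas. The sketch below is one plausible reading, not the authors' implementation. It assumes visual features are flattened to one vector per spatial position, text embeddings share the same dimension, attention uses no learned projections, and "convolutional fusion" means a 1x1 convolution over channel-concatenated features (equivalent to a per-position linear map). All function names, the `alpha` parameter, and the `weight` matrix are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(visual, text):
    """Scaled dot-product cross-attention.

    visual: N vectors of dim d (queries, one per spatial position).
    text:   T vectors of dim d (keys and values, one per text token).
    Assumption: identity projections instead of learned Q/K/V weights.
    """
    d = len(visual[0])
    attended = []
    for q in visual:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in text]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, text)) for j in range(d)])
    return attended


def fuse_add(visual, attended):
    """Element-wise addition of visual and text-attended features."""
    return [[a + b for a, b in zip(v, t)] for v, t in zip(visual, attended)]


def fuse_weighted(visual, attended, alpha=0.5):
    """Weighted sum; alpha is a hypothetical mixing weight (fixed here, could be learned)."""
    return [[alpha * a + (1 - alpha) * b for a, b in zip(v, t)]
            for v, t in zip(visual, attended)]


def fuse_conv1x1(visual, attended, weight):
    """Assumed form of convolutional fusion: concatenate channels, then apply
    a 1x1 convolution, i.e. a d x 2d matrix applied at each position."""
    fused = []
    for v, t in zip(visual, attended):
        concat = v + t  # list concatenation: 2d channels
        fused.append([sum(w * x for w, x in zip(row, concat)) for row in weight])
    return fused


# Toy input: two spatial positions, two text tokens, feature dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
attended = cross_attention(visual, text)
added = fuse_add(visual, attended)
mixed = fuse_weighted(visual, attended, alpha=0.7)
# This weight matrix makes the 1x1 convolution reproduce element-wise addition.
conv = fuse_conv1x1(visual, attended, [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
```

Under this reading, element-wise addition and the weighted sum are special cases of the 1x1 convolution with fixed weights. That would be consistent with the reported ranking, where convolutional fusion performs best, though the abstract does not state this as the reason.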
