2601.16394v1 Jan 23, 2026 cs.CV

ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation

Jusheng Zhang

Keze Wang

Yihao Wang

Ziyi Tang

Meng Yang

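Referring expression segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of objects through free-form language expressions, and it supports important applications such as human-robot interaction and augmented reality. Despite advances in approaches based on multimodal large language models (MLLMs), existing RES methods still have two main limitations. First, the coarse bounding boxes generated by MLLMs lead to redundant or non-discriminative point prompts. Second, the widespread reliance on textual coordinate reasoning is unreliable because it cannot distinguish the target from visually similar distractors. To address these problems, we propose **ResAgent**, a new RES framework that integrates **E**ntropy-**B**ased Point **D**iscovery (**EBD**) and **V**ision-**B**ased **R**easoning (**VBR**). Specifically, EBD models spatial uncertainty within the coarse bounding box and treats point selection as an information-maximization process, thereby identifying high-information candidate points. VBR verifies point correctness through visual-semantic consistency, abandoning text-only coordinate reasoning in favor of more robust verification. Building on these components, ResAgent implements a coarse-to-fine workflow: bounding box initialization, entropy-based point discovery, vision-based verification, and mask decoding. Extensive experiments on four benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg) show that ResAgent achieves new state-of-the-art performance on all benchmarks and is effective at generating accurate, semantically grounded segmentation masks with minimal prompts.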

Original Abstract

Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions, supporting critical applications such as human-robot interaction and augmented reality. Despite the progress of Multimodal Large Language Model (MLLM)-based approaches, existing RES methods still suffer from two key limitations: first, the coarse bounding boxes from MLLMs lead to redundant or non-discriminative point prompts; second, the prevalent reliance on textual coordinate reasoning is unreliable, as it fails to distinguish targets from visually similar distractors. To address these issues, we propose **ResAgent**, a novel RES framework integrating **E**ntropy-**B**ased Point **D**iscovery (**EBD**) and **V**ision-**B**ased **R**easoning (**VBR**). Specifically, EBD identifies high-information candidate points by modeling spatial uncertainty within coarse bounding boxes, treating point selection as an information maximization process. VBR verifies point correctness through joint visual-semantic alignment, abandoning text-only coordinate inference for more robust validation. Built on these components, ResAgent implements a coarse-to-fine workflow: bounding box initialization, entropy-guided point discovery, vision-based validation, and mask decoding. Extensive evaluations on four benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg) demonstrate that ResAgent achieves new state-of-the-art performance across all four benchmarks, highlighting its effectiveness in generating accurate and semantically grounded segmentation masks with minimal prompts.
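
The abstract does not spell out how EBD scores or selects points, so the following is only a minimal sketch of the general idea of entropy-guided point selection inside a coarse bounding box, not the authors' method. It assumes a hypothetical per-pixel foreground probability map (`prob_map`), uses per-pixel binary Shannon entropy as a stand-in uncertainty measure, and applies a greedy minimum-distance rule (`min_dist`) to avoid redundant points. All function names and parameters here are illustrative assumptions.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_points(prob_map, box, k=2, min_dist=2.0):
    """Greedily pick up to k high-entropy pixels inside box, spaced >= min_dist apart.

    prob_map: 2D list of hypothetical foreground probabilities, indexed [y][x].
    box: (x0, y0, x1, y1), inclusive pixel bounds of the coarse bounding box.
    This is an illustrative stand-in, not the EBD procedure from the paper.
    """
    x0, y0, x1, y1 = box
    candidates = [
        (binary_entropy(prob_map[y][x]), x, y)
        for y in range(y0, y1 + 1)
        for x in range(x0, x1 + 1)
    ]
    # Highest entropy first; ties broken by (x, y) so the result is deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    chosen = []
    for _, x, y in candidates:
        # Skip pixels too close to an already chosen point (redundant prompts).
        if all(math.hypot(x - cx, y - cy) >= min_dist for cx, cy in chosen):
            chosen.append((x, y))
        if len(chosen) == k:
            break
    return chosen


# Toy 4x4 probability map (made-up values, for illustration only).
prob_map = [
    [0.0, 0.1, 0.9, 1.0],
    [0.0, 0.5, 0.9, 1.0],
    [0.0, 0.1, 0.6, 1.0],
    [0.0, 0.0, 0.1, 0.5],
]
points = select_points(prob_map, box=(0, 0, 3, 3), k=2, min_dist=2.0)
print(points)
```

In this toy setup the two pixels with probability 0.5 carry the highest entropy and are far enough apart to both be kept. In the workflow described above, such candidate points would then go through vision-based validation before mask decoding; those stages are not sketched here.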
