2605.07141v1 May 08, 2026 cs.CV

Qwen3-VL-Seg: Improving Open-World Segmentation Performance through Vision-Language-Based Referring Segmentation

Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

Yuanzhi Yao (Citations: 1, h-index: 1)
Qiushi Yang (Citations: 6, h-index: 1)
Humen Zhong (Citations: 6,336, h-index: 6)
Jiangning Wei (Citations: 45, h-index: 5)
Yifang Men (Citations: 530, h-index: 9)
Shuai Bai (Citations: 4, h-index: 1)
Miaomiao Cui (Citations: 99, h-index: 4)
Zhibo Yang (Citations: 8,292, h-index: 26)

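Open-world referring segmentation requires mapping unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) show strong open-world visual grounding, but their outputs are mostly limited to sparse bounding-box coordinates, which are insufficient for dense visual prediction. Recent MLLM-based segmentation methods either predict sparse contour coordinates directly and struggle to reconstruct continuous object boundaries, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), incurring substantial architectural and deployment overhead. This paper proposes Qwen3-VL-Seg, an efficient framework that uses the box predicted by the MLLM as a semantically grounded structural prior and converts it into pixel-level referring segmentation. At its core is a lightweight box-guided mask decoder that combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement; it adds 17M parameters, only about 0.4% of the base model. For scalable open-world training, the authors build SA1B-ORS, a dataset derived from SA-1B with two parts: SA1B-CoRS (category-oriented samples) and SA1B-DeRS (descriptive, instance-specific samples). For evaluation, they build ORS-Bench, a manually screened benchmark with in-distribution and out-of-distribution subsets covering diverse referring expression types. Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs strongly in both closed-set and open-world settings, with particular advantages on language-intensive instructions and strong generalization. Evaluations on general multimodal benchmarks show that the model largely retains its general multimodal capabilities after segmentation-oriented adaptation.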

Original Abstract

Open-world referring segmentation requires grounding unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) exhibit strong open-world visual grounding, but their outputs remain limited to sparse bounding-box coordinates and are insufficient for dense visual prediction. Recent MLLM-based segmentation methods either directly predict sparse contour coordinates, struggling to reconstruct continuous object boundaries, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), introducing substantial architectural and deployment overhead. We present Qwen3-VL-Seg, a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior and decodes it into pixel-level referring segmentation. At its core, a lightweight box-guided mask decoder combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement, introducing only 17M parameters (about 0.4% of the base model). For scalable open-world training, we construct SA1B-ORS, an SA-1B-derived dataset with two subsets: SA1B-CoRS (category-oriented samples) and SA1B-DeRS (descriptive, instance-specific samples). For evaluation, we curate ORS-Bench, a manually screened benchmark with in-distribution and out-of-distribution subsets covering diverse referring expression types. Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs strongly across closed-set and open-world settings, with clear advantages on language-intensive instructions and strong out-of-distribution generalization. Evaluations on general multimodal benchmarks further show that the model broadly preserves general-purpose multimodal competence after segmentation-oriented adaptation.
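To make the "box as a structural prior" idea concrete, the toy sketch below rasterizes a bounding box into a binary mask and measures its overlap with a non-rectangular target. It is not the paper's decoder: the function names `box_to_mask` and `mask_iou`, the 8x8 grid, and the triangular target are all illustrative assumptions. It only shows why a box alone is a coarse estimate that a mask decoder would still need to refine.

```python
def box_to_mask(box, height, width):
    """Rasterize an (x0, y0, x1, y1) box into a binary mask.

    Coordinates are in pixels; x1 and y1 are exclusive (an assumed convention).
    """
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def mask_iou(a, b):
    """Intersection-over-union of two binary masks of equal shape."""
    inter = sum(pa & pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    union = sum(pa | pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return inter / union if union else 0.0


# Hypothetical target: a right triangle whose tight bounding box is (2, 2, 6, 6).
target = [
    [1 if (2 <= y < 6 and 2 <= x <= y) else 0 for x in range(8)]
    for y in range(8)
]

# The box prior covers the whole 4x4 region, including background pixels.
prior = box_to_mask((2, 2, 6, 6), 8, 8)

print(mask_iou(prior, target))  # 0.625: the box contains the object but overshoots it
```

Even with a perfectly tight box, the overlap here is well below 1, which is the gap that pixel-level decoding is meant to close.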

0 Citations

0 Influential

13 Altmetric

65.0 Score
