2602.12173v1 Feb 12, 2026 cs.AI

SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation

Chengxi Zeng (Citations: 33; h-index: 3)
Yuxuan Jiang (Citations: 97; h-index: 6)
Ge Gao (Citations: 166; h-index: 5)
Shuai Wang (Citations: 100; h-index: 5)
Duolikun Danier (Citations: 299; h-index: 9)
Bin Zhu (Citations: 2,483; h-index: 10)
Stevan Rudinac (Citations: 1,068; h-index: 18)
David R. Bull (Citations: 632; h-index: 13)
Fan Zhang (Citations: 122; h-index: 7)

Original Abstract

Vision-language segmentation models such as SAM3 enable flexible, prompt-driven visual grounding, but inherit large, general-purpose text encoders originally designed for open-ended language understanding. In practice, segmentation prompts are short, structured, and semantically constrained, so the text encoder is substantially over-provisioned and imposes persistent computational and memory overhead. In this paper, we perform a large-scale anatomical analysis of text prompting in vision-language segmentation, covering 404,796 real prompts across multiple benchmarks. Our analysis reveals severe redundancy: most context windows are underutilized, vocabulary usage is highly sparse, and text embeddings lie on a low-dimensional manifold despite their high-dimensional representation. Motivated by these findings, we propose SAM3-LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student optimized by knowledge distillation. Extensive experiments on image and video segmentation benchmarks show that SAM3-LiteText reduces text encoder parameters by up to 88%, substantially reducing the static memory footprint, while maintaining segmentation performance comparable to the original model. Code: https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext.
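To make the redundancy claims concrete, the following is a minimal sketch of how context-window utilization and vocabulary coverage could be measured over a set of prompts. It is not the authors' analysis code. The function name `prompt_stats`, the whitespace tokenizer, and the default values of `context_len` and `vocab_size` are all illustrative assumptions; a real measurement would use the SAM3 text encoder's own tokenizer and limits.

```python
from collections import Counter


def prompt_stats(prompts, context_len=32, vocab_size=50000):
    """Summarize how much of the context window and vocabulary a prompt set uses.

    Assumptions (not from the paper): whitespace tokenization stands in for
    the real tokenizer, and context_len / vocab_size are placeholder values.
    """
    if not prompts:
        raise ValueError("prompts must be non-empty")

    token_lists = [p.lower().split() for p in prompts]
    # Tokens beyond the context window would be truncated, so cap the length.
    lengths = [min(len(tokens), context_len) for tokens in token_lists]
    # Count distinct tokens that appear anywhere in the prompt set.
    used = Counter(tok for tokens in token_lists for tok in tokens)

    return {
        "mean_len": sum(lengths) / len(lengths),
        "window_utilization": sum(lengths) / (context_len * len(lengths)),
        "vocab_coverage": len(used) / vocab_size,
    }


prompts = ["red car", "person on the left", "dog", "red car"]
stats = prompt_stats(prompts)
print(stats)
```

On short, repetitive prompts like these toy examples, both `window_utilization` and `vocab_coverage` come out far below 1, which is the kind of underuse the abstract describes.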
