2601.09147v2 Jan 14, 2026 cs.CV

SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection

Han Fang (citations: 489, h-index: 3)
Hao Sun (citations: 40, h-index: 3)
Wenbo Wei (citations: 16, h-index: 2)
Chenhao Fu (citations: 96, h-index: 4)
Xiuzheng Zheng (citations: 0, h-index: 0)
Yonghua Li (citations: 3, h-index: 1)
Xuelong Li (citations: 41, h-index: 3)

Abstract

Zero-Shot Anomaly Detection (ZSAD) leverages Vision-Language Models (VLMs) to enable supervision-free industrial inspection. However, existing ZSAD paradigms are constrained by single visual backbones, which struggle to balance global semantic generalization with fine-grained structural discriminability. To bridge this gap, we propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to improve the model's fine-grained perception. Specifically, SSVP introduces the Hierarchical Semantic-Visual Synergy (HSVS) mechanism, which deeply integrates DINOv3's multi-scale structural priors into the CLIP semantic space. Subsequently, the Vision-Conditioned Prompt Generator (VCPG) employs cross-modal attention to guide dynamic prompt generation, enabling linguistic queries to anchor precisely to specific anomaly patterns. Furthermore, to address the discrepancy between global scoring and local evidence, the Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm. Extensive evaluations on seven industrial benchmarks validate the robustness of our method; SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
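The abstract names three ingredients that a small example can make concrete: cross-modal attention that conditions text prompts on visual tokens, a patch-level anomaly map scored against "normal" and "abnormal" text embeddings, and the AUROC metric used for reporting. The NumPy sketch below illustrates those generic ideas only. It is not the authors' implementation of HSVS, VCPG, or VTAM. Random arrays stand in for CLIP and DINOv3 features, the function names (`condition_prompts`, `anomaly_map`, `auroc`) are hypothetical, learned projection layers are omitted, and the temperature of 0.07 is an assumed value.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def condition_prompts(prompts, patches):
    """Cross-modal attention: prompt tokens (queries) attend to patch tokens.

    prompts: (P, D) text-side tokens; patches: (N, D) visual tokens.
    Returns (P, D) prompts with a residual visual summary added.
    Learned query/key/value projections are omitted (an assumption of this sketch).
    """
    dim = prompts.shape[-1]
    attn = softmax(prompts @ patches.T / np.sqrt(dim), axis=-1)  # (P, N)
    return prompts + attn @ patches


def anomaly_map(patches, normal_emb, abnormal_emb, temperature=0.07):
    """Per-patch probability assigned to the 'abnormal' prompt, in [0, 1]."""
    text = l2_normalize(np.stack([normal_emb, abnormal_emb]))  # (2, D)
    logits = l2_normalize(patches) @ text.T / temperature      # (N, 2)
    return softmax(logits, axis=-1)[:, 1]


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Requires at least one label of each class.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(16, 8))  # stand-in for fused patch features
    prompts = rng.normal(size=(2, 8))   # stand-in for [normal, abnormal] prompts
    normal_emb, abnormal_emb = condition_prompts(prompts, patches)
    pixel_scores = anomaly_map(patches, normal_emb, abnormal_emb)
    image_score = pixel_scores.max()    # one simple image-level aggregate
    print(pixel_scores.shape, float(image_score))
```

Taking the maximum over the patch scores is only one simple way to derive an image-level score. The abstract indicates that SSVP instead calibrates global scoring against local evidence through VTAM, which this sketch does not attempt to reproduce. Image-AUROC would apply `auroc` to one score per image, while Pixel-AUROC would apply it to the flattened per-pixel scores and ground-truth masks.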

