2603.17729v1 Mar 18, 2026 cs.CV

SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition

Jingxiao Yang

Citations: 2

h-index: 1

Dalin He

Citations: 0

h-index: 0

Miao Pan

Citations: 4

h-index: 1

Ge Su

Citations: 1

h-index: 1

Wenqiao Zhang

Citations: 123

h-index: 4

Yifeng Hu

Citations: 255

h-index: 10

Tang Li

University of Delaware

Citations: 59

h-index: 5

Yuke Li

Citations: 20

h-index: 3

Xuhong Zhang

Citations: 2

h-index: 1

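Recent advances in large vision-language models (LVLMs) have made fine-grained visual recognition (FGVR) possible without training. However, the inherent visual ambiguity of subordinate-level categories still makes it difficult to use LVLMs effectively for FGVR. Existing methods address this mainly through retrieval-based or reasoning-based paradigms, but they face two fundamental limitations. (1) They apply the same inference pipeline to every sample and ignore the uneven difficulty of recognition, so they achieve neither the best accuracy nor the best efficiency. (2) They lack a mechanism for consolidating and reusing error-specific experience, so they fail repeatedly on similar hard cases. To overcome these limitations, this paper proposes SARE, a sample-wise adaptive reasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning and invokes the latter only when necessary. During reasoning, SARE incorporates a self-reflective experience mechanism that draws on past failures to provide transferable discriminative guidance at inference time, without any parameter updates. Extensive experiments on 14 datasets show that SARE achieves state-of-the-art performance while substantially reducing computational cost.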

Original Abstract

Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) They apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, thereby leading to suboptimal accuracy and efficiency; (2) The lack of mechanisms to consolidate and reuse error-specific experience causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
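
The cascaded design described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how SARE decides that reasoning is "necessary" or how experience is stored, so the top-1/top-2 score margin used as the gate, the `ExperienceMemory` class, the `cascade_predict` function, and the `top_k` and `margin` parameters are all hypothetical choices made for illustration. The `reasoner` argument stands in for an LVLM call and is supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Tuple


@dataclass
class ExperienceMemory:
    """Hypothetical store of notes written after past failures (an assumption)."""
    notes: List[Tuple[FrozenSet[str], str]] = field(default_factory=list)

    def add(self, confused_classes: List[str], note: str) -> None:
        # Record which classes were confused and a textual hint about telling them apart.
        self.notes.append((frozenset(confused_classes), note))

    def lookup(self, candidates: List[str]) -> List[str]:
        # Return notes whose confused classes share at least two names with the candidates.
        current = set(candidates)
        return [note for classes, note in self.notes if len(classes & current) >= 2]


def cascade_predict(
    scores: Dict[str, float],
    reasoner: Callable[[List[str], List[str]], str],
    memory: ExperienceMemory,
    top_k: int = 3,
    margin: float = 0.1,
) -> Tuple[str, bool]:
    """Return (predicted class, whether the reasoning stage was invoked).

    `scores` maps class names to retrieval similarities and must be non-empty.
    The margin-based gate is an assumed stand-in for the paper's criterion.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    candidates = [name for name, _ in ranked[:top_k]]
    # Easy sample: the best candidate is clearly ahead, so skip the costly stage.
    if len(ranked) < 2 or ranked[0][1] - ranked[1][1] >= margin:
        return candidates[0], False
    # Hard sample: pass the candidates and any relevant past-failure notes to the reasoner.
    hints = memory.lookup(candidates)
    return reasoner(candidates, hints), True
```

In this sketch, a sample whose top retrieval score leads by at least `margin` is answered directly, while ambiguous samples are routed to `reasoner` together with any stored notes about the same confusable classes. No model parameters are touched; only the memory grows.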
