2603.10978v1 Mar 11, 2026 cs.CV

GroundCount: Mitigating Counting Errors by Spatially Grounding Vision-Language Models with Object Detection

GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations

Minghao Shao (Citations: 454, h-index: 11)

Boyuan Chen (Citations: 181, h-index: 5)

Siddharth Garg (Citations: 130, h-index: 5)

Ramesh Karri (Citations: 668, h-index: 14)

Muhammad Shafique (Citations: 238, h-index: 6)

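Vision-language models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy markedly lower than on other visual reasoning tasks (excluding sentiment analysis). This holds even for state-of-the-art reasoning-capable VLMs. In contrast, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting at relatively low computational cost. This paper proposes GroundCount, a framework that integrates explicit spatial information from an object detection model into a VLM to reduce counting errors. The proposed prompt-based augmentation strategy reaches 81.3% counting accuracy on the best-performing model (Ovis2.5-2B), an improvement of 6.6 percentage points (pp), and cuts inference time for stronger models by 22% by eliminating unnecessary hallucination-driven reasoning. Comprehensive ablation studies show that positional encoding is a critical component: it helps stronger models but can degrade weaker ones. Confidence scores, by contrast, introduce noise in most architectures, and removing them improves performance in four of the five models. An evaluation of feature-level fusion architectures further shows that explicit symbolic grounding through structured prompts outperforms implicit feature fusion, even when the latter uses sophisticated cross-attention mechanisms. The method yields consistent gains (6.2 to 7.5 pp) on four of the five evaluated VLM architectures; on the remaining one, performance degrades because its iterative reflection mechanism is incompatible with structured prompts. These results suggest that counting failures stem from a fundamental limitation in spatial-semantic integration rather than from architecture-specific deficiencies, and they underscore the importance of architectural compatibility in augmentation strategies.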

Original Abstract

Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B), a 6.6pp improvement, while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2 to 7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.
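
The abstract describes prompt-based augmentation only at a high level, so the following is a minimal sketch of what turning detector output into a structured prompt might look like, not the authors' implementation. The `Detection` schema, the 3x3 position grid, the prompt wording, and the function names are all assumptions made for illustration. The two boolean flags loosely mirror the ablations mentioned above (positional encoding and confidence scores), with confidence off by default since the abstract reports that removing it helped most models.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detector output. Hypothetical schema, not taken from the paper."""
    label: str
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to [0, 1]
    confidence: float


def coarse_position(box):
    """Map a box center to a cell of an assumed 3x3 grid, e.g. 'top-left'."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    col = ("left", "center", "right")[min(int(cx * 3), 2)]
    row = ("top", "middle", "bottom")[min(int(cy * 3), 2)]
    return f"{row}-{col}"


def build_grounded_prompt(question, detections,
                          include_positions=True, include_confidence=False):
    """Prepend a symbolic summary of the detections to the user's question."""
    counts = Counter(d.label for d in detections)
    lines = ["Object detector output:"]
    for label in sorted(counts):
        lines.append(f"- {label}: {counts[label]}")
    # Per-instance details are optional so each component can be toggled off.
    if include_positions or include_confidence:
        lines.append("Instances:")
        for i, d in enumerate(detections, start=1):
            parts = [f"{i}. {d.label}"]
            if include_positions:
                parts.append(f"at {coarse_position(d.box)}")
            if include_confidence:
                parts.append(f"(confidence {d.confidence:.2f})")
            lines.append(" ".join(parts))
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up detections for demonstration only.
    dets = [
        Detection("dog", (0.0, 0.0, 0.2, 0.2), 0.90),
        Detection("dog", (0.8, 0.8, 1.0, 1.0), 0.80),
        Detection("cat", (0.4, 0.4, 0.6, 0.6), 0.70),
    ]
    print(build_grounded_prompt("How many dogs are in the image?", dets))
```

In a real pipeline the detections would come from an object detector such as YOLO and the resulting string would be passed to the VLM alongside the image; both steps are omitted here because the abstract does not specify them.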

0 Citations

0 Influential

7 Altmetric

35.0 Score
