2602.24111v1 Feb 27, 2026 cs.CV

Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification

Gourav Datta (Citations: 4, h-index: 1)

Debargha Ganguly (Citations: 34, h-index: 3)

Vikash Singh, Case Western Reserve University (Citations: 14, h-index: 2)

Vipin Chaudhary (Citations: 61, h-index: 5)

Haotian Yu (Citations: 24, h-index: 4)

Chengwei Zhou (Citations: 5, h-index: 1)

Prerna Singh (Citations: 4, h-index: 2)

Brandon Lee (Citations: 2, h-index: 1)

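Vision-language models (VLMs) show promise for drafting radiology reports, but they often suffer from logical inconsistency: they state conclusions that their own visual findings do not support, or they omit conclusions that follow logically from those findings. Existing lexical metrics over-penalize clinical paraphrasing and cannot detect these deductive errors in reference-free settings. To provide guarantees for clinical reasoning, this work introduces a neurosymbolic verification framework that deterministically checks the internal consistency of VLM-generated reports. The pipeline automatically converts free-text radiographic findings into structured propositional evidence, then uses an SMT solver (Z3) and a clinical knowledge base to determine whether each diagnostic claim is mathematically entailed, hallucinated, or omitted. In an evaluation of seven VLMs on five chest X-ray benchmarks, the proposed verifier exposes distinct reasoning failure modes, such as conservative observation and stochastic hallucination, that existing metrics cannot detect. On labeled datasets, applying solver-backed entailment checking serves as a strict post-hoc guarantee: it systematically removes unsupported hallucinations and substantially improves the diagnostic soundness and precision of generative clinical assistants.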

Original Abstract

Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic impressions unsupported by their own perceptual findings or missing logically entailed conclusions. Standard lexical metrics heavily penalize clinical paraphrasing and fail to capture these deductive failures in reference-free settings. Toward guarantees for clinical reasoning, we introduce a neurosymbolic verification framework that deterministically audits the internal consistency of VLM-generated reports. Our pipeline autoformalizes free-text radiographic findings into structured propositional evidence, utilizing an SMT solver (Z3) and a clinical knowledge base to verify whether each diagnostic claim is mathematically entailed, hallucinated, or omitted. Evaluating seven VLMs across five chest X-ray benchmarks, our verifier exposes distinct reasoning failure modes, such as conservative observation and stochastic hallucination, that remain invisible to traditional metrics. On labeled datasets, enforcing solver-backed entailment acts as a rigorous post-hoc guarantee, systematically eliminating unsupported hallucinations to significantly increase diagnostic soundness and precision in generative clinical assistants.
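
The three-way audit described in the abstract (entailed, hallucinated, omitted) can be illustrated with a minimal sketch. This is not the authors' pipeline: brute-force truth-table enumeration stands in for the Z3 solver, and the rule set, the finding names, and the `entails` and `audit` helpers are all hypothetical and carry no clinical authority. The sketch assumes findings have already been autoformalized into positive propositional atoms.

```python
from itertools import product

# Hypothetical toy knowledge base of (premises, conclusion) rules.
# Illustrative only: not clinical guidance and not the paper's knowledge base.
RULES = [
    ({"consolidation", "air_bronchogram"}, "pneumonia"),
    ({"blunted_costophrenic_angle"}, "pleural_effusion"),
]
DIAGNOSES = ["pneumonia", "pleural_effusion", "pneumothorax"]


def entails(findings, rules, claim):
    """Return True if every model of (findings AND rules) satisfies claim.

    Truth-table enumeration is used here as a stand-in for an SMT solver.
    """
    atoms = sorted(set(findings) | {claim}
                   | {a for prem, concl in rules for a in prem | {concl}})
    for values in product([False, True], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        if not all(model[f] for f in findings):
            continue  # assignment contradicts an asserted finding
        if any(all(model[p] for p in prem) and not model[concl]
               for prem, concl in rules):
            continue  # assignment violates a knowledge-base rule
        if not model[claim]:
            return False  # counterexample: premises hold but claim is false
    return True


def audit(findings, claimed, rules=RULES, diagnoses=DIAGNOSES):
    """Label each diagnosis as entailed, hallucinated, or omitted."""
    report = {}
    for d in diagnoses:
        proved = entails(findings, rules, d)
        if d in claimed:
            report[d] = "entailed" if proved else "hallucinated"
        elif proved:
            report[d] = "omitted"
    return report


if __name__ == "__main__":
    findings = {"consolidation", "air_bronchogram", "blunted_costophrenic_angle"}
    claimed = {"pneumonia", "pneumothorax"}
    print(audit(findings, claimed))
```

In this toy case the claimed pneumonia is supported by the findings, the claimed pneumothorax has no supporting derivation and is flagged as hallucinated, and pleural effusion follows from the findings but was never claimed, so it is flagged as omitted. A real solver-backed version would pose the same question as an unsatisfiability check on the findings, the rules, and the negated claim.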
