2604.23398v1 Apr 25, 2026 cs.AI

When Corrective Hints Hurt: Prompt Design in Reasoner-Guided Repair of LLM Overcaution on Entailed Negations under OWL 2 DL

Yujia Liu

Yijiashun Qi

Xiangjun Xu

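We report a reproducible error pattern in GPT-5.4 on OWL 2 DL compliance queries: the model frequently answers "unknown" when the reasoner-entailed answer is "no", specifically under FunctionalProperty closure or class disjointness. Using 180 reasoner-audited queries generated by procedurally expanding the observed pattern, plus 18 hand-authored held-out queries from two unrelated domains (insurance and clinical), we compare four interaction modes under a matched query budget: (1) a single-shot answer; (2) three rounds of generic "you are wrong" retry; (3) three rounds of repair using the reasoner's verdict together with an open-world-assumption (OWA) hint; and (4) three rounds of repair using the reasoner's verdict alone, without the hint. Direct faithfulness is 43.9% (Wilson 95% CI [36.8, 51.2]). Generic retry reaches 81.7% ([75.4, 86.6]). The verdict-with-hint variant is worse than generic retry, at 67.2% ([60.1, 73.7]), while the verdict-only variant reaches 97.8% ([94.4, 99.1]). All pairwise comparisons remain significant under McNemar's exact test with Bonferroni correction (α = 0.01; all p < 10^-5). The same error fingerprint accounts for all 4 of the 4 errors on the held-out queries. Our interpretation is bounded: prompt framing can matter more than corrective content, and reasoner-guided wrappers should be ablated explicitly.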

Original Abstract

We report a reproducible error pattern in GPT-5.4 on OWL~2~DL compliance queries: the model frequently answers ``unknown'' when the reasoner-entailed answer is ``no'' under \emph{FunctionalProperty} closure or class \emph{disjointness}. Using 180 reasoner-audited queries from a procedural expansion of the observed pattern plus 18 hand-authored held-out queries in two unrelated domains (insurance and clinical), we compare four interaction modes under matched query budget: single-shot, three rounds of generic ``you-are-wrong'' retry, three rounds of reasoner-verdict repair with an open-world-assumption (OWA) hint, and the same repair without the hint. Direct faithfulness is 43.9\,\% (Wilson 95\,\% CI $[36.8,51.2]$); generic retry reaches 81.7\,\% ($[75.4,86.6]$); the verdict-with-hint variant is \emph{worse} at 67.2\,\% ($[60.1,73.7]$); the verdict-only variant reaches 97.8\,\% ($[94.4,99.1]$). All pairwise comparisons remain significant under McNemar's exact test with Bonferroni correction ($\alpha = 0.01$; all $p < 10^{-5}$). The same fingerprint accounts for 4/4 errors on the held-out queries. Our interpretation is bounded: prompt framing can matter more than corrective content, and reasoner-guided wrappers should be ablated explicitly.
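The statistics quoted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The per-mode correct counts (79, 147, 121, 176 out of 180) are an assumption: they are reconstructed here from the reported percentages and do not appear in the abstract. The discordant-pair counts passed to the McNemar function are purely hypothetical, since the abstract does not report them, and reading 0.01 as the family-wise level split over six pairwise comparisons is likewise an assumption.

```python
from math import comb, sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95 %
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b and c are the two discordant-pair counts (mode A right and mode B wrong,
    and the reverse). Under the null hypothesis, min(b, c) follows
    Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


# ASSUMPTION: correct-answer counts reconstructed from the reported percentages.
N_QUERIES = 180
CORRECT = {
    "single-shot": 79,
    "generic retry": 147,
    "verdict + OWA hint": 121,
    "verdict only": 176,
}

for mode, k in CORRECT.items():
    lo, hi = wilson_interval(k, N_QUERIES)
    print(f"{mode:20s} {100 * k / N_QUERIES:5.1f}%  [{100 * lo:.1f}, {100 * hi:.1f}]")

# ASSUMPTION: 0.01 is the family-wise level, divided over all pairs of modes.
n_comparisons = comb(len(CORRECT), 2)
threshold = 0.01 / n_comparisons

# HYPOTHETICAL discordant counts, for illustration only (not from the paper).
p_value = mcnemar_exact(b=30, c=4)
print(f"p = {p_value:.2e}, significant at corrected level: {p_value < threshold}")
```

Under the reconstructed counts, the printed Wilson intervals should agree with the ones quoted in the abstract to one decimal place, which is the only reason those counts were chosen. The McNemar call is there to show the shape of the test; reproducing the reported p-values would require the paper's per-query outcomes.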
