2604.08401v1 Apr 09, 2026 cs.AI

Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing

Xuehe Wang (Citations: 32, h-index: 3)

Jinfeng Xu (Citations: 348, h-index: 10)

Edith C. H. Ngai (Citations: 344, h-index: 10)

Wenhao Yuan (Citations: 27, h-index: 3)

Chenchen Lin (Citations: 1, h-index: 1)

Jian Chen (Citations: 0, h-index: 0)

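In large language model (LLM) agents, reasoning trajectories are treated as reliable internal information used to guide actions and update memory. However, even coherent reasoning can violate logical or evidential constraints, so unsupported beliefs are repeatedly stored and propagated across decision steps, which can cause systematic behavioral drift in long-horizon agentic systems. Most existing strategies rely on consensus mechanisms, which equate agreement with faithfulness. Motivated by the vulnerability of unfaithful intermediate reasoning trajectories, this paper proposes Self-Audited Verified Reasoning (SAVeR), a new framework that verifies the agent's internal belief states before an action is executed in order to achieve faithful reasoning. Specifically, it structurally generates diverse persona-based candidate beliefs and selects among them in a faithfulness-relevant structure space. To achieve faithful reasoning, it performs adversarial auditing to identify violations and repairs them through constraint-guided minimal interventions under verifiable acceptance criteria. Extensive experiments on six benchmark datasets show that the proposed method consistently improves reasoning faithfulness while maintaining competitive end-task performance.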

Original Abstract

In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory. However, coherent reasoning can still violate logical or evidential constraints, allowing unsupported beliefs to be repeatedly stored and propagated across decision steps, leading to systematic behavioral drift in long-horizon agentic systems. Most existing strategies rely on the consensus mechanism, conflating agreement with faithfulness. In this paper, inspired by the vulnerability of unfaithful intermediate reasoning trajectories, we propose Self-Audited Verified Reasoning (SAVeR), a novel framework that enforces verification over internal belief states within the agent before action commitment, achieving faithful reasoning. Concretely, we structurally generate persona-based diverse candidate beliefs for selection under a faithfulness-relevant structure space. To achieve reasoning faithfulness, we perform adversarial auditing to localize violations and repair through constraint-guided minimal interventions under verifiable acceptance criteria. Extensive experiments on six benchmark datasets demonstrate that our approach consistently improves reasoning faithfulness while preserving competitive end-task performance.
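
The verify-before-commit idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no algorithmic detail, so every name below (`Belief`, `evidence_supported`, `audit`, `repair`, `verify_before_commit`, `max_repairs`) is hypothetical, the single evidence constraint is an assumed stand-in for the paper's logical and evidential constraints, and the candidate beliefs are passed in directly rather than produced by persona-based generation. It only shows the control flow: audit a candidate belief, apply a minimal repair if it fails, and write to memory only once the audit passes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Belief:
    """A candidate belief plus the ids of the observations it cites (hypothetical structure)."""
    claim: str
    evidence: List[str] = field(default_factory=list)


# A constraint returns a violation message, or None when the belief passes.
Constraint = Callable[[Belief, Set[str]], Optional[str]]


def evidence_supported(belief: Belief, observations: Set[str]) -> Optional[str]:
    """Assumed evidential constraint: cite at least one observation, and only real ones."""
    if not belief.evidence:
        return "no evidence cited"
    missing = [e for e in belief.evidence if e not in observations]
    if missing:
        return f"cites unobserved evidence: {missing}"
    return None


def audit(belief: Belief, observations: Set[str],
          constraints: List[Constraint]) -> List[str]:
    """Collect every violation so the failing part of the belief can be localized."""
    return [msg for c in constraints
            if (msg := c(belief, observations)) is not None]


def repair(belief: Belief, observations: Set[str]) -> Belief:
    """Minimal intervention: keep the claim, drop only the unsupported citations."""
    kept = [e for e in belief.evidence if e in observations]
    return Belief(belief.claim, kept)


def verify_before_commit(candidates: List[Belief], observations: Set[str],
                         constraints: List[Constraint], memory: List[Belief],
                         max_repairs: int = 1) -> Optional[Belief]:
    """Commit the first candidate that passes the audit, repairing at most max_repairs times."""
    for belief in candidates:
        for attempt in range(max_repairs + 1):
            if not audit(belief, observations, constraints):
                memory.append(belief)  # commit only after verification succeeds
                return belief
            if attempt < max_repairs:
                belief = repair(belief, observations)
    return None  # nothing verifiable: memory is left untouched


observations = {"obs1", "obs2"}
candidates = [
    Belief("the door is locked", ["obs3"]),               # cites nothing that was observed
    Belief("the key is in the drawer", ["obs1", "obs9"]),  # partially supported
]
memory: List[Belief] = []
committed = verify_before_commit(candidates, observations,
                                 [evidence_supported], memory)
print(committed)
```

In this toy run the first candidate is rejected because its repair leaves no evidence at all, while the second is committed after its one unsupported citation is removed. Agreement among candidates plays no role in the decision, which is the contrast with consensus-based selection that the abstract draws.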

0 Citations

0 Influential

5 Altmetric

25.0 Score
