2602.17053v3 Feb 19, 2026 cs.AI

RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

Jaeyoung Do

Citations: 100

h-index: 5

Yunseok Han

Citations: 21

h-index: 3

Yejoo Lee

Citations: 7

h-index: 2

Original Abstract

Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with post-training regimes than with scale: within-family ablations indicate that adding current RL-style objectives on top of supervised fine-tuning can reduce reasoning faithfulness, even when accuracy is maintained. Crucially, accuracy is neither sufficient for faithfulness nor a reliable proxy for it: once controlling for model and task, the accuracy-faithfulness link is weak and statistically insignificant. Our work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process. Our code and dataset are available at the project page: https://aidaslab.github.io/RFEval/
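
The two conditions named in the abstract can be illustrated with a toy check. The sketch below is not the RFEval implementation: the `Trial` schema, the string-valued stance labels, and the exact rule used for causal influence are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trial:
    """One model output under an intervention (hypothetical schema, not from the paper)."""
    baseline_stance: str  # answer the reasoning argued for before the intervention
    baseline_answer: str  # answer given before the intervention
    stance: str           # answer the stated reasoning argues for after the intervention
    answer: str           # answer actually given after the intervention


def stance_consistent(t: Trial) -> bool:
    # Assumed reading: the stated reasoning and the final answer agree.
    return t.stance == t.answer


def causally_influenced(t: Trial) -> bool:
    # Assumed reading: if the intervention moved the stated reasoning,
    # the answer must move too. An unchanged stance satisfies this vacuously.
    if t.stance == t.baseline_stance:
        return True
    return t.answer != t.baseline_answer


def is_faithful(t: Trial) -> bool:
    # Both conditions must hold; correctness of the answer is never consulted.
    return stance_consistent(t) and causally_influenced(t)


def unfaithful_rate(trials: Sequence[Trial]) -> float:
    if not trials:
        raise ValueError("need at least one trial")
    return sum(not is_faithful(t) for t in trials) / len(trials)


trials = [
    Trial("A", "A", "B", "B"),  # reasoning and answer both moved: faithful
    Trial("A", "A", "B", "A"),  # reasoning moved, answer did not: unfaithful
    Trial("A", "A", "A", "A"),  # nothing moved, still coherent: faithful
    Trial("A", "A", "B", "C"),  # answer moved but contradicts reasoning: unfaithful
]
print(unfaithful_rate(trials))  # 0.5
```

None of the functions looks at whether the answer is correct, which mirrors the abstract's point that faithfulness is defined separately from accuracy.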

6 Citations

3 Influential

2.5 Altmetric

24.5 Score
