2602.16154v1 Feb 18, 2026 cs.CL


Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution

Elias Stengel-Eskin, Hyunji Lee, Nithin Sivakumaran, Shoubin Yu, Yue Zhang, Ali Payani, Mohit Bansal


Abstract

Chain-of-thought (CoT) reasoning sometimes fails to faithfully reflect the true computation of a large language model (LLM), hampering its utility in explaining how LLMs arrive at their answers. Moreover, optimizing for faithfulness and interpretability in reasoning often degrades task performance. To address this tradeoff and improve CoT faithfulness, we propose Reasoning Execution by Multiple Listeners (REMUL), a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. A speaker model generates a reasoning trace, which is truncated and passed to a pool of listener models who "execute" the trace, continuing the trace to an answer. Speakers are rewarded for producing reasoning that is clear to listeners, with additional correctness regularization via masked supervised finetuning to counter the tradeoff between faithfulness and performance. On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness -- hint attribution, early answering area over the curve (AOC), and mistake injection AOC -- while also improving accuracy. Our analysis finds that these gains are robust across training domains, translate to legibility gains, and are associated with shorter and more direct CoTs.
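The abstract's core mechanism — truncate a speaker's reasoning trace, let a pool of listeners "execute" the prefix to an answer, and reward the speaker for how many listeners succeed — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the truncation fraction, the reward shape (fraction of correct listeners), and the function names are all assumptions for exposition.

```python
def truncate_trace(trace_steps, frac):
    """Keep the leading `frac` fraction of reasoning steps (at least one).

    Hypothetical helper: the paper does not specify the truncation rule.
    """
    k = max(1, int(len(trace_steps) * frac))
    return trace_steps[:k]

def listener_reward(trace_steps, listeners, gold_answer, frac=0.5):
    """Score a speaker's trace by how many listeners, continuing from a
    truncated prefix, still reach the gold answer. A trace that is clear
    enough for other parties to "execute" earns a higher reward.

    `listeners` are callables mapping a trace prefix to a final answer;
    in REMUL they would be separate listener models.
    """
    prefix = truncate_trace(trace_steps, frac)
    hits = sum(1 for listener in listeners if listener(prefix) == gold_answer)
    return hits / len(listeners)

# Toy usage: lambdas stand in for listener models continuing the prefix.
trace = ["Step 1: restate the puzzle",
         "Step 2: eliminate inconsistent options",
         "Step 3: conclude"]
listeners = [lambda p: "42", lambda p: "42", lambda p: "17"]
reward = listener_reward(trace, listeners, gold_answer="42", frac=0.5)
```

In the actual method this reward would feed a reinforcement-learning update for the speaker, with masked supervised finetuning added as correctness regularization; the sketch only captures the reward signal's shape.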


