2602.02313v1 Feb 02, 2026 cs.AI

Interpreting and Controlling LLM Reasoning through Integrated Policy Gradient

Kan Ren (Citations: 2, h-index: 1)

Changming Li (Citations: 29, h-index: 2)

Kaixin Zhang (Citations: 21, h-index: 2)

Haoyun Xu (Citations: 36, h-index: 3)

Yingdong Shi (Citations: 87, h-index: 3)

Zheng Zhang (Citations: 78, h-index: 3)

Kaitao Song (Citations: 8, h-index: 2)

Abstract

Large language models (LLMs) demonstrate strong reasoning abilities in solving complex real-world problems. Yet the internal mechanisms driving these complex reasoning behaviors remain opaque. Existing interpretability approaches targeting reasoning either identify components (e.g., neurons) correlated with special textual patterns, or rely on human-annotated contrastive pairs to derive control vectors. Consequently, current methods struggle to precisely localize complex reasoning mechanisms or to capture the sequential influence from a model's internal workings to its reasoning outputs. In this paper, building on outcome-oriented and sequential-influence-aware principles, we focus on identifying components that contribute sequentially to reasoning behavior, where outcomes accumulate through long-range effects. We propose Integrated Policy Gradient (IPG), a novel framework that attributes reasoning behaviors to a model's internal components by propagating compound outcome-based signals, such as post-reasoning accuracy, backward through model inference trajectories. Empirical evaluations demonstrate that our approach achieves more precise localization and enables reliable modulation of reasoning behaviors (e.g., reasoning capability, reasoning strength) across diverse reasoning models.

3 Citations

0 Influential

1.5 Altmetric

10.5 Score
