2603.08706v1 Mar 09, 2026 cs.AI

Agentic Critical Training

Souradip Chakraborty

Citations: 1,204

h-index: 20

Zhejiang University

Citations: 38

h-index: 3

Minghui Liu

Citations: 4

h-index: 2

Sy-Tuyen Ho

Citations: 28

h-index: 2

Xiyao Wang

Citations: 350

h-index: 7

Furong Huang

Citations: 8

h-index: 2

Abstract

Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.
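The core idea in the abstract, rewarding the model only for whether its judgment about the better action is correct, can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt layout, the `Answer: A` / `Answer: B` output convention, the names `ComparisonSample`, `build_sample`, and `judgment_reward`, and the choice to treat the expert action as the "better" label are all assumptions made for illustration.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class ComparisonSample:
    prompt: str   # text shown to the model (hypothetical format)
    correct: str  # "A" or "B": the slot holding the better action


def build_sample(state: str, expert_action: str, alternative_action: str,
                 rng: random.Random) -> ComparisonSample:
    """Pair the expert action with an alternative, in random order.

    Assumption: the expert action is treated as the better one.
    Randomizing the slot keeps the model from learning a position bias.
    """
    expert_first = rng.random() < 0.5
    if expert_first:
        a, b = expert_action, alternative_action
    else:
        a, b = alternative_action, expert_action
    prompt = (
        f"State:\n{state}\n\n"
        f"Action A: {a}\nAction B: {b}\n\n"
        "Reason about which action is better, then finish with "
        "'Answer: A' or 'Answer: B'."
    )
    return ComparisonSample(prompt=prompt, correct="A" if expert_first else "B")


def judgment_reward(response: str, sample: ComparisonSample) -> float:
    """Return 1.0 if the last stated answer picks the better action, else 0.0.

    Only the final verdict is scored; the reasoning text before it is
    unconstrained, so no reflection text is imitated.
    """
    answers = re.findall(r"Answer:\s*([AB])\b", response)
    if not answers:
        return 0.0  # unparseable output earns no reward
    return 1.0 if answers[-1] == sample.correct else 0.0
```

In a reinforcement learning loop, a scalar such as `judgment_reward(model_output, sample)` would be passed to whatever policy-optimization method is in use; the abstract does not specify that choice, so it is left out of the sketch.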
