2604.02006v1 Apr 02, 2026 cs.AI


ProCeedRL: Process Critic with Exploratory Demonstration Reinforcement Learning for LLM Agentic Reasoning

Yanjiang Guo

Citations: 924

h-index: 12

Jingyue Gao

Citations: 7

h-index: 1

Xiaoshuai Chen

Citations: 5

h-index: 2

Jianyu Chen

Citations: 1

h-index: 1

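Reinforcement learning (RL) substantially improves the reasoning ability of large language models (LLMs), but applying it to multi-turn agentic tasks remains difficult because interactions span long horizons and environment feedback is uncertain. We identify a structural failure mode in agent exploration: a suboptimal action triggers noisy observations, which form a misleading context that further degrades subsequent decisions and makes recovery progressively harder. This compounding feedback loop of errors undermines standard exploration strategies and leaves them sensitive to the model's reasoning ability and the randomness of the environment. To address this, we propose ProCeedRL (Process Critic with Explorative Demonstration RL), which shifts exploration from passive selection to active intervention. ProCeedRL uses a process-level critic that monitors interactions in real time and incorporates reflection-based demonstrations to guide the agent in halting the accumulation of errors. This approach significantly exceeds the model's saturated exploration performance and shows substantial exploratory benefits. By learning from exploratory demonstrations and on-policy samples, ProCeedRL greatly improves exploration efficiency and achieves strong performance on complex deep search and embodied tasks.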

Original Abstract

Reinforcement Learning (RL) significantly enhances the reasoning abilities of large language models (LLMs), yet applying it to multi-turn agentic tasks remains challenging due to the long-horizon nature of interactions and the stochasticity of environmental feedback. We identify a structural failure mode in agentic exploration: suboptimal actions elicit noisy observations into misleading contexts, which further weaken subsequent decision-making, making recovery increasingly difficult. This cumulative feedback loop of errors renders standard exploration strategies ineffective and susceptible to the model's reasoning and the environment's randomness. To mitigate this issue, we propose ProCeedRL: Process Critic with Explorative Demonstration RL, shifting exploration from passive selection to active intervention. ProCeedRL employs a process-level critic to monitor interactions in real time, incorporating reflection-based demonstrations to guide agents in stopping the accumulation of errors. We find that this approach significantly exceeds the model's saturated exploration performance, demonstrating substantial exploratory benefits. By learning from exploratory demonstrations and on-policy samples, ProCeedRL significantly improves exploration efficiency and achieves superior performance on complex deep search and embodied tasks.
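The abstract describes the intervention loop only in words, so a toy illustration may help. The following is a minimal sketch, not the authors' implementation: it assumes a critic that accepts or rejects each proposed action before it reaches the environment, and a reflection step that supplies a replacement action when the critic rejects. All names here (`rollout_with_critic`, `Step`, `make_counter_env`, and the toy policy, critic, and reflection functions) are hypothetical, and the counter environment stands in for a real agentic task.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    observation: str  # observation the agent saw before acting
    action: str       # action actually sent to the environment
    from_demo: bool   # True if the action came from an intervention, not the policy


def rollout_with_critic(
    policy: Callable[[List[str]], str],
    critic: Callable[[List[str], str], bool],
    reflect: Callable[[List[str], str], str],
    env_step: Callable[[str], Tuple[str, bool]],
    first_observation: str,
    max_turns: int = 8,
) -> List[Step]:
    """Collect one trajectory, letting the critic veto actions step by step."""
    context = [first_observation]
    trajectory: List[Step] = []
    observation = first_observation
    for _ in range(max_turns):
        action = policy(context)
        from_demo = False
        # Assumed intervention rule: a rejected action is replaced before it
        # can produce a noisy observation and pollute the context.
        if not critic(context, action):
            action = reflect(context, action)
            from_demo = True
        trajectory.append(Step(observation, action, from_demo))
        observation, done = env_step(action)
        context.extend([action, observation])
        if done:
            break
    return trajectory


def make_counter_env(target: int) -> Callable[[str], Tuple[str, bool]]:
    """Toy environment: 'inc' raises a counter; the episode ends at `target`."""
    count = 0

    def env_step(action: str) -> Tuple[str, bool]:
        nonlocal count
        if action == "inc":
            count += 1
        return f"count={count}", count >= target

    return env_step


def toy_policy(context: List[str]) -> str:
    # Deliberately makes a poor first move so the critic has something to catch.
    return "noop" if len(context) == 1 else "inc"


def toy_critic(context: List[str], action: str) -> bool:
    # Accepts only actions that make progress in the toy environment.
    return action == "inc"


def toy_reflect(context: List[str], rejected_action: str) -> str:
    # Stand-in for a reflection-based demonstration.
    return "inc"


if __name__ == "__main__":
    for step in rollout_with_critic(
        toy_policy, toy_critic, toy_reflect, make_counter_env(3), "count=0"
    ):
        print(step)
```

The `from_demo` flag marks which steps came from an intervention, so a training routine could treat demonstration steps and on-policy steps differently. How ProCeedRL actually combines the two is not stated in the abstract.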

0 Citations

0 Influential

6 Altmetric

30.0 Score

