2602.12268v2 Feb 12, 2026 cs.AI

CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

Xun Wang (Citations: 14; h-index: 1)
Yebowen Hu (Citations: 325; h-index: 7)
Sathish Indurthi (Citations: 184; h-index: 6)
Shujian Liu, Yanshan University (Citations: 43; h-index: 3)
Zhen Zhang (Citations: 533; h-index: 7)
Kaiqiang Song (Citations: 139; h-index: 5)
Weixiang Yan (Citations: 74; h-index: 2)
Chenyang Zhao (Citations: 133; h-index: 3)
Simin Ma (Citations: 18; h-index: 2)
Xiaoyang Wang (Citations: 538; h-index: 5)
X. Wang (Citations: 4; h-index: 1)
Song Wang (Citations: 3; h-index: 1)
Henry Peng Zou, University of Illinois Chicago (Citations: 635; h-index: 14)
Haoyun Deng, Zoom Communications, Inc. (Citations: 100; h-index: 2)

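AI agents are increasingly used to solve real-world problems by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning (RL) in these settings remains difficult. Realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors, and RL for multi-turn, multi-step agentic tool use is still underexplored. In addition, building and maintaining executable tool environments is costly, which limits scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended evaluation into more stable classification-style decisions. To balance stability and informativeness, our method assigns rewards sparsely while keeping the evaluation criteria dense. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning (SFT) models. Starting from an 8B base model and training on an RL dataset of 8,000 examples, CM2 improves over the SFT model by 8 points on τ²-Bench, 10 points on BFCL-V4, and 12 points on ToolSandbox. These results match or even exceed similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable method for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.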

Original Abstract

AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on τ²-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
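
To make the "dense criteria, sparse reward" idea concrete, the following is a minimal sketch of how a checklist reward might be computed. It is not the paper's implementation: the `Criterion` schema, the `keyword_judge` stub (standing in for an LLM judge), and the choice to average the binary verdicts into one scalar per trajectory are all assumptions made for illustration, since the abstract does not specify the aggregation rule.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One binary checklist item (hypothetical schema, not from the paper)."""
    turn: int          # index of the transcript turn used as evidence
    description: str   # human-readable statement of the intended behavior
    keyword: str       # used only by the toy keyword judge below


# A judge maps (criterion, transcript) to a binary verdict.
# In practice this would be an LLM call; that part is assumed here.
Judge = Callable[[Criterion, List[str]], bool]


def keyword_judge(criterion: Criterion, transcript: List[str]) -> bool:
    """Toy stand-in for an LLM judge: pass if the keyword appears in the turn."""
    return criterion.keyword in transcript[criterion.turn]


def checklist_reward(checklist: List[Criterion],
                     transcript: List[str],
                     judge: Judge) -> float:
    """Score many binary criteria, then emit a single scalar for the trajectory.

    Averaging the verdicts is an assumption made for this sketch.
    """
    if not checklist:
        return 0.0
    passed = sum(judge(c, transcript) for c in checklist)
    return passed / len(checklist)


transcript = [
    "call search_flights(origin='SFO')",
    "I booked your flight.",
]
checklist = [
    Criterion(0, "calls the flight search tool", "search_flights"),
    Criterion(0, "passes the origin argument", "origin="),
    Criterion(1, "explicitly confirms the booking", "confirmed"),
]
reward = checklist_reward(checklist, transcript, keyword_judge)
print(round(reward, 3))  # 0.667 (two of three criteria pass)
```

Each verdict is a yes/no classification tied to a specific turn, which is the property the abstract credits with making judging more stable than scoring an open-ended response directly.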
