2601.03555v1 Jan 07, 2026 cs.AI

SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

Original Abstract

Training reliable tool-augmented agents remains a significant challenge, largely due to the difficulty of credit assignment in multi-step reasoning. While process-level reward models offer a promising direction, existing LLM-based judges often produce noisy and inconsistent signals because they lack fine-grained, task-specific rubrics to distinguish high-level planning from low-level execution. In this work, we introduce SCRIBE (Skill-Conditioned Reward with Intermediate Behavioral Evaluation), a reinforcement learning framework that intervenes at a novel mid-level abstraction. SCRIBE grounds reward modeling in a curated library of skill prototypes, transforming open-ended LLM evaluation into a constrained verification problem. By routing each subgoal to a corresponding prototype, the reward model is equipped with precise, structured rubrics that substantially reduce reward variance. Experimental results show that SCRIBE achieves state-of-the-art performance across a range of reasoning and tool-use benchmarks. In particular, it improves the AIME25 accuracy of a Qwen3-4B model from 43.3% to 63.3%, and significantly increases success rates in complex multi-turn tool interactions. Further analysis of training dynamics reveals a co-evolution across abstraction levels, where mastery of mid-level skills consistently precedes the emergence of effective high-level planning behaviors. Finally, we demonstrate that SCRIBE is additive to low-level tool optimizations, providing a scalable and complementary pathway toward more autonomous and reliable tool-using agents.
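The abstract describes routing each subgoal to a skill prototype and scoring it against that prototype's rubric. A minimal sketch of that idea follows; it is not the authors' implementation. The prototype names, keywords, rubric items, the keyword-overlap router, and the `checks` dictionary (which stands in for the verdicts an LLM judge would return) are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SkillPrototype:
    """Hypothetical skill prototype: a name, routing keywords, and a rubric."""
    name: str
    keywords: tuple[str, ...]
    rubric: tuple[str, ...]


# Hypothetical two-entry library; the paper's curated library is not shown in the abstract.
LIBRARY = (
    SkillPrototype(
        name="numeric_computation",
        keywords=("compute", "calculate", "evaluate"),
        rubric=("calls a tool", "reports a number"),
    ),
    SkillPrototype(
        name="information_lookup",
        keywords=("search", "retrieve", "lookup"),
        rubric=("issues a query", "cites the result"),
    ),
)


def route(subgoal: str, library: tuple[SkillPrototype, ...]) -> SkillPrototype:
    """Pick the prototype whose keywords overlap most with the subgoal text.

    Keyword overlap is an assumed stand-in for whatever router the paper uses.
    """
    words = set(subgoal.lower().split())
    return max(library, key=lambda p: len(words & set(p.keywords)))


def rubric_reward(checks: dict[str, bool], prototype: SkillPrototype) -> float:
    """Fraction of rubric items judged satisfied; missing items count as failed.

    `checks` stands in for per-item verdicts from a judge model (assumption).
    """
    return mean(1.0 if checks.get(item, False) else 0.0 for item in prototype.rubric)


prototype = route("compute the remainder with the calculator", LIBRARY)
reward = rubric_reward({"calls a tool": True, "reports a number": False}, prototype)
```

Because the judge only answers a fixed set of yes/no rubric items per prototype, the score lives on a small discrete scale, which is one plausible reading of how a constrained verification problem could reduce reward variance relative to open-ended scoring.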
