2604.11365v1 Apr 13, 2026 cs.AI

Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories

Di Liang

Citations: 48

h-index: 4

Peiyang Liu

Citations: 91

h-index: 6

Wei Ye

Citations: 72

h-index: 2

Zhirui Chen

Citations: 89

h-index: 4

Youru Li

Citations: 2

h-index: 1

Xi Wang

Citations: 147

h-index: 5

Zhipeng Cai

Citations: 8

h-index: 1

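Monte Carlo Tree Search (MCTS) is widely used to explore reasoning data automatically, but current methods for extracting supervision from the search remain inefficient. Existing approaches keep only the single highest-reward path and throw away the comparative information contained in the many other paths that were explored. This work introduces **Contrastive Reasoning Path Synthesis (CRPS)**, a framework that turns supervision extraction from a filtering step into a synthesis step. CRPS uses a structured reflection process to analyze how high-quality and low-quality search trajectories differ, and extracts explicit information about strategic pivot points and local failure modes. These insights are then used to synthesize reasoning chains that incorporate the successful patterns while avoiding the identified pitfalls. In experiments, models fine-tuned on 60K CRPS-synthesized examples match or exceed baselines trained on 590K examples produced by standard rejection sampling, which the authors report as a 20-fold reduction in dataset size. CRPS also improves generalization on out-of-domain benchmarks, indicating that learning from the contrast between success and failure yields more transferable reasoning ability than learning from success alone.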

Original Abstract

Monte Carlo Tree Search (MCTS) has been widely used for automated reasoning data exploration, but current supervision extraction methods remain inefficient. Standard approaches retain only the single highest-reward trajectory, discarding the comparative signals present in the many explored paths. Here we introduce **Contrastive Reasoning Path Synthesis (CRPS)**, a framework that transforms supervision extraction from a filtering process into a synthesis procedure. CRPS uses a structured reflective process to analyze the differences between high- and low-quality search trajectories, extracting explicit information about strategic pivots and local failure modes. These insights guide the synthesis of reasoning chains that incorporate success patterns while avoiding identified pitfalls. We show empirically that models fine-tuned on just 60K CRPS-synthesized examples match or exceed the performance of baselines trained on 590K examples derived from standard rejection sampling, a 20× reduction in dataset size. Furthermore, CRPS improves generalization on out-of-domain benchmarks, demonstrating that learning from the contrast between success and failure produces more transferable reasoning capabilities than learning from success alone.
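
The difference between filtering and contrast-based extraction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Trajectory` type, the `margin` threshold, and the use of the first differing step as a stand-in for a "strategic pivot" are all assumptions made for illustration, and the reflective analysis and synthesis stages are not modeled at all.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    """Hypothetical container for one root-to-leaf search path."""
    steps: List[str]   # reasoning steps along the path
    reward: float      # scalar reward assigned by the search


def best_only(trajectories: List[Trajectory]) -> Trajectory:
    """Filtering-style extraction: keep the single highest-reward path."""
    return max(trajectories, key=lambda t: t.reward)


def first_divergence(a: Trajectory, b: Trajectory) -> int:
    """Index of the first step at which two paths differ.

    If one path is a prefix of the other, the length of the shorter
    path is returned.
    """
    for i, (x, y) in enumerate(zip(a.steps, b.steps)):
        if x != y:
            return i
    return min(len(a.steps), len(b.steps))


def contrastive_pairs(
    trajectories: List[Trajectory], margin: float = 0.5
) -> List[Tuple[Trajectory, Trajectory, int]]:
    """Pair the best path with every path that trails it by at least `margin`.

    Each result is (high, low, divergence_index). The margin value is an
    assumed knob, not something taken from the paper.
    """
    best = best_only(trajectories)
    pairs = []
    for t in trajectories:
        if t is not best and best.reward - t.reward >= margin:
            pairs.append((best, t, first_divergence(best, t)))
    return pairs


# Toy search result with made-up steps and rewards.
trajs = [
    Trajectory(["parse", "set up equation", "solve", "verify"], 1.0),
    Trajectory(["parse", "guess answer"], 0.1),
    Trajectory(["parse", "set up equation", "arithmetic slip"], 0.3),
    Trajectory(["parse", "set up equation", "solve"], 0.8),
]
pairs = contrastive_pairs(trajs)
for high, low, idx in pairs:
    print(f"diverge at step {idx}: {high.steps[idx]!r} vs {low.steps[idx]!r}")
```

With filtering alone, only the first path in `trajs` would be kept. The pairing step additionally keeps the two clearly worse paths together with the step at which each departs from the best one, which is the kind of comparative signal the abstract says is normally discarded.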

1 Citations

0 Influential

3 Altmetric

16.0 Score
