2601.22532v1 Jan 30, 2026 cs.LG

Demystifying Design Choices of Reinforcement Fine-tuning: A Batched Contextual Bandit Learning Perspective

Hong Xie (citations: 5; h-index: 1)
Tao Tan (citations: 10; h-index: 2)
Xin Li (citations: 0; h-index: 0)
Enhong Chen (citations: 1,797; h-index: 20)
Jianyu Han (citations: 97; h-index: 1)
Xiao Hu (citations: 22; h-index: 1)
Haoran Gu (citations: 0; h-index: 0)
Defu Lian (citations: 13,393; h-index: 51)

Abstract

Reinforcement fine-tuning is seeing an explosion of papers, most of them on optimizing design choices. Although performance gains are often claimed, inconsistent conclusions also arise from time to time, which makes the progress look illusory. Reflecting on this illusion, we still lack principled answers to two fundamental questions: 1) what is the role of each design choice? 2) which ones are critical? This paper aims to shed light on them. The underlying challenge is that design choices are entangled, making their contributions to learning and generalization difficult to attribute. To address this challenge, we first construct a minimalist baseline for disentangling factors: one rollout per query in each round, the outcome reward serving as the training signal without any advantage trick, and a batch size of 32. This baseline connects to batched contextual bandit learning, which facilitates experimental analysis. Centering on this baseline, we design an experiment pipeline that examines the marginal gains of factors such as the advantage and the number of rollouts. Experiments on three base models and two datasets not only reveal a new understanding of the role of various design choices in learning and generalization dynamics, but also identify the critical ones that deserve more effort.
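The minimalist baseline described in the abstract (one rollout per query, the raw outcome reward as the training signal, a batch of 32) can be illustrated with a toy batched contextual bandit. The sketch below is not the authors' implementation: the tabular softmax policy, the 0/1 reward, the function name `train_minimal_baseline`, and every hyperparameter other than the batch size of 32 are assumptions made for illustration. Queries play the role of contexts, candidate responses play the role of arms, and the policy is updated once per batch with a plain REINFORCE step that uses no baseline or advantage normalization.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_minimal_baseline(num_queries=8, num_actions=4, rounds=200,
                           batch_size=32, lr=2.0, seed=0):
    """Toy batched contextual bandit: one rollout per query, raw outcome reward.

    The tabular policy, the 0/1 reward, and all hyperparameters except
    batch_size=32 are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    # Hypothetical ground truth: each query (context) has one correct response (arm).
    correct = [rng.randrange(num_actions) for _ in range(num_queries)]
    logits = [[0.0] * num_actions for _ in range(num_queries)]
    mean_rewards = []

    for _ in range(rounds):
        # The policy stays frozen while the batch is collected; the update is
        # applied once per round, which is the "batched" part of the setting.
        grads = [[0.0] * num_actions for _ in range(num_queries)]
        total_reward = 0.0
        for _ in range(batch_size):
            x = rng.randrange(num_queries)                # sample a query
            probs = softmax(logits[x])
            a = rng.choices(range(num_actions), weights=probs)[0]  # one rollout
            r = 1.0 if a == correct[x] else 0.0           # outcome reward
            total_reward += r
            # REINFORCE term r * grad log pi(a | x), with no baseline and no
            # advantage normalization.
            for k in range(num_actions):
                indicator = 1.0 if k == a else 0.0
                grads[x][k] += r * (indicator - probs[k])
        for x in range(num_queries):
            for k in range(num_actions):
                logits[x][k] += lr * grads[x][k] / batch_size
        mean_rewards.append(total_reward / batch_size)
    return logits, correct, mean_rewards


if __name__ == "__main__":
    _, _, history = train_minimal_baseline()
    print("mean reward, first 20 rounds:", sum(history[:20]) / 20)
    print("mean reward, last 20 rounds:", sum(history[-20:]) / 20)
```

Starting from a loop like this, a design choice such as an advantage baseline or several rollouts per query could be switched on one at a time to observe its marginal effect, which is the kind of disentangling the abstract describes.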

0 Citations

0 Influential

25.5 Altmetric

127.5 Score
