2601.14599v1 Jan 21, 2026 cs.LG

Rethinking Reinforcement fine-tuning of LLMs: A Multi-armed Bandit Learning Perspective

Hong Xie

Citations: 6

h-index: 2

Tao Tan

Citations: 10

h-index: 2

Jianyu Han

Citations: 97

h-index: 1

Xiao Hu

Citations: 23

h-index: 1

Defu Lian

Citations: 14,140

h-index: 54

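Many heuristic methods have been proposed to optimize the reinforcement fine-tuning of LLMs. However, inconsistent claims are sometimes made, which makes the field hard to navigate. Given this situation, two fundamental questions remain poorly understood: 1) What role does each optimization choice play? 2) Which factors are the bottlenecks? This paper aims to provide insight into these questions, and in doing so it must deal with several entangled confounding factors in the fine-tuning process. To address this, we propose a bottom-up experimental pipeline. The bottom layer is a minimalist configuration: one training example, one rollout per round, and the reward used directly as the learning signal without a designed advantage function. This minimalist configuration connects to multi-armed bandit learning with an extremely large discrete action space, which supplies theory to corroborate the experimental findings. The upper stages of the pipeline extend the minimalist configuration layer by layer, examining the role of each design choice. Experimental results on three LLMs and two reasoning datasets not only offer a new understanding of the design choices but also yield insights essential for advancing the field.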

Original Abstract

A large number of heuristics have been proposed to optimize the reinforcement fine-tuning of LLMs. However, inconsistent claims are made from time to time, making this area elusive. Reflecting on this situation, two fundamental questions still lack a clear understanding: 1) what is the role of each optimization choice? 2) which ones are the bottlenecks? This paper aims to shed light on them, and it faces the challenge of several entangled confounding factors in the fine-tuning process. To tackle this challenge, we propose a bottom-up experiment pipeline. The bottom layer is composed of a minimalist configuration: one training example, one rollout per round, and the reward serving directly as the learning signal without advantage function design. This minimalist configuration connects to multi-armed bandit learning with an extremely large discrete action space, which offers theories to corroborate the experimental findings. The upward procedure of the experiment pipeline expands the minimalist configuration layer by layer, examining the role of each design choice. Experimental results on three LLMs and two reasoning datasets not only reveal a new understanding of the design choices but also yield essential insights to shape the area.
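
As a rough illustration of the minimalist configuration described in the abstract, the sketch below runs a softmax policy-gradient bandit in which one action is sampled per round and the raw reward, with no baseline or advantage function, scales the update. It is a toy stand-in and not the authors' implementation: the 50 arms replace the extremely large space of possible LLM responses, the single "correct" arm plays the role of the one training example's correct answer, and every name and hyperparameter (`run_minimal_bandit`, `num_arms`, `correct_arm`, `lr`) is a hypothetical choice made for this example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_minimal_bandit(num_arms=50, correct_arm=7, rounds=5000, lr=0.5, seed=0):
    """Toy analogue of the minimalist setup (all parameters are assumptions).

    One fixed problem, one sampled action ("rollout") per round, and the
    raw 0/1 reward used directly as the learning signal.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_arms  # uniform initial policy
    for _ in range(rounds):
        probs = softmax(logits)
        # Single rollout: sample one arm from the current policy.
        arm = rng.choices(range(num_arms), weights=probs, k=1)[0]
        # Binary reward: 1 only if the sampled arm is the correct one.
        reward = 1.0 if arm == correct_arm else 0.0
        # REINFORCE step with the raw reward (no baseline, no advantage):
        # d/d logit_j of log pi(arm) = 1[j == arm] - probs[j].
        for j in range(num_arms):
            indicator = 1.0 if j == arm else 0.0
            logits[j] += lr * reward * (indicator - probs[j])
    return softmax(logits)


final_probs = run_minimal_bandit()
print(f"P(correct arm) after training: {final_probs[7]:.3f}")
```

With a 0/1 reward the policy changes only in rounds where the correct arm happens to be sampled, so early progress depends on how much probability the initial policy already places on it. This is one simple way to see why the size of the action space matters in the bandit view.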

0 Citations

0 Influential

27 Altmetric

135.0 Score
