2602.03806v1 Feb 03, 2026 cs.LG

Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

Huan Sun (Citations: 52, h-index: 4)

Dongdong Chen (Citations: 178, h-index: 5)

Ziru Chen (Citations: 21, h-index: 2)

Ru Jin (Citations: 13, h-index: 2)

Ying-Chang Liang (Citations: 34, h-index: 2)

Yujia Xie (Citations: 10, h-index: 2)

Original Abstract

Recently, there has been significant research interest in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinder wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories, which serve as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single-step code generation. Cobalt outperforms two multi-turn online RL baselines based on GRPO and VeRPO, and substantially improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 points on LiveCodeBench, respectively. We also analyze LLMs' in-context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate that Cobalt is a promising solution for iterative decision-making tasks like multi-turn code generation. Our code and data are available at https://github.com/OSU-NLP-Group/cobalt.
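
The data-preparation step described in the abstract (collect trajectories from a reference LLM, then divide them into partial trajectories used as contextual prompts) can be illustrated with a small Python sketch. This is not the authors' implementation: the names `Turn`, `BanditPrompt`, `split_into_partial_trajectories`, and `render_prompt` are hypothetical, the prompt layout is invented for illustration, and the choice to use every prefix of length 0 to T-1 is an assumption about how partial trajectories are formed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Turn:
    """One step of a reference trajectory (hypothetical structure)."""
    code: str      # program the reference LLM produced at this turn
    feedback: str  # feedback observed afterwards, e.g. failing test output


@dataclass(frozen=True)
class BanditPrompt:
    """A partial trajectory used as the context of a single-step bandit problem."""
    problem: str
    history: Tuple[Turn, ...]


def split_into_partial_trajectories(problem: str, turns: List[Turn]) -> List[BanditPrompt]:
    # Assumption: each prefix of length 0..T-1 becomes one contextual prompt,
    # so the policy is asked to produce the attempt at every position of the
    # original trajectory, given what the reference LLM did before it.
    return [BanditPrompt(problem, tuple(turns[:k])) for k in range(len(turns))]


def render_prompt(prompt: BanditPrompt) -> str:
    # Illustrative text layout only; a real system would use its own template.
    parts = [f"Problem:\n{prompt.problem}"]
    for i, turn in enumerate(prompt.history, start=1):
        parts.append(f"Attempt {i}:\n{turn.code}")
        parts.append(f"Feedback {i}:\n{turn.feedback}")
    parts.append("Write the next attempt:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    reference_turns = [
        Turn("def add(a, b): return a - b", "test_add failed: expected 3, got -1"),
        Turn("def add(a, b): return a + b", "all tests passed"),
    ]
    for p in split_into_partial_trajectories("Implement add(a, b).", reference_turns):
        print(render_prompt(p))
        print("-" * 40)
```

During online bandit learning, each rendered prompt would be completed by the policy with a single code generation and scored by a reward (for example, test outcomes), so no multi-turn rollout is needed at training time. The reward and the policy update are omitted here because the abstract does not specify them.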

2 Citations

0 Influential

34.01292546497 Altmetric

172.1 Score
