2602.04089v1 Feb 03, 2026 cs.AI

Scaling the Online Learning Capability of LLMs via Cross-Episode Meta-Reinforcement Learning

Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL

Xiaofeng Lin (Citations: 8, h-index: 2)

Sirou Zhu (Citations: 76, h-index: 4)

Yilei Chen (Citations: 10, h-index: 2)

Mingyu Chen (Citations: 27, h-index: 3)

Hejian Sang (Citations: 45, h-index: 3)

I. Paschalidis (Citations: 6, h-index: 2)

Zhipeng Wang (Citations: 9, h-index: 2)

Aldo Pacchiano (Citations: 81, h-index: 3)

Xuezhou Zhang (Citations: 8, h-index: 2)

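Large language models (LLMs) perform strongly on static prediction and instruction-following problems, where all task-relevant information is provided upfront. However, many real-world decision-making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information gathering against exploitation over time. In-context learning enables adaptation without weight updates, but existing LLMs often struggle to reliably use in-context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi-task, multi-episode meta-reinforcement learning framework that trains LLMs to learn from interaction in context. After meta-training, a relatively small open-source model (Qwen3-14B) shows substantial in-context online learning ability in entirely unseen environments, matching the performance of GPT-5.2 and far outperforming standard RL fine-tuning. Additional scaling experiments show consistent gains with model size, suggesting significant potential for decision-making agents that can learn at inference time. Code to reproduce the paper's results is available at https://github.com/XiaofengLin7/ORBIT.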

Original Abstract

Large language models (LLMs) achieve strong performance when all task-relevant information is available upfront, as in static prediction and instruction-following problems. However, many real-world decision-making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information collection and exploitation over time. While in-context learning enables adaptation without weight updates, existing LLMs often struggle to reliably leverage in-context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi-task, multi-episode meta-reinforcement learning framework that trains LLMs to learn from interaction in context. After meta-training, a relatively small open-source model (Qwen3-14B) demonstrates substantially improved in-context online learning on entirely unseen environments, matching the performance of GPT-5.2 and outperforming standard RL fine-tuning by a large margin. Scaling experiments further reveal consistent gains with model size, suggesting significant headroom for learn-at-inference-time decision-making agents. Code reproducing the results in the paper can be found at https://github.com/XiaofengLin7/ORBIT.
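The multi-episode, in-context setting described in the abstract can be illustrated with a toy loop. The sketch below is not ORBIT and is not taken from the paper or its repository. It assumes a hypothetical two-armed bandit (`BanditTask`), and it uses a simple epsilon-greedy rule (`context_policy`) as a stand-in for an LLM that reads its interaction history from the context. All names and parameters are illustrative. The point it shows is that the history persists across episodes while no weights change, so any improvement in later episodes comes from the context alone.

```python
import random
from dataclasses import dataclass


@dataclass
class BanditTask:
    """Hypothetical toy environment with Bernoulli-reward arms."""
    arm_means: list
    rng: random.Random

    def step(self, arm: int) -> float:
        # Reward is 1.0 with the chosen arm's success probability, else 0.0.
        return 1.0 if self.rng.random() < self.arm_means[arm] else 0.0


def context_policy(context, n_arms, rng, epsilon):
    """Stand-in for an LLM: picks an arm using only the in-context history."""
    tried = {arm for arm, _ in context}
    untried = [arm for arm in range(n_arms) if arm not in tried]
    if untried:
        return untried[0]              # information gathering: try each arm once
    if rng.random() < epsilon:
        return rng.randrange(n_arms)   # occasional random exploration
    totals, counts = [0.0] * n_arms, [0] * n_arms
    for arm, reward in context:
        totals[arm] += reward
        counts[arm] += 1
    # Exploitation: choose the arm with the best empirical mean so far.
    return max(range(n_arms), key=lambda a: totals[a] / counts[a])


def run_trial(task, n_arms, n_episodes, steps_per_episode, rng, epsilon=0.1):
    """Run several episodes on one task; the context is never reset between them."""
    context = []            # (arm, reward) pairs shared across all episodes
    episode_returns = []
    for _ in range(n_episodes):
        episode_return = 0.0
        for _ in range(steps_per_episode):
            arm = context_policy(context, n_arms, rng, epsilon)
            reward = task.step(arm)
            context.append((arm, reward))
            episode_return += reward
        episode_returns.append(episode_return)
    return context, episode_returns


rng = random.Random(0)
task = BanditTask(arm_means=[0.0, 1.0], rng=rng)
context, episode_returns = run_trial(
    task, n_arms=2, n_episodes=3, steps_per_episode=3, rng=rng, epsilon=0.0
)
print(episode_returns)
```

In a meta-RL setup of the kind the abstract describes, a natural training signal would be the return summed over all episodes of a trial, so that early exploration is rewarded when it pays off later. That reading is an assumption here, and the sketch only computes the per-episode returns.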

0 Citations

0 Influential

30.047189562171 Altmetric

150.2 Score
