2604.07809v1 Apr 09, 2026 cs.LG

PolicyLong: Towards On-Policy Context Extension

Ziyang Chen (Citations: 33, h-index: 3)
Xing Wu (Citations: 77, h-index: 4)
Junlong Jia (Citations: 8, h-index: 2)
Chaochen Gao (Citations: 354, h-index: 8)
Songlin Hu (Citations: 106, h-index: 6)
Feng Zhang (Citations: 14, h-index: 2)
Ting-Ting Yu (Citations: 38, h-index: 3)

Abstract

Extending LLM context windows is hindered by scarce high-quality long-context data. Recent methods synthesize data with genuine long-range dependencies via information-theoretic verification, selecting contexts that reduce a base model's predictive entropy. However, their single-pass offline construction with a fixed model creates a fundamental off-policy gap: the static screening landscape misaligns with the model's evolving capabilities, causing the training distribution to drift. We propose PolicyLong, shifting data construction towards a dynamic on-policy paradigm. By iteratively re-executing data screening (entropy computation, retrieval, and verification) using the current model, PolicyLong ensures the training distribution tracks evolving capabilities, yielding an emergent self-curriculum. Crucially, both positive and hard negative contexts derive from the current model's entropy landscape, co-evolving what the model learns to exploit and resist. Experiments on RULER, HELMET, and LongBench-v2 (Qwen2.5-3B) show PolicyLong consistently outperforms EntropyLong and NExtLong, with gains growing at longer contexts (e.g., +2.54 at 128K on RULER), confirming the value of on-policy data evolution.
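
The abstract describes the screening loop only at a high level, so the following is a minimal sketch of the idea under stated assumptions, not the authors' implementation. It shows entropy-based verification (a candidate context counts as positive if prepending it lowers the model's next-token entropy by at least a threshold) and the on-policy part (screening is re-run with the updated model each round). The retrieval step is omitted: candidates are simply given. `ToyModel`, `toy_train_step`, the clue tokens, and the `threshold` value are all hypothetical stand-ins, and treating every rejected candidate as a negative is a simplification, since the abstract does not say how hard negatives are chosen.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_reduction(model, candidate, local_context):
    """How much prepending `candidate` lowers the model's predictive entropy."""
    h_without = entropy(model(local_context))
    h_with = entropy(model(list(candidate) + list(local_context)))
    return h_without - h_with


def screen(model, candidates, local_context, threshold):
    """Split candidates into positives (enough entropy reduction) and negatives."""
    positives, negatives = [], []
    for cand in candidates:
        gain = entropy_reduction(model, cand, local_context)
        (positives if gain >= threshold else negatives).append(cand)
    return positives, negatives


def on_policy_rounds(model, train_step, candidates, local_context, threshold, rounds):
    """Re-run screening with the *current* model before every training round."""
    history = []
    for _ in range(rounds):
        positives, negatives = screen(model, candidates, local_context, threshold)
        history.append((positives, negatives))
        model = train_step(model, positives, negatives)
    return model, history


class ToyModel:
    """Hypothetical stand-in for an LLM: confident only if it sees a clue it can use."""

    def __init__(self, usable_clues):
        self.usable_clues = set(usable_clues)

    def __call__(self, tokens):
        if any(t in self.usable_clues for t in tokens):
            return [0.9, 0.1]  # low-entropy prediction
        return [0.5, 0.5]      # maximally uncertain over two tokens


def toy_train_step(model, positives, negatives):
    """Hypothetical update: training on any positives unlocks a harder clue."""
    if positives:
        return ToyModel(model.usable_clues | {"hard_clue"})
    return model


candidates = [["easy_clue"], ["hard_clue"], ["noise"]]
local_context = ["the", "answer", "is"]
final_model, history = on_policy_rounds(
    ToyModel({"easy_clue"}), toy_train_step, candidates, local_context,
    threshold=0.1, rounds=2,
)
for i, (pos, neg) in enumerate(history, start=1):
    print(f"round {i}: positives={pos} negatives={neg}")
```

In this toy run the set of positives changes between rounds because the screening model changes. A single offline pass with the initial model would have kept only the first round's selection, which is the off-policy gap the abstract refers to.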

Citations: 0 | Influential: 0 | Altmetric: 4 | Score: 20.0
