2604.17535v1 Apr 19, 2026 cs.CL

OPSDL: On-Policy Self-Distillation for Long-Context Language Models

Chun Kang (Citations: 29, h-index: 3)

Tian Pan (Citations: 21, h-index: 2)

Xinsen Zhang (Citations: 122, h-index: 6)

Runkai Yang (Citations: 300, h-index: 4)

Xuehan Xiong (Citations: 439, h-index: 4)

Jing Gu (Citations: 103, h-index: 4)

Zhe Ding (Citations: 19, h-index: 3)

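Extending the effective context length of large language models (LLMs) is an important challenge for real-world applications. Recent post-training methods have made some progress on long-context scaling, but they either depend on high-quality supervision data or use sparse sequence-level rewards that lead to unstable and inefficient optimization. This work proposes OPSDL, an on-policy self-distillation method that improves the long-context capability of LLMs. Unlike other self-distillation methods, OPSDL uses the model's inherently strong short-context capability as a self-teacher to supervise the model's own generation in long-context scenarios. The model first generates a response conditioned on the full long context; the self-teacher then provides token-level supervision through point-wise reverse KL divergence, conditioned on the relevant extracted short context. This dense token-level signal encourages faithful use of relevant evidence and mitigates hallucinations caused by irrelevant context. Evaluated on long-context benchmarks across models ranging from 7B to 32B parameters, OPSDL yields consistent and substantial improvements across context lengths and outperforms standard post-training methods such as SFT and DPO with higher sample efficiency. Notably, these gains come without degrading general short-context performance. These results show that OPSDL is a scalable and stable approach to long-context learning.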

Original Abstract

Extending the effective context length of large language models (LLMs) remains a central challenge for real-world applications. While recent post-training methods have made progress in long-context scaling, they either rely on high-quality supervision data or sparse sequence-level rewards, leading to unstable and inefficient optimization. We propose OPSDL, an On-Policy Self-Distillation method for enhancing the Long-context capabilities of LLMs. Unlike other recent self-distillation methods that inject privileged information and rely on the model's in-context learning ability to act as a teacher, OPSDL leverages the model's own inherently strong short-context capability as a self-teacher to supervise its own generation in long-context scenarios. The model first generates responses conditioned on the full long-context, then the self-teacher provides per-token supervision signals via point-wise reverse KL divergence under the relevant extracted short-context. This dense token-level signal encourages faithful use of relevant evidence and mitigates hallucinations induced by irrelevant context. We evaluate OPSDL on long-context benchmarks across a range of models from 7B to 32B parameters. Results show consistent and substantial improvements across varying context lengths, outperforming standard post-training approaches such as SFT and DPO with higher sample efficiency. Notably, these gains are achieved without degrading general short-context performance. These findings highlight the effectiveness of OPSDL as a scalable and stable approach for long-context learning.
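
The per-token signal described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that "point-wise reverse KL" means the single-sample estimate log p_student(y_t) - log p_teacher(y_t), evaluated at each token y_t the model sampled. The function names, array shapes, and random toy logits are all hypothetical stand-ins for real model outputs.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def pointwise_reverse_kl(student_logits, teacher_logits, sampled_ids):
    """Per-token reverse-KL estimate at the tokens the student sampled.

    student_logits: (T, V) logits conditioned on the full long context.
    teacher_logits: (T, V) logits from the same model conditioned on the
        extracted short context (the self-teacher, treated as a constant).
    sampled_ids: (T,) token ids of the on-policy response.
    Returns a (T,) array of log p_student(y_t) - log p_teacher(y_t).
    """
    positions = np.arange(len(sampled_ids))
    student_logp = log_softmax(student_logits)[positions, sampled_ids]
    teacher_logp = log_softmax(teacher_logits)[positions, sampled_ids]
    return student_logp - teacher_logp

# Toy data (assumption): random logits stand in for two forward passes.
rng = np.random.default_rng(0)
T, V = 5, 11  # response length and vocabulary size, both hypothetical
student_logits = rng.normal(size=(T, V))
teacher_logits = rng.normal(size=(T, V))

# On-policy: the response tokens are sampled from the student distribution.
student_probs = np.exp(log_softmax(student_logits))
sampled_ids = np.array([rng.choice(V, p=p) for p in student_probs])

per_token = pointwise_reverse_kl(student_logits, teacher_logits, sampled_ids)
loss = per_token.mean()  # dense signal: one term per generated token
print(per_token, loss)
```

Individual per-token values can be negative, since each is a one-sample estimate; only their expectation under the student distribution is the non-negative reverse KL. In a training setup the teacher term would carry no gradient, so only the long-context pass is updated.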

12 Citations

1 Influential

3 Altmetric

29.0 Score
