2604.08468v1 Apr 09, 2026 cs.LG

TTVS: Improving Self-Exploring Reinforcement Learning via Test-Time Variational Synthesis

TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis

Sikai Bai (Citations: 115, h-index: 5)

Yongjiang Liu (Citations: 22, h-index: 3)

Songyue Guo (Citations: 4, h-index: 2)

Haoxi Li (Citations: 75, h-index: 5)

Jie Zhang (Citations: 60, h-index: 5)

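Large reasoning models (LRMs) trained with reinforcement learning with verifiable rewards (RLVR) have made significant progress, but this paradigm has a fundamental limitation in test-time adaptation, that is, in specialized or novel domains where supervised training data is hard or expensive to obtain. Existing test-time methods offer a potential solution, but because they learn from static query sets, they risk overfitting to textual patterns. To address this problem, this work introduces Test-Time Variational Synthesis (TTVS), a new framework that dynamically augments training data from unlabeled test queries so that LRMs can improve themselves. TTVS consists of two complementary modules. (1) Online Variational Synthesis transforms static test queries into a dynamic stream of diverse, semantically equivalent variations, pushing the model to learn the underlying problem logic rather than surface patterns. (2) Test-time Hybrid Exploration balances accuracy-driven exploitation with consistency-driven exploration across the synthesized variants. In extensive experiments, TTVS performed strongly across eight model architectures. Notably, using only unlabeled test-time data, TTVS outperformed not only existing test-time adaptation methods but also state-of-the-art supervised RL-based techniques trained on large amounts of high-quality labeled data.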

Original Abstract

Despite significant advances in Large Reasoning Models (LRMs) driven by reinforcement learning with verifiable rewards (RLVR), this paradigm is fundamentally limited in specialized or novel domains where such supervision is prohibitively expensive or unavailable, posing a key challenge for test-time adaptation. While existing test-time methods offer a potential solution, they are constrained by learning from static query sets, risking overfitting to textual patterns. To address this gap, we introduce Test-Time Variational Synthesis (TTVS), a novel framework that enables LRMs to self-evolve by dynamically augmenting the training stream from unlabeled test queries. TTVS comprises two synergistic modules: (1) Online Variational Synthesis, which transforms static test queries into a dynamic stream of diverse, semantically-equivalent variations, enforcing the model to learn underlying problem logic rather than superficial patterns; (2) Test-time Hybrid Exploration, which balances accuracy-driven exploitation with consistency-driven exploration across synthetic variants. Extensive experiments show TTVS yields superior performance across eight model architectures. Notably, using only unlabeled test-time data, TTVS not only surpasses other test-time adaptation methods but also outperforms state-of-the-art supervised RL-based techniques trained on vast, high-quality labeled data.
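
The abstract names the two modules but does not give a reward formula, so the following is only an illustrative sketch and not the authors' method. It assumes that "accuracy-driven exploitation" can be approximated by agreement with a majority-vote pseudo-label on the original query, and that "consistency-driven exploration" can be approximated by agreement with the consensus answers of the synthesized variants. The function names, the `alpha` mixing weight, and the reward shape are all hypothetical.

```python
from collections import Counter
from typing import Dict, List


def majority_answer(answers: List[str]) -> str:
    """Return the most frequent answer (ties go to the earliest-seen answer)."""
    return Counter(answers).most_common(1)[0][0]


def hybrid_rewards(
    samples: Dict[str, List[str]],
    original: str,
    alpha: float = 0.5,  # hypothetical weight between the two reward terms
) -> Dict[str, List[float]]:
    """Score sampled answers for an original query and its variants.

    `samples` maps each query text (the original plus its synthesized
    variants) to the final answers sampled from the model for that query.
    No ground-truth labels are used anywhere.
    """
    # Assumed exploitation signal: majority vote on the original query.
    pseudo_label = majority_answer(samples[original])
    # Assumed exploration signal: the consensus answer of every query version.
    consensus = [majority_answer(answers) for answers in samples.values()]

    rewards: Dict[str, List[float]] = {}
    for query, answers in samples.items():
        scored = []
        for answer in answers:
            accuracy = 1.0 if answer == pseudo_label else 0.0
            consistency = sum(c == answer for c in consensus) / len(consensus)
            scored.append(alpha * accuracy + (1.0 - alpha) * consistency)
        rewards[query] = scored
    return rewards


if __name__ == "__main__":
    demo = {
        "original": ["4", "4", "5"],
        "variant_1": ["4", "5", "5"],
        "variant_2": ["4", "4", "4"],
    }
    for query, scores in hybrid_rewards(demo, original="original").items():
        print(query, [round(s, 3) for s in scores])
```

In this toy setup, an answer scores highly only if it matches the pseudo-label and also agrees with most variants' consensus. That is one simple way to penalize answers that depend on the surface wording of a single query.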

2 Citations

0 Influential

2.5 Altmetric

14.5 Score
