2604.13515v1 Apr 15, 2026 cs.LG

자동 형식화 과정에서 SFT-GRPO 데이터 중복을 사후 훈련 하이퍼파라미터로 활용하는 연구

SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization

A. Lyu

Citations: 3

h-index: 1

Kasey Zhang

Citations: 0

h-index: 0

Xiaole Su

Citations: 0

h-index: 0

지도 미세 조정(SFT) 후 그룹 상대 정책 최적화(GRPO)는 일반적인 사후 훈련 방법입니다. 본 연구에서는 Lean 4 자동 형식화를 위해 학습된 Qwen3-8B 모델(사고 능력 비활성화)을 사용하여 SFT-GRPO 데이터 중복에 대한 통제된 분석을 수행했습니다. 분석은 훈련 방식에만 차이가 있는 6가지 조건에서 진행되었으며, 조건은 다음과 같습니다: 기본 모델, SFT만 적용, GRPO만 적용, 그리고 SFT와 GRPO를 결합한 세 가지 구성 (GRPO 프롬프트가 SFT 데이터와 각각 0%, 30%, 100% 중복되는 경우). SFT와 GRPO 데이터를 분리하여 사용하는 것이 데이터 중복을 사용하는 것보다 일관되게 더 나은 성능을 보였으며, 이는 추가적인 계산 비용 없이 달성되었습니다. Gaokao-Formal 및 PutnamBench 데이터셋에서 컴파일 성공률 및 LLM 판별 기준에 따른 의미론적 정확도를 평가한 결과, 데이터 중복 비율이 낮을수록 컴파일 및 의미론적 정확도가 높아지는 경향이 있었습니다. 데이터 중복 비율이 0%인 경우, GRPO는 SFT만 적용했을 때보다 Gaokao 데이터셋에서 10.4%p 더 높은 의미론적 정확도를 보였지만, 데이터 중복 비율이 100%인 경우에는 두 지표 모두 변화가 없어 GRPO 단계가 사실상 불필요해졌습니다. 또한, 두 가지 지표를 함께 평가한 결과, 가장 높은 컴파일 성공률을 보이는 모델에서도 30%p 이상의 컴파일-의미론적 격차가 존재한다는 것을 확인했으며, 이는 컴파일만 평가할 경우 드러나지 않는 문제입니다. 본 연구는 SFT-GRPO 데이터 중복을 사후 훈련 하이퍼파라미터로 활용하는 첫 번째 통제된 연구이며, 훈련 단계 간 데이터 공유 정도에 따라 모델의 동작 방식이 어떻게 달라지는지를 보여줍니다.

Original Abstract

Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0 percent, 30 percent, or 100 percent of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass at k and semantic pass at k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0 percent overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100 percent overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile semantic gaps exceeding 30 percentage points for the highest compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.

0 Citations

0 Influential

0.5 Altmetric

2.5 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!