2601.14888v1 Jan 21, 2026 cs.LG

Why Is Quantization-Aware Training Effective for Reasoning LLMs? A Systematic Study

What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study

Xiaobo Xia (Citations: 258, h-index: 6)
Manyi Zhang (Citations: 12, h-index: 2)
Haoli Bai (Citations: 324, h-index: 7)
Xianzhi Yu (Citations: 7, h-index: 1)
Keyu Lv (Citations: 1, h-index: 1)
Jingchen Ni (Citations: 2, h-index: 1)
Shannan Yan (Citations: 8, h-index: 1)
Lu Hou (Citations: 261, h-index: 6)
Chun Yuan (Citations: 144, h-index: 4)

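Reasoning models perform well on complex tasks such as coding and mathematics, but their inference is often slow and token-inefficient. Post-training quantization (PTQ), a common way to improve inference efficiency, typically incurs accuracy degradation, and the loss is especially pronounced on reasoning tasks in low-bit settings. This work presents a systematic empirical study of quantization-aware training (QAT) for reasoning models. The main findings are: (1) for reasoning models trained with either supervised fine-tuning or reinforcement learning, knowledge distillation is a robust training objective; (2) PTQ is a strong initialization for QAT, improving accuracy while reducing training cost; (3) given a viable cold start, reinforcement learning remains feasible on quantized models and yields additional gains; (4) aligning the PTQ calibration domain with the QAT training domain accelerates convergence and often improves final accuracy. Finally, the authors consolidate these findings into an optimized workflow (Reasoning-QAT) and show that it consistently outperforms state-of-the-art PTQ methods across multiple LLM backbones and reasoning datasets. For example, on Qwen3-0.6B it surpasses GPTQ by 44.53% on MATH-500, and it consistently recovers performance in the 2-bit regime.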

Original Abstract

Reasoning models excel at complex tasks such as coding and mathematics, yet their inference is often slow and token-inefficient. To improve the inference efficiency, post-training quantization (PTQ) usually comes with the cost of large accuracy drops, especially for reasoning tasks under low-bit settings. In this study, we present a systematic empirical study of quantization-aware training (QAT) for reasoning models. Our key findings include: (1) Knowledge distillation is a robust objective for reasoning models trained via either supervised fine-tuning or reinforcement learning; (2) PTQ provides a strong initialization for QAT, improving accuracy while reducing training cost; (3) Reinforcement learning remains feasible for quantized models given a viable cold start and yields additional gains; and (4) Aligning the PTQ calibration domain with the QAT training domain accelerates convergence and often improves the final accuracy. Finally, we consolidate these findings into an optimized workflow (Reasoning-QAT), and show that it consistently outperforms state-of-the-art PTQ methods across multiple LLM backbones and reasoning datasets. For instance, on Qwen3-0.6B, it surpasses GPTQ by 44.53% on MATH-500 and consistently recovers performance in the 2-bit regime.
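The abstract names knowledge distillation as the QAT objective but does not specify the quantizer or the exact loss. As a rough illustration only, the sketch below assumes symmetric per-tensor round-to-nearest quantization and a KL-divergence loss between the full-precision (teacher) and quantized (student) output distributions. The function names `fake_quantize`, `kd_loss`, and `logits`, and the toy weights, are hypothetical and not taken from the paper; gradient handling (typically a straight-through estimator in QAT) is omitted.

```python
import math


def fake_quantize(weights, bits):
    """Quantize then dequantize weights (assumed scheme: symmetric, per-tensor)."""
    qmax = 2 ** (bits - 1) - 1              # 1 for 2-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    out = []
    for w in weights:
        # Python's round() rounds exact halves to the nearest even integer.
        q = max(-qmax - 1, min(qmax, round(w / scale)))
        out.append(q * scale)
    return out


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) for a single token's output distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def logits(weight_rows, hidden):
    """Toy linear output head: one dot product per vocabulary entry."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight_rows]


# Toy example: a 3-token "vocabulary" and a 4-dimensional hidden state.
teacher_rows = [[0.5, -1.0, 0.2, 1.0],
                [0.9, 0.1, -0.7, 0.3],
                [-0.4, 0.8, 0.6, -0.2]]
hidden = [1.0, 0.5, -0.5, 2.0]

for bits in (4, 2):
    student_rows = [fake_quantize(row, bits) for row in teacher_rows]
    loss = kd_loss(logits(teacher_rows, hidden), logits(student_rows, hidden))
    print(f"{bits}-bit distillation loss: {loss:.4f}")
```

In an actual QAT loop this loss would be minimized with respect to the underlying full-precision weights, with the quantizer applied in the forward pass; the sketch only shows how the quantized student's outputs are compared against the teacher's.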

Citations: 1
Influential: 0
Altmetric: 3.5
Score: 18.5
