2601.06677v1 Jan 10, 2026 cs.LG

Plasticity vs. Rigidity: The Impact of LoRA Adapters on Reasoning in a Low-Budget Setting

Plasticity vs. Rigidity: The Impact of Low-Rank Adapters on Reasoning on a Micro-Budget

Zohaib Khan

Citations: 20

h-index: 3

Omer Tafveez

Citations: 11

h-index: 2

Zoha Hayat Bhatti

Citations: 4

h-index: 1

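Recent advances in mathematical reasoning generally rely on large-scale models, but it remains an open question whether strong reasoning can be induced under extreme constraints, for example in small language models with at most 1.5 billion parameters. This study investigates the question by training models for under 24 hours on a single A40 GPU (48GB) using Reinforcement Learning with Verifiable Rewards (RLVR) and Low-Rank Adaptation (LoRA). The results show that success in this "micro-budget" regime depends heavily on the interaction between adapter capacity and model initialization. Low-rank adapters (r=8) fail to capture the complex optimization dynamics of reasoning, whereas high-rank adapters (r=256) unlock considerable plasticity in standard instruction-tuned models. The best result reached 40.0% Pass@1 on AIME 24, an absolute gain of 11.1 percentage points over the baseline, and raised Pass@16 to 70.0%, demonstrating strong exploration capability. This plasticity is not universal, however. Instruction-tuned models used the budget to lengthen their chain-of-thought and maximize reward, while models specialized for mathematics suffered performance degradation. This suggests that noisy, budget-limited RL updates can act as destructive interference for models that already sit near a task-specific optimum.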
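To give a rough sense of the capacity gap between the two adapter ranks, here is a minimal sketch. LoRA replaces the update to a d_out x d_in weight matrix with the product of a d_out x r matrix and an r x d_in matrix, so each adapted matrix trains r(d_in + d_out) parameters. The width of 1536 and the function name `lora_trainable_params` are illustrative assumptions, not values or code taken from the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The adapter consists of B (d_out x rank) and A (rank x d_in).
    """
    return rank * (d_in + d_out)


# Hypothetical square projection; the width is an assumption for illustration.
WIDTH = 1536
full = WIDTH * WIDTH                           # parameters in the frozen matrix
low = lora_trainable_params(WIDTH, WIDTH, 8)   # r = 8
high = lora_trainable_params(WIDTH, WIDTH, 256)  # r = 256

print(f"r=8:   {low} params ({low / full:.2%} of the full matrix)")
print(f"r=256: {high} params ({high / full:.2%} of the full matrix)")
print(f"capacity ratio r=256 / r=8: {high // low}x")
```

Under this assumption the r=256 adapter trains 32 times as many parameters per matrix as the r=8 adapter, which is the kind of gap the capacity argument refers to.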

Original Abstract

Recent advances in mathematical reasoning typically rely on massive scale, yet the question remains: can strong reasoning capabilities be induced in small language models ($\leq1.5\text{B}$) under extreme constraints? We investigate this by training models on a single A40 GPU (48GB) for under 24 hours using Reinforcement Learning with Verifiable Rewards (RLVR) and Low-Rank Adaptation (LoRA). We find that the success of this ``micro-budget'' regime depends critically on the interplay between adapter capacity and model initialization. While low-rank adapters ($r=8$) consistently fail to capture the complex optimization dynamics of reasoning, high-rank adapters ($r=256$) unlock significant plasticity in standard instruction-tuned models. Our best result achieved an impressive 40.0\% Pass@1 on AIME 24 (an 11.1\% absolute improvement over baseline) and pushed Pass@16 to 70.0\%, demonstrating robust exploration capabilities. However, this plasticity is not universal: while instruction-tuned models utilized the budget to elongate their chain-of-thought and maximize reward, heavily math-aligned models suffered performance collapse, suggesting that noisy, low-budget RL updates can act as destructive interference for models already residing near a task-specific optimum.
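The abstract reports Pass@1 and Pass@16 but does not say how they were computed. As a hedged illustration only, the sketch below uses the commonly used unbiased pass@k estimator: given n sampled answers per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The function names and the sample counts in the usage lines are hypothetical and are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem.

    n: number of sampled answers, c: number of correct ones, k: budget.
    Returns the probability that at least one of k answers drawn without
    replacement from the n samples is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a benchmark, one correct-count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts for four problems with 16 samples each.
counts = [0, 4, 16, 1]
print(mean_pass_at_k(counts, n=16, k=1))
print(mean_pass_at_k(counts, n=16, k=16))
```

With k equal to n, the estimator reduces to the fraction of problems solved at least once, which is why Pass@16 is often read as a measure of exploration.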

1 Citations

0 Influential

1.5 Altmetric

8.5 Score
