2603.04124v1 Mar 04, 2026 cs.AI

BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning

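Can reinforcement learning with hard, verifiable rewards teach a compact language model to reason about physics, or does the model mainly learn pattern matching that lands on correct answers? We trained a 1.5B-parameter reasoning model on beam statics problems using parameter-efficient reinforcement learning with verifiable rewards (RLVR). Training used binary correctness rewards from a symbolic solver and no teacher-generated reasoning traces. The best BeamPERL checkpoint improves Pass@1 by 66.7% over the base model. The learned competence is anisotropic, however: the model generalizes compositionally (for example, to more loads) but fails under topological shifts such as moved supports. Intermediate checkpoints show the strongest reasoning; continued optimization degrades robustness while reward is maintained. These results expose a key limitation of outcome-level alignment: reinforcement learning with exact physics rewards induces procedural solution templates rather than internalization of the governing equations. A precise reward signal, even an analytically exact one, does not guarantee transferable physical reasoning. Our findings suggest that verifiable rewards may need to be paired with structured reasoning scaffolding to move beyond template matching toward robust scientific reasoning.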

Original Abstract

Can reinforcement learning with hard, verifiable rewards teach a compact language model to reason about physics, or does it primarily learn to pattern-match toward correct answers? We study this question by training a 1.5B-parameter reasoning model on beam statics, a classic engineering problem, using parameter-efficient RLVR with binary correctness rewards from symbolic solvers, without teacher-generated reasoning traces. The best BeamPERL checkpoint achieves a 66.7% improvement in Pass@1 over the base model. However, the learned competence is anisotropic: the model generalizes compositionally (more loads) but fails under topological shifts (moved supports) that require the same equilibrium equations. Intermediate checkpoints yield the strongest reasoning, while continued optimization degrades robustness while maintaining reward. These findings reveal a key limitation of outcome-level alignment: reinforcement learning with exact physics rewards induces procedural solution templates rather than internalization of governing equations. The precision of the reward signal - even when analytically exact - does not by itself guarantee transferable physical reasoning. Our results suggest that verifiable rewards may need to be paired with structured reasoning scaffolding to move beyond template matching toward robust scientific reasoning.
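To make the reward setup concrete, here is a minimal sketch of what a binary correctness reward for a beam statics answer could look like. It is an illustration only, not the paper's implementation: the abstract does not describe the solver, the answer format, or the tolerance, so the function names (`support_reactions`, `binary_reward`), the point-load-only beam model, and the tolerance values are all assumptions. A closed-form equilibrium calculation stands in for the symbolic solver.

```python
from math import isclose


def support_reactions(support_a, support_b, point_loads):
    """Reference reactions (R_a, R_b) for a beam on two simple supports.

    Hypothetical helper: `point_loads` is a list of (position, downward_force)
    pairs. Uses the two equilibrium equations, sum of moments about support A
    and sum of vertical forces, with upward reactions taken as positive.
    """
    span = support_b - support_a
    if span == 0:
        raise ValueError("supports must be at distinct positions")
    total_load = sum(force for _, force in point_loads)
    # Moment of the loads about support A (positive for loads to its right).
    moment_about_a = sum(force * (pos - support_a) for pos, force in point_loads)
    r_b = moment_about_a / span
    r_a = total_load - r_b
    return r_a, r_b


def binary_reward(predicted, reference, rel_tol=1e-3, abs_tol=1e-6):
    """Return 1.0 if every predicted value matches the reference, else 0.0.

    The tolerances are assumed values chosen for illustration.
    """
    if len(predicted) != len(reference):
        return 0.0
    matches = all(
        isclose(p, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for p, r in zip(predicted, reference)
    )
    return 1.0 if matches else 0.0


# Supports at the beam ends, one central load.
baseline = support_reactions(0.0, 10.0, [(5.0, 100.0)])
# Same equations, but the left support is moved inward and the load overhangs it.
moved = support_reactions(2.0, 10.0, [(0.0, 100.0)])
```

In this sketch the moved-support case calls the same function with different support positions, which mirrors the abstract's point that the topological shift requires the same equilibrium equations. The reward checks only the final values, so it cannot distinguish a derivation from equilibrium from a memorized solution template that happens to produce the right numbers.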
