2604.18161v1 Apr 20, 2026 cs.LG

Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?

Yutaka Matsuo

Citations: 1,565

h-index: 13

Ku Onoda

Citations: 0

h-index: 0

Paavo Parmas

Citations: 247

h-index: 6

Manato Yaguchi

Citations: 8

h-index: 2

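In policy gradient reinforcement learning, access to a differentiable model makes 1st-order gradient estimation possible, which can speed up learning compared with derivative-free 0th-order estimation. However, discontinuous dynamics introduce bias and degrade the effectiveness of 1st-order estimators. Prior work tried to address this bias by constructing a confidence interval around the REINFORCE 0th-order estimator and using those bounds to detect discontinuities. The REINFORCE estimator is very noisy, however, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the main obstacle and what minimal modifications are needed. First, we revisit the standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves stable performance and remains reliable with few samples. Second, on differentiable robotics control tasks, we present IVW-H, a method that applies per-step inverse-variance weighting. IVW-H stabilizes variance without explicit discontinuity detection and delivers strong results. These findings suggest that while estimator switching can improve stability in controlled settings, careful variance control often matters more in practical applications.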
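
The bias caused by discontinuities can be seen on a one-dimensional toy problem. The sketch below is an illustration only and is not taken from the paper: the step objective, the Gaussian perturbation, and the names `step`, `step_grad`, `estimate_gradients`, and `true_gradient` are all assumptions chosen for the example. It compares a 1st-order (pathwise) estimate with a 0th-order (likelihood-ratio, REINFORCE-style) estimate of the gradient of E[step(theta + sigma * eps)].

```python
import math
import random


def step(x):
    """Hypothetical discontinuous objective: 1 for x > 0, else 0."""
    return 1.0 if x > 0.0 else 0.0


def step_grad(x):
    """Pathwise derivative of step: zero almost everywhere."""
    return 0.0


def estimate_gradients(theta, sigma, n, seed=0):
    """Return (first_order, zeroth_order) estimates of
    d/dtheta E[step(theta + sigma * eps)] with eps ~ N(0, 1)."""
    rng = random.Random(seed)
    first, zeroth = 0.0, 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        x = theta + sigma * eps
        first += step_grad(x)            # 1st-order (reparameterization) sample
        zeroth += step(x) * eps / sigma  # 0th-order (likelihood-ratio) sample
    return first / n, zeroth / n


def true_gradient(theta, sigma):
    """E[step] equals the Gaussian CDF at theta / sigma, so the exact
    gradient is the Gaussian density divided by sigma."""
    z = theta / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


theta, sigma = 0.0, 1.0
fo, zo = estimate_gradients(theta, sigma, n=20000)
print("exact gradient:    ", true_gradient(theta, sigma))
print("1st-order estimate:", fo)
print("0th-order estimate:", zo)
```

In this toy setting the 1st-order estimate is exactly zero no matter how many samples are drawn, because the derivative of the step vanishes almost everywhere, while the 0th-order estimate is noisy but unbiased.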

Original Abstract

In policy gradient reinforcement learning, access to a differentiable model enables 1st-order gradient estimation that accelerates learning compared to relying solely on derivative-free 0th-order estimators. However, discontinuous dynamics cause bias and undermine the effectiveness of 1st-order estimators. Prior work addressed this bias by constructing a confidence interval around the REINFORCE 0th-order gradient estimator and using these bounds to detect discontinuities. However, the REINFORCE estimator is notoriously noisy, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the primary obstacle and what minimal fixes suffice. First, we re-examine standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, we present IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that while estimator switching improves robustness in controlled studies, careful variance control often dominates in practical deployments.
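
Inverse-variance weighting itself is a generic way to combine two unbiased estimates of the same quantity. The sketch below shows that generic idea on a smooth toy objective. It is not the paper's IVW-H, which the abstract describes as a per-step implementation and whose details are not given here. The quadratic objective, the independent noise draws for each estimator, and the names `f`, `df`, `sample_estimators`, and `inverse_variance_combine` are assumptions made for illustration.

```python
import random
import statistics


def f(x):
    """Hypothetical smooth objective."""
    return x * x


def df(x):
    """Analytic derivative of f, standing in for a differentiable model."""
    return 2.0 * x


def sample_estimators(theta, sigma, n, seed=0):
    """Draw per-sample 1st-order and 0th-order gradient estimates of
    d/dtheta E[f(theta + sigma * eps)], using independent noise for each."""
    rng = random.Random(seed)
    first, zeroth = [], []
    for _ in range(n):
        eps_a = rng.gauss(0.0, 1.0)
        eps_b = rng.gauss(0.0, 1.0)
        first.append(df(theta + sigma * eps_a))                    # pathwise
        zeroth.append(f(theta + sigma * eps_b) * eps_b / sigma)    # likelihood ratio
    return first, zeroth


def inverse_variance_combine(samples_a, samples_b, tiny=1e-12):
    """Weight each sample mean by the inverse of its estimated variance.
    Assumes the two estimators are unbiased and uncorrelated."""
    mean_a = statistics.fmean(samples_a)
    mean_b = statistics.fmean(samples_b)
    var_a = statistics.variance(samples_a) / len(samples_a) + tiny
    var_b = statistics.variance(samples_b) / len(samples_b) + tiny
    w_a = (1.0 / var_a) / (1.0 / var_a + 1.0 / var_b)
    return w_a * mean_a + (1.0 - w_a) * mean_b, w_a


theta, sigma = 1.0, 0.5
first, zeroth = sample_estimators(theta, sigma, n=5000)
combined, w_first = inverse_variance_combine(first, zeroth)
print("combined estimate:", combined)   # exact gradient is 2 * theta = 2.0
print("weight on 1st-order estimator:", w_first)
```

On this smooth objective the 1st-order samples have much lower variance, so most of the weight goes to them. The weighting is only justified when both estimators are unbiased, which is why discontinuities that bias the 1st-order estimator need separate handling.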

0 Citations

0 Influential

6.5 Altmetric

32.5 Score
