2605.01356v1 May 02, 2026 cs.LG

Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data

Ruiqi Xue (Citations: 69, h-index: 2)

Lei Yuan, Nanjing University (Citations: 857, h-index: 16)

Kai Cheng (Citations: 1, h-index: 1)

Jingwen Yang (Citations: 119, h-index: 4)

Yang Yu (Citations: 193, h-index: 6)

Original Abstract

Learning constraint-satisfying policies from offline data without risky online interaction is crucial for safety-critical decision making. Conventional methods typically learn cost value functions from abundant unsafe samples to define safety boundaries and penalize violations. However, in high-stakes scenarios, risky trial-and-error is infeasible, yielding datasets with few or no unsafe samples. Under this limitation, existing approaches often treat all data as uniformly safe, overlooking safe-but-infeasible states (states that currently satisfy constraints but inevitably violate them within a few steps), which leads to deployment failures. Drawing inspiration from the concept of knowledge-data integration, we leverage large language models (LLMs) to incorporate natural language knowledge into the policy to address this challenge. Specifically, we propose PROCO, a model-based offline safe reinforcement learning (RL) framework tailored to datasets largely free of violations. PROCO first learns a dynamics model from offline data and constructs a conservative cost function by grounding natural-language knowledge of unsafe states in LLMs, enabling risk estimation even without observed violations. Using the cost function and learned model, PROCO performs model-based rollouts to synthesize diverse counterfactual unsafe samples, supporting reliable feasibility identification and feasibility-guided policy learning. Across a range of Safety-Gymnasium tasks with exclusively safe or minimally risky training data, PROCO integrates seamlessly with a variety of offline safe RL algorithms and consistently demonstrates reduced constraint violations and improved safety performance compared to both the original methods and other behavior cloning baselines.
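
The idea of a safe-but-infeasible state and of synthesizing unsafe samples through model rollouts can be illustrated with a toy sketch. This is not PROCO's implementation. Everything here is an assumption made for illustration: the 1-D point-mass task, the hand-coded `model_step` (standing in for a dynamics model learned from offline data), the hand-written `cost` (standing in for a cost function derived from natural-language knowledge via an LLM), and the brute-force `is_feasible` check (standing in for whatever feasibility identification the paper actually uses).

```python
import itertools
import random

DT = 0.1                     # toy time step (assumption)
HAZARD_X = 1.0               # positions at or beyond this are unsafe (assumption)
ACTIONS = (-1.0, 0.0, 1.0)   # discretized accelerations (assumption)


def model_step(state, action):
    """Hand-coded stand-in for a dynamics model learned from offline data."""
    x, v = state
    v_next = v + action * DT
    x_next = x + v_next * DT
    return (x_next, v_next)


def cost(state):
    """Hand-written stand-in for a knowledge-derived cost function."""
    x, _ = state
    return 1.0 if x >= HAZARD_X else 0.0


def rollout(state, actions):
    """Roll the model forward; return (state, action, next_state, cost) tuples."""
    samples = []
    for a in actions:
        nxt = model_step(state, a)
        samples.append((state, a, nxt, cost(nxt)))
        state = nxt
    return samples


def is_feasible(state, horizon=5):
    """Feasible here means: some action sequence stays cost-free for `horizon` steps."""
    if cost(state) > 0:
        return False
    for seq in itertools.product(ACTIONS, repeat=horizon):
        if all(c == 0.0 for (_, _, _, c) in rollout(state, seq)):
            return True
    return False


def synthesize_unsafe_samples(dataset_states, horizon=5, n_rollouts=20, seed=0):
    """Random model rollouts from dataset states; keep transitions that incur cost."""
    rng = random.Random(seed)
    unsafe = []
    for s in dataset_states:
        for _ in range(n_rollouts):
            seq = [rng.choice(ACTIONS) for _ in range(horizon)]
            unsafe.extend(t for t in rollout(s, seq) if t[3] > 0)
    return unsafe


# A violation-free "offline dataset" of (position, velocity) states.
dataset = [(0.0, 0.0), (0.5, 0.5), (0.95, 2.0)]
feasibility = {s: is_feasible(s) for s in dataset}
unsafe = synthesize_unsafe_samples(dataset)
```

In this toy, the state `(0.95, 2.0)` has zero cost but is labeled infeasible, because it is moving too fast for any available action to keep it short of the hazard on the next step. The rollouts started from it supply the unsafe transitions that the violation-free dataset lacks. The check only looks `horizon` steps ahead, so states that would violate later are still labeled feasible.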

0 Citations

0 Influential

8 Altmetric

40.0 Score
