2603.08506v1 Mar 09, 2026 cs.LG

Oracle-Guided Soft Shielding for Safe Move Prediction in Chess

Agnès Delaborde

Citations: 262

h-index: 10

Prajit T. Rajendran

Citations: 180

h-index: 9

Fabio Arnez

Citations: 3

h-index: 1

Huáscar Espinoza

Citations: 1

h-index: 1

C. Mraidha

Citations: 612

h-index: 15

Original Abstract

In high-stakes environments, agents relying purely on imitation learning or reinforcement learning often struggle to avoid safety-critical errors during exploration. Existing reinforcement learning approaches for environments such as chess require hundreds of thousands of episodes and substantial computational resources to converge. Imitation learning, on the other hand, is more sample-efficient but is brittle under distributional shift and lacks mechanisms for proactive risk avoidance. In this work, we propose Oracle-Guided Soft Shielding (OGSS), a simple yet effective framework for safer decision-making that enables safe exploration by learning a probabilistic safety model from oracle feedback in an imitation learning setting. Focusing on the domain of chess, we train a model to predict strong moves based on past games, and separately learn a blunder prediction model from Stockfish evaluations to estimate the tactical risk of each move. During inference, the agent first generates a set of candidate moves and uses the blunder model to identify high-risk options. It then selects actions with a utility function that combines the move likelihood predicted by the policy model with the blunder probability, striking a balance between performance and safety. This enables the agent to explore and play competitively while significantly reducing the chance of tactical mistakes. Across hundreds of games against a strong chess engine, we compare our approach with other methods in the literature, such as action pruning, SafeDAgger, and uncertainty-based sampling. Our results demonstrate that OGSS variants maintain a lower blunder rate even as the agent's exploration ratio is increased severalfold, highlighting their ability to support broader exploration without compromising tactical soundness.
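
The inference step described in the abstract (generate candidates, score their risk, pick by utility) can be illustrated with a minimal sketch. The abstract does not give the form of the utility function, so the linear form used here, policy probability minus a weighted blunder probability, is an assumption made for illustration, as are the names `select_move`, `top_k`, and `risk_weight`. The probabilities are hand-written stand-ins for the outputs of the policy model and the blunder model, not results from the paper.

```python
from typing import Dict, List, Tuple


def select_move(
    policy_probs: Dict[str, float],
    blunder_probs: Dict[str, float],
    top_k: int = 3,
    risk_weight: float = 1.0,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Return the highest-utility move and the scored candidate list.

    ASSUMPTION: utility = policy probability - risk_weight * blunder probability.
    The paper's actual utility function may differ.
    """
    if not policy_probs:
        raise ValueError("policy_probs must contain at least one move")

    # Candidate set: the top_k moves ranked by policy probability.
    candidates = sorted(policy_probs, key=policy_probs.get, reverse=True)[:top_k]

    # Soft shielding: risky moves are penalized rather than pruned outright.
    scored = [
        (move, policy_probs[move] - risk_weight * blunder_probs[move])
        for move in candidates
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[0][0], scored


# Hypothetical model outputs for one position (illustrative values only).
policy = {"Qxb7": 0.50, "Nf3": 0.30, "e4": 0.15, "a3": 0.05}
blunder = {"Qxb7": 0.80, "Nf3": 0.05, "e4": 0.10, "a3": 0.02}

best, ranking = select_move(policy, blunder)
print(best, ranking)
```

With these stand-in numbers, the policy's favorite move `Qxb7` carries a high blunder probability, so the penalized utility prefers `Nf3`. Setting `risk_weight=0.0` recovers plain policy-greedy selection, which makes the effect of the shield easy to compare.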

