2602.08041v1 Feb 08, 2026 cs.LG

Implicit Strategic Optimization: Rethinking Long-Horizon Decision-Making in Adversarial Poker Environments

Boyang Xia (Citations: 9, h-index: 1)
Weiyou Tian (Citations: 15, h-index: 1)
Qingnan Ren (Citations: 236, h-index: 3)
Jiaqi Huang (Citations: 64, h-index: 2)
Jie Xiao (Citations: 3, h-index: 1)
Shuo Lu (Citations: 2, h-index: 1)
Kai Wang (Citations: 139, h-index: 4)
Lynn Ai (Citations: 27, h-index: 2)
Eric Yang (Citations: 39, h-index: 4)
Bill Shi (Citations: 27, h-index: 3)

Abstract

Training large language model (LLM) agents for adversarial games is often driven by episodic objectives such as win rate. In long-horizon settings, however, payoffs are shaped by latent strategic externalities that evolve over time, so myopic optimization and variation-based regret analyses can become vacuous even when the dynamics are predictable. To solve this problem, we introduce Implicit Strategic Optimization (ISO), a prediction-aware framework in which each agent forecasts the current strategic context and uses it to update its policy online. ISO combines a Strategic Reward Model (SRM) that estimates the long-run strategic value of actions with iso-grpo, a context-conditioned optimistic learning rule. We prove sublinear contextual regret and equilibrium convergence guarantees whose dominant terms scale with the number of context mispredictions; when prediction errors are bounded, our bounds recover the static-game rates obtained when strategic externalities are known. Experiments in 6-player No-Limit Texas Hold'em and competitive Pokemon show consistent improvements in long-term return over strong LLM and RL baselines, and graceful degradation under controlled prediction noise.
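
The abstract describes an agent that forecasts the current strategic context and applies a context-conditioned optimistic update online. As a rough illustration of that general idea only, the sketch below uses Optimistic Hedge (a standard optimistic online-learning rule) with one loss table per context. It is not the paper's iso-grpo or its Strategic Reward Model; the class name `ContextualOptimisticHedge`, the learning rate, and the two-context toy losses are all assumptions made for this example.

```python
import math


class ContextualOptimisticHedge:
    """Optimistic Hedge with a separate loss table per context (illustrative only)."""

    def __init__(self, n_actions, n_contexts, lr=0.5):
        self.lr = lr  # hypothetical learning rate, not taken from the paper
        self.cum_loss = [[0.0] * n_actions for _ in range(n_contexts)]
        self.last_loss = [[0.0] * n_actions for _ in range(n_contexts)]

    def policy(self, predicted_context):
        """Return action probabilities for the context the agent predicts."""
        cum = self.cum_loss[predicted_context]
        hint = self.last_loss[predicted_context]
        # Optimism: the most recent loss in this context serves as a guess
        # for the next one, so it is counted once more in the scores.
        scores = [-self.lr * (c + h) for c, h in zip(cum, hint)]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def update(self, true_context, losses):
        """Record the observed per-action losses under the realized context."""
        for action, loss in enumerate(losses):
            self.cum_loss[true_context][action] += loss
        self.last_loss[true_context] = list(losses)


# Toy setting (assumed): two contexts that alternate predictably, each
# favouring a different action. Context forecasts are always correct here.
LOSSES = {0: [0.0, 1.0], 1: [1.0, 0.0]}
learner = ContextualOptimisticHedge(n_actions=2, n_contexts=2)
for t in range(50):
    context = t % 2
    learner.policy(context)
    learner.update(context, LOSSES[context])
```

Because the forecasts in this toy run are always correct, each context's policy concentrates on its own best action. Passing a wrong `predicted_context` to `policy` would make the agent act on the other context's table, which is the kind of misprediction the abstract's regret bounds are stated in terms of.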

Citations: 1, Influential: 0, Altmetric: 2, Score: 11.0
