2604.09855v1 Apr 10, 2026 cs.AI

Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards

Yisong Yue (Citations: 19, h-index: 3)
Claire Chen (Citations: 6, h-index: 1)
David Simchi-Levi (Citations: 47, h-index: 4)
Shuze Liu (Citations: 5, h-index: 1)
Lei Lei (Citations: 3, h-index: 1)
Yuheng Zhang (Citations: 8, h-index: 2)
Jiabao Sean Xiao (Citations: 0, h-index: 0)

Original Abstract

The recent advancement of Large Language Models (LLMs) has established their potential as autonomous interactive agents. However, they often struggle in strategic games of incomplete information, such as bilateral price negotiation. In this paper, we investigate if Reinforcement Learning from Verifiable Rewards (RLVR) can effectively teach LLMs to negotiate. Specifically, we explore the strategic behaviors that emerge during the learning process. We introduce a framework that trains a mid-sized buyer agent against a regulated LLM seller across a wide distribution of real-world products. By grounding reward signals directly in the maximization of economic surplus and strict adherence to private budget constraints, we reveal a novel four-phase strategic evolution. The agent progresses from naive bargaining to using aggressive starting prices, moves through a phase of deadlock, and ultimately develops sophisticated persuasive skills. Our results demonstrate that this verifiable training allows a 30B agent to significantly outperform frontier models over ten times its size in extracting surplus. Furthermore, the trained agent generalizes robustly to stronger counterparties unseen during training and remains effective even when facing hostile, adversarial seller personas.
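The abstract states that rewards are grounded in economic surplus and in strict adherence to a private budget, but it does not give the formula. The following is a minimal sketch of what such a verifiable reward could look like, not the authors' implementation. The `Outcome` structure, the `buyer_reward` function, the normalization by budget, and the fixed penalty of -1.0 are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outcome:
    """Final state of one buyer-seller episode (hypothetical structure)."""
    deal_price: Optional[float]  # None when the negotiation ends without a deal
    budget: float                # buyer's private budget


def buyer_reward(outcome: Outcome, violation_penalty: float = -1.0) -> float:
    """Rule-based reward that can be checked from the episode outcome alone.

    Assumed shaping (not taken from the paper):
      - no deal            -> 0.0
      - price above budget -> fixed penalty
      - otherwise          -> surplus as a fraction of the budget
    """
    if outcome.budget <= 0:
        raise ValueError("budget must be positive")
    if outcome.deal_price is None:
        return 0.0
    if outcome.deal_price > outcome.budget:
        return violation_penalty
    # Surplus: how far below the private budget the agreed price landed.
    return (outcome.budget - outcome.deal_price) / outcome.budget


if __name__ == "__main__":
    print(buyer_reward(Outcome(deal_price=80.0, budget=100.0)))
    print(buyer_reward(Outcome(deal_price=120.0, budget=100.0)))
    print(buyer_reward(Outcome(deal_price=None, budget=100.0)))
```

Because the reward depends only on the agreed price and the budget, it can be computed by a rule rather than by a learned judge, which is the sense of "verifiable" assumed in this sketch.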
