2605.01643v1 May 02, 2026 cs.LG

AI Alignment via Incentives and Correction

Elad Hazan

Princeton University

Citations: 27,805

h-index: 65

Rohit Agarwal

Citations: 105

h-index: 4

J. Lin

Citations: 41

h-index: 2

Mark Braverman

Citations: 100

h-index: 2

Original Abstract

We study AI alignment through the lens of law-and-economics models of deterrence and enforcement. In these models, misconduct is not treated as an external failure, but as a strategic response to incentives: an actor weighs the gain from violation against the probability of detection and the severity of punishment. We argue that the same logic arises naturally in agentic AI pipelines. A solver may benefit from producing a persuasive but incorrect answer, hiding uncertainty, or exploiting spurious shortcuts, while an auditor or verifier must decide whether costly monitoring is worthwhile. Alignment is therefore a fixed-point problem: stronger penalties may deter solver misbehavior, but they can also reduce the auditor's incentive to inspect, since auditing then mainly incurs cost on a population that appears increasingly aligned. This perspective also changes what should count as a post-training signal. Standard feedback often attaches reward to the final answer alone, but a solver-auditor pipeline exposes the full correction event: whether the solver erred, whether the auditor inspected, whether the error was caught, and whether oversight incentives remained active. We formalize this interaction in a two-agent model in which a principal chooses rewards over joint correction outcomes, inducing both solver behavior and auditor monitoring. Reward design is therefore a bilevel optimization problem: rewards are judged not by their immediate semantic meaning, but by the behavioral equilibrium they induce. We propose a bandit-based outer-loop procedure for searching over reward profiles using noisy interaction feedback. Experiments on an LLM coding pipeline show that adaptive reward profiles can maintain useful oversight pressure and improve principal-aligned outcomes relative to static hand-designed rewards, including a substantial reduction in hallucinated incorrect attempts.
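The fixed-point and outer-loop ideas in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the paper's model or algorithm: the payoff structure is a textbook two-player inspection game, the outer loop uses UCB1 as one possible bandit method, and the names `equilibrium`, `episode`, `ucb1`, and the `(penalty, catch_reward)` reward profile are hypothetical. In this toy game the auditor's equilibrium inspection rate falls as the solver's penalty rises, which mirrors the tension the abstract describes.

```python
import math
import random


def equilibrium(gain, penalty, audit_cost, catch_reward):
    """Equilibrium of a toy inspection game (illustrative assumption).

    Returns (p_err, q_inspect): the solver's error probability and the
    auditor's inspection probability.
    """
    if catch_reward <= audit_cost:
        # Inspecting never pays, so the auditor abstains and the solver errs.
        return 1.0, 0.0
    # Auditor is indifferent when p_err * catch_reward == audit_cost.
    p_err = audit_cost / catch_reward
    # Solver is indifferent when (1 - q) * gain == q * penalty.
    q_inspect = gain / (gain + penalty)
    return p_err, q_inspect


def episode(profile, rng, gain=1.0, audit_cost=0.5):
    """Sample one noisy correction event and return the principal's payoff."""
    penalty, catch_reward = profile
    p_err, q_inspect = equilibrium(gain, penalty, audit_cost, catch_reward)
    erred = rng.random() < p_err
    inspected = rng.random() < q_inspect
    caught = erred and inspected
    correct = (not erred) or caught
    # Hypothetical principal payoff: 1 for a correct final output,
    # minus a small payment to the auditor whenever an error is caught.
    return (1.0 if correct else 0.0) - (0.05 * catch_reward if caught else 0.0)


def ucb1(profiles, rounds, seed=0):
    """Outer loop: UCB1 over a finite set of reward profiles."""
    rng = random.Random(seed)
    counts = [0] * len(profiles)
    means = [0.0] * len(profiles)
    for t in range(1, rounds + 1):
        if t <= len(profiles):
            arm = t - 1  # pull every arm once before using the index
        else:
            arm = max(
                range(len(profiles)),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]),
            )
        payoff = episode(profiles[arm], rng)
        counts[arm] += 1
        means[arm] += (payoff - means[arm]) / counts[arm]  # running average
    return counts, means


if __name__ == "__main__":
    candidate_profiles = [(1.0, 1.0), (3.0, 2.0), (9.0, 4.0)]
    counts, means = ucb1(candidate_profiles, rounds=2000)
    for profile, n, m in zip(candidate_profiles, counts, means):
        print(profile, n, round(m, 3))
```

The inner step solves for the behavior a profile induces, and the outer step scores the profile only by the noisy payoff of that induced behavior, which is the bilevel structure in miniature. A real pipeline would replace the closed-form equilibrium with observed solver and auditor behavior.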

0 Citations

0 Influential

30 Altmetric

150.0 Score
