arXiv:2603.14602v1 [cs.CL] Mar 15, 2026

PA³: Policy-Aware Agent Alignment through Chain-of-Thought

Shubhashis Roy Dipta (University of Maryland, Baltimore County), Lichao Wang, Benjamin Z. Yao, Chenlei Guo, Kun Zhou, Daniel Bis, Ruhi Sarikaya

Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle to adhere to complex, business-specific rules. While models can reason over business rules provided in context, including every policy for every query adds latency and wastes compute. Moreover, the resulting long contexts degrade overall performance due to the "needle-in-a-haystack" problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply the relevant business policies during chain-of-thought reasoning at inference time, without placing the full policy set in context. We further introduce a novel PolicyRecall reward based on the Jaccard score, together with a Hallucination Penalty, for GRPO training. Overall, our best model outperforms the baseline by 16 points and surpasses in-context baselines of similar model size by 3 points, while using 40% fewer words.
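The abstract does not give the reward's exact form; the Python sketch below illustrates, under assumptions, how a Jaccard-based PolicyRecall reward with a hallucination penalty might score the policies a model recalls against the gold applicable set. The function name, the penalty weight, and the policy-ID set representation are all hypothetical, not the authors' implementation.

# Hypothetical sketch of a Jaccard-based PolicyRecall reward with a
# hallucination penalty, as described at a high level in the abstract.
# All names, the penalty weight, and the policy-ID representation are
# assumptions; the paper's exact formulation may differ.

def policy_recall_reward(recalled: set[str], gold: set[str],
                         hallucination_weight: float = 0.5) -> float:
    if not recalled and not gold:
        return 1.0  # nothing to recall and nothing recalled: perfect
    union = recalled | gold
    jaccard = len(recalled & gold) / len(union)  # PolicyRecall term
    hallucinated = recalled - gold               # cited but not applicable
    penalty = hallucination_weight * len(hallucinated) / len(union)
    return jaccard - penalty

# Example: two correct policies recalled, one hallucinated.
# Jaccard = 2/3, penalty = 0.5 * 1/3, reward = 0.5.
print(policy_recall_reward({"refund_window", "vip_tier", "fraud_hold"},
                           {"refund_window", "vip_tier"}))

In GRPO, such a scalar reward would be computed per sampled completion and group-normalized into advantages; this sketch covers only the scoring function itself.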
