2604.05125v1 Apr 06, 2026 cs.IR

Offline Reinforcement Learning for Adaptive Policy Retrieval in Prior Authorization

Offline RL for Adaptive Policy Retrieval in Prior Authorization

Hanna Clay


Ruslan Sharifullin


M. Gorshkov


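Prior authorization (PA) requires interpreting complex, fragmented coverage policies, yet existing retrieval-based systems rely on a static approach that retrieves a fixed number of items. Such fixed retrieval is inefficient and can gather irrelevant or insufficient information. This work models policy retrieval in prior authorization as a sequential decision-making problem and formulates adaptive retrieval as a Markov Decision Process (MDP). In the proposed system, an agent iteratively selects policy chunks from a candidate set, or stops retrieving and issues a decision. The reward balances decision correctness against retrieval cost, reflecting the trade-off between accuracy and efficiency. Policies are trained with Conservative Q-Learning (CQL), Implicit Q-Learning (IQL), and Direct Preference Optimization (DPO) on logged trajectories of baseline retrieval strategies over synthetic prior authorization requests generated from publicly available CMS coverage data. On a dataset of 186 policy chunks covering 10 CMS procedures, CQL reaches 92% decision accuracy through exhaustive retrieval, 30 percentage points above the best fixed-K baseline. IQL matches the accuracy of the best baseline while using 44% fewer retrieval steps, and is the only policy to achieve a positive episodic return. Transition-level DPO matches CQL's 92% accuracy while using 47% fewer retrieval steps (10.6 vs. 20.0 steps). DPO sits in a "selective-accurate" region of the Pareto frontier that dominates both CQL and BC. Because a behavioral cloning (BC) baseline performs on par with CQL, learning selective retrieval appears to require advantage-weighted or preference-based policy extraction. An ablation over the step cost λ ∈ {0.05, 0.1, 0.2} shows a clear accuracy-efficiency inflection point: only at λ = 0.2 does CQL switch from exhaustive to selective retrieval.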
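
The abstract does not spell out the reward, so the following is only one plausible way to write the episodic return, assuming a unit reward for a correct decision and a constant per-step cost $\lambda$. The symbols $n$, $\hat{y}$, and $y$ are introduced here for illustration and do not come from the paper.

$$
R(\tau) = \mathbb{1}[\hat{y} = y] - \lambda\, n
$$

Here $n$ is the number of retrieval steps taken before stopping, $\hat{y}$ is the decision the agent issues, and $y$ is the correct decision. Under this assumed form, a larger $\lambda$ makes each additional retrieved chunk more expensive, which is consistent with the reported shift toward selective retrieval at $\lambda = 0.2$.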

Original Abstract

Prior authorization (PA) requires interpretation of complex and fragmented coverage policies, yet existing retrieval-augmented systems rely on static top-$K$ strategies with fixed numbers of retrieved sections. Such fixed retrieval can be inefficient and gather irrelevant or insufficient information. We model policy retrieval for PA as a sequential decision-making problem, formulating adaptive retrieval as a Markov Decision Process (MDP). In our system, an agent iteratively selects policy chunks from a top-$K$ candidate set or chooses to stop and issue a decision. The reward balances decision correctness against retrieval cost, capturing the trade-off between accuracy and efficiency. We train policies using Conservative Q-Learning (CQL), Implicit Q-Learning (IQL), and Direct Preference Optimization (DPO) in an offline RL setting on logged trajectories generated from baseline retrieval strategies over synthetic PA requests derived from publicly available CMS coverage data. On a corpus of 186 policy chunks spanning 10 CMS procedures, CQL achieves 92% decision accuracy (+30 percentage points over the best fixed-$K$ baseline) via exhaustive retrieval, while IQL matches the best baseline accuracy using 44% fewer retrieval steps and achieves the only positive episodic return among all policies. Transition-level DPO matches CQL's 92% accuracy while using 47% fewer retrieval steps (10.6 vs. 20.0), occupying a "selective-accurate" region on the Pareto frontier that dominates both CQL and BC. A behavioral cloning baseline matches CQL, confirming that advantage-weighted or preference-based policy extraction is needed to learn selective retrieval. Lambda ablation over step costs $\lambda \in \{0.05, 0.1, 0.2\}$ reveals a clear accuracy-efficiency inflection: only at $\lambda = 0.2$ does CQL transition from exhaustive to selective retrieval.
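
To make the select-or-stop formulation concrete, here is a minimal Python sketch of such an episode. It is not the authors' environment: the class `RetrievalEpisode`, the `STOP` action id, the helper `run_episode`, and the toy rule that a decision is correct when every relevant chunk has been retrieved are all assumptions made for illustration. The reward shape (a cost of λ per retrieved chunk, plus 1 for a correct final decision) is likewise assumed, since the abstract only says that the reward balances correctness against retrieval cost.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

STOP = -1  # hypothetical action id: stop retrieving and issue a decision


@dataclass
class RetrievalEpisode:
    """Toy select-or-stop episode over a top-K candidate set (illustrative only)."""
    relevant: Set[int]        # candidate indices needed for a correct decision (toy oracle)
    num_candidates: int       # size K of the candidate set
    step_cost: float = 0.1    # lambda, the assumed per-retrieval cost
    retrieved: List[int] = field(default_factory=list)
    done: bool = False

    def step(self, action: int) -> Tuple[float, bool]:
        """Apply one action and return (reward, done)."""
        if self.done:
            raise RuntimeError("episode already finished")
        if action == STOP:
            self.done = True
            # Assumed correctness rule: all relevant chunks were retrieved.
            correct = self.relevant.issubset(self.retrieved)
            return (1.0 if correct else 0.0), True
        if not 0 <= action < self.num_candidates:
            raise ValueError(f"action {action} is outside the candidate set")
        self.retrieved.append(action)
        return -self.step_cost, False


def run_episode(episode: RetrievalEpisode, actions: List[int]) -> float:
    """Play a fixed action sequence and return the undiscounted episodic return."""
    total = 0.0
    for action in actions:
        reward, done = episode.step(action)
        total += reward
        if done:
            break
    return total


if __name__ == "__main__":
    selective = RetrievalEpisode(relevant={2, 5}, num_candidates=20)
    exhaustive = RetrievalEpisode(relevant={2, 5}, num_candidates=20)
    print("selective return: ", run_episode(selective, [2, 5, STOP]))
    print("exhaustive return:", run_episode(exhaustive, list(range(20)) + [STOP]))
```

In this toy setup both action sequences end in a correct decision, but the exhaustive one pays the step cost twenty times. That is the kind of gap the abstract describes between exhaustive and selective policies at equal accuracy.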

