2605.26554v1 May 26, 2026 cs.LG

Linear and Neural Dueling Bandits with Delayed Feedback

Mingze Kong
Zhi Hong
Zhiyong Wang
Zhongxiang Dai
Jieming Mao
Xiangyi Wang
Pingchen Lu

Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption of immediate feedback, a condition frequently violated in real-world scenarios such as prompt optimization. Delayed feedback introduces a unique theoretical challenge: unlike linear bandits, dueling bandit estimators lack closed-form solutions, so naive adaptations of standard weighting techniques are biased. To address this, we formalize the problem of Contextual Dueling Bandits with Stochastic Delayed Feedback and propose two novel algorithms: Linear (LDB-DF) and Neural (NDB-DF) Dueling Bandits with Delayed Feedback. Central to our approach is a novel estimator that integrates an Inverse Probability Weighting (IPW) mechanism directly into the loss function, ensuring unbiased correction for delayed or missing feedback. We provide a comprehensive theoretical analysis, establishing an O(d√T) regret bound for the linear setting and sub-linear guarantees for the neural setting. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of the proposed algorithms.
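
To make the idea of placing IPW weights inside the loss more concrete, the following is a minimal sketch and not the paper's estimator, whose exact form the abstract does not give. It assumes a linear Bradley-Terry preference model, a known exponential delay distribution that is independent of the preference outcome, and a ridge penalty; the function names (`ipw_logistic_loss`, `ipw_gradient`, `fit`, `simulate`) and all parameters are hypothetical. Each round's log-loss term is multiplied by `observed / p_obs`, whose expectation is 1 under these assumptions, so the weighted loss matches the full-feedback loss in expectation.

```python
import numpy as np


def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))


def ipw_logistic_loss(theta, Z, y, observed, p_obs, lam=1.0):
    """IPW-weighted Bernoulli negative log-likelihood plus a ridge penalty.

    Z[s] is the feature difference x1 - x2 of the duel in round s,
    y[s] is 1 if the first arm won, observed[s] is 1 if that feedback
    has arrived, and p_obs[s] is the (assumed known) arrival probability.
    """
    w = observed / p_obs                 # zero weight for missing feedback
    u = Z @ theta
    nll = np.logaddexp(0.0, u) - y * u   # log(1 + e^u) - y*u
    return float(np.sum(w * nll) + 0.5 * lam * theta @ theta)


def ipw_gradient(theta, Z, y, observed, p_obs, lam=1.0):
    w = observed / p_obs
    return Z.T @ (w * (sigmoid(Z @ theta) - y)) + lam * theta


def fit(Z, y, observed, p_obs, lam=1.0, steps=500):
    """Gradient descent with step size 1/L; there is no closed-form minimizer."""
    w = observed / p_obs
    # sigmoid' <= 1/4, so L upper-bounds the largest Hessian eigenvalue.
    L = 0.25 * np.linalg.eigvalsh(Z.T @ (w[:, None] * Z))[-1] + lam
    theta = np.zeros(Z.shape[1])
    for _ in range(steps):
        theta -= ipw_gradient(theta, Z, y, observed, p_obs, lam) / L
    return theta


def simulate(n, theta, mean_delay, rng):
    """Hypothetical data generator: n duels, evaluated right after round n."""
    d = theta.shape[0]
    Z = rng.normal(size=(n, d)) - rng.normal(size=(n, d))
    y = (rng.random(n) < sigmoid(Z @ theta)).astype(float)
    elapsed = (n - np.arange(n)).astype(float)      # n, n-1, ..., 1
    delays = rng.exponential(mean_delay, size=n)    # assumed delay model
    observed = (delays <= elapsed).astype(float)
    p_obs = 1.0 - np.exp(-elapsed / mean_delay)     # exponential CDF
    return Z, y, observed, p_obs
```

A call such as `fit(*simulate(2000, np.array([1.0, -0.5, 0.25]), 50.0, np.random.default_rng(0)))` returns an estimate of the preference parameter from partially arrived feedback. Because the minimizer has to be found iteratively, the weights enter through the objective itself rather than through a closed-form solution, which is the structural point the abstract makes about dueling bandits. In this sketch the most recent rounds receive large weights because their arrival probability is small, so a practical variant would likely need to control that variance.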

