Y

Yimin Deng

Total Citations
16
h-index
3
Papers
3

Publications

#1 2605.05826v1 May 07, 2026

AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes. In this work, we propose Asymmetric Group Policy Optimization (AGPO) to counteract this boundary shrinkage. AGPO adopts a negative-dominant reinforcement strategy to suppress incorrect reasoning paths, maintaining the base model's exploration capacity. For positive reinforcement, AGPO adopts a group advantage mechanism, which scales positive updates based on intra-group variance, allowing the model to focus on rare correct paths while suppressing updates from trivial paths. Our experiments on five mathematical benchmarks demonstrate that AGPO achieves state-of-the-art accuracy while consistently improving pass@$k$ performance at scale. In a large-scale industrial application for search ads relevance optimization, AGPO effectively enhances the quality of the data annotation, leading to substantial performance gains in downstream student models.

Zhengru Fang Kai Ming Ting Yimin Deng Yang Xu Kun Yao +1
0 Citations
#2 2604.24186v1 Apr 27, 2026

MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning

Diagnostic prediction and clinical reasoning are critical tasks in healthcare applications. While Large Language Models (LLMs) have shown strong capabilities in commonsense reasoning, they still struggle with diagnostic reasoning due to limited domain knowledge. Existing approaches often rely on internal model knowledge or static knowledge bases, resulting in knowledge insufficiency and limited adaptability, which hinder their capacity to perform diagnostic reasoning. Moreover, these methods focus solely on the accuracy of final predictions, overlooking alignment with standard clinical reasoning trajectories. To this end, we propose MultiDx, a two-stage diagnostic reasoning framework that performs differential diagnosis by analyzing evidence collected from multiple knowledge sources. Specifically, it first generates suspected diagnoses and reasoning paths by leveraging knowledge from web search, SOAP-formatted case, and clinical case database. Then it integrates multi-perspective evidence through matching, voting, and differential diagnosis to generate the final prediction.~Extensive experiments on two public benchmarks demonstrate the effectiveness of our approach.

Yujia Liu Zichuan Fu Yejing Wang Xian Wu Yefeng Zheng +7
0 Citations
#3 2604.24175v1 Apr 27, 2026

AdapTime: Enabling Adaptive Temporal Reasoning in Large Language Models

Large language models have demonstrated strong reasoning capabilities in general knowledge question answering. However, their ability to handle temporal information remains limited. To address this limitation, existing approaches often involve external tools or manual verification and are tailored to specific scenarios, leading to poor generalizability. Moreover, these methods apply a fixed pipeline to all questions, overlooking the fact that different types of temporal questions require distinct reasoning strategies, which leads to unnecessary processing for simple cases and inadequate reasoning for complex ones. To this end, we propose AdapTime, an adaptive temporal reasoning method that dynamically executes reasoning steps based on the input context. Specifically, it involves three temporal reasoning actions: reformulate, rewrite and review, with an LLM planner guiding the reasoning process. AdapTime integrates seamlessly with state-of-the-art LLMs and significantly enhances their temporal reasoning capabilities without relying on external support. Extensive experiments demonstrate the effectiveness of our approach.

Yujia Liu Zichuan Fu Yejing Wang Xian Wu Yefeng Zheng +6
0 Citations