W

Wenbo Li

Total Citations
42
h-index
4
Papers
2

Publications

#1 2602.00564v1 Jan 31, 2026

Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs

Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about their ability to diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained process-level evaluation. Alongside the dataset, we introduce HCRS (Hazard-aware Chain-based Rule Score), a deterministic step-level scoring function, and train a Process Reward Model (PRM) on the annotated reasoning traces. Empirically, while leading models attain relatively high final-answer accuracy (up to 5.8/10), HCRS-based holistic evaluation yields substantially lower scores (average 4.36/10, best 5.14/10), showing that answer-only metrics can overestimate reasoning robustness.

Kevin I-Kai Wang Xiaoxiao Xu Weiqi Zhai Ya-Qi Mo Ze Xu +12
0 Citations
#2 2602.00017v1 Jan 14, 2026

SafeTalkCoach: Diversity-Driven Multi-Agent Simulation for Parent-Teen Health Conversations

The importance of effective parent-child communication about sexual health is widely acknowledged, but real-world data on these conversations is scarce and challenging to collect, due to their private and sensitive nature. Although LLMs have been widely adopted in dialogue generation, they may deviate from best practices and frequently lack realism and diversity. We introduce SafeTalkCoach, a diversity-driven multi-agent dialogue generation framework that simulates parent-child conversations about sexual health, and present an accompanying dataset. SafeTalkCoach integrates crowd-sourced and synthesized scenarios, established sexual health guidelines, evidence-based personas, adaptive control modules, and hierarchical diversification. Through evaluations, we demonstrate that SafeTalkCoach generates diverse conversations while maintaining realism, communication quality, and controllability in practice. Our goal is that the SafeTalkCoach framework and the dataset support both AI research and health communications practices.

Tahreem Yasir Wenbo Li Benyamin T. Tabarsi Tiffany Barnes Dongkuan Xu +2
0 Citations