ArtI-Insight

#1 2603.27076v1 Mar 28, 2026

When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring

Large language models (LLMs) are increasingly used for automated tutoring, but their reliability in structured symbolic domains remains unclear. We study step-level feedback for propositional logic proofs, which require precise symbolic reasoning aligned with a learner's current proof state. We introduce a knowledge-graph-grounded benchmark of 516 unique proof states with step-level annotations and difficulty metrics. Unlike prior tutoring evaluations that rely on model self-assessment or binary correctness, our framework enables fine-grained analysis of feedback quality against verified solution paths. We evaluate three role-specialized pipelines with varying solution access: Tutor (partial solution access), Teacher (full derivation access), and Judge (verification of Tutor feedback). Our results reveal a striking asymmetry: verification improves outcomes when upstream feedback is error-prone (<70% accuracy), but degrades performance by 4-6 percentage points through over-specification when feedback is already reliable (>85%). Critically, we identify a shared complexity ceiling; no model or pipeline reliably succeeds on proof states exceeding complexity 4-5. These findings challenge the assumption that adding verifiers or richer context universally improves tutoring, motivating adaptive, difficulty-aware architectures that route problems by estimated complexity and upstream reliability.

Sutapa Dey Tithi Tahreem Yasir Xiaoyi Tian Benyamin T. Tabarsi Tiffany Barnes +5

1 Citations

#2 2602.03950v2 Feb 03, 2026

Enhancing Mathematical Problem Solving in LLMs through Execution-Driven Reasoning Augmentation

Mathematical problem solving is a fundamental benchmark for assessing the reasoning capabilities of artificial intelligence and a gateway to applications in education, science, and engineering where reliable symbolic reasoning is essential. Although recent advances in multi-agent LLM-based systems have enhanced their mathematical reasoning capabilities, they still lack a reliably revisable representation of the reasoning process. Existing agents either operate in rigid sequential pipelines that cannot correct earlier steps or rely on heuristic self-evaluation that can fail to identify and fix errors. In addition, programmatic context can distract language models and degrade accuracy. To address these gaps, we introduce Iteratively Improved Program Construction (IIPC), a reasoning method that iteratively refines programmatic reasoning chains and combines execution feedback with the native Chain-of-thought abilities of the base LLM to maintain high-level contextual focus. IIPC surpasses competing approaches in the majority of reasoning benchmarks on multiple base LLMs. All code and implementations are released as open source.

Aditya Basarkar Benyamin T. Tabarsi Tiffany Barnes Dongkuan Xu

0 Citations

#3 2602.00017v1 Jan 14, 2026

SafeTalkCoach: Diversity-Driven Multi-Agent Simulation for Parent-Teen Health Conversations

The importance of effective parent-child communication about sexual health is widely acknowledged, but real-world data on these conversations is scarce and challenging to collect, due to their private and sensitive nature. Although LLMs have been widely adopted in dialogue generation, they may deviate from best practices and frequently lack realism and diversity. We introduce SafeTalkCoach, a diversity-driven multi-agent dialogue generation framework that simulates parent-child conversations about sexual health, and present an accompanying dataset. SafeTalkCoach integrates crowd-sourced and synthesized scenarios, established sexual health guidelines, evidence-based personas, adaptive control modules, and hierarchical diversification. Through evaluations, we demonstrate that SafeTalkCoach generates diverse conversations while maintaining realism, communication quality, and controllability in practice. Our goal is that the SafeTalkCoach framework and the dataset support both AI research and health communications practices.

Tahreem Yasir Wenbo Li Benyamin T. Tabarsi Tiffany Barnes Dongkuan Xu +2

0 Citations

Dongkuan Xu

Publications

When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring

Enhancing Mathematical Problem Solving in LLMs through Execution-Driven Reasoning Augmentation

SafeTalkCoach: Diversity-Driven Multi-Agent Simulation for Parent-Teen Health Conversations