2605.28301v1 May 27, 2026 cs.AI

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

Yujia Liu
Zhizhong Fu
Honghan Wu
Zhaoyang Jiang
Jiacong Mi
Zicheng Li
Xuanqi Peng
Yunsoo Kim
University College London

Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher's reasoning trace, but it is typically evaluated by final-answer metrics such as accuracy. We ask whether gains in answer quality are accompanied by improvements in the trace. In medical QA, where short answer options can leave a richer clinical justification under-specified, a Qwen3-8B student distilled from a DeepSeek-V3-family teacher improves on MedQA-USMLE answer metrics (SC@64 74.7% to 84.4%; expected calibration error (ECE) 0.096 to 0.034). Yet under a Kimi-K2.6 style-blind LLM-judge audit, its error rate over non-abstained steps rises from 30.6% to 50.3%. In this primary medical setting, answer quality and trace factuality move in opposite directions. This before-after pattern persists across evaluators, teacher strengths, student scales and families, medical benchmarks, and style, segmentation, and answer-correctness controls. A 150-step blinded audit by a clinical expert reproduces the same ordering. Boundary checks narrow the scope of the claim: the risk appears when a compact answer under-constrains the rationale and a capable student can imitate expert-like form without reliably grounding each local claim. Standard answer metrics and aggregate hedging rates do not reveal the shift. When such traces are released or reused, answer-level metrics alone are insufficient.
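
To make the quantities named in the abstract concrete, the following is a minimal sketch of how a self-consistency answer, a binned ECE, and an error rate over non-abstained steps are commonly computed. It is an illustration under assumptions, not the authors' implementation: the abstract does not give the paper's exact definitions, so the equal-width binning, the label strings "ok", "error", and "abstain", and all function names here are hypothetical.

```python
from collections import Counter


def majority_vote(samples):
    """Self-consistency answer: the most frequent answer among sampled completions.

    Assumption: SC@64 would call this on 64 sampled answers per question.
    """
    return Counter(samples).most_common(1)[0][0]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard equal-width binned ECE (an assumed variant, not confirmed by the source).

    Each bin contributes |accuracy - mean confidence|, weighted by its share of items.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece


def step_error_rate(labels):
    """Share of steps labeled "error" among steps the judge did not abstain on.

    Returns None when every step was abstained. Label names are hypothetical.
    """
    judged = [label for label in labels if label != "abstain"]
    if not judged:
        return None
    return sum(1 for label in judged if label == "error") / len(judged)


if __name__ == "__main__":
    print(majority_vote(["A", "B", "A"]))
    print(expected_calibration_error([0.95, 0.95, 0.65, 0.65], [True, True, True, False]))
    print(step_error_rate(["ok", "error", "abstain", "error"]))
```

The point of separating the three functions is that the first two look only at final answers, while the third looks only at judged steps, so nothing forces them to move together. That independence is what allows the divergence the abstract reports.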
