2605.30348v1 May 28, 2026 cs.CL

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

Xiaohan Zhao
Zhaoyi Li
Zhiqiang Shen
Xinyi Shang
Yaxin Luo
Jiacheng Cui
Jiacheng Liu
Xinyue Bi

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data composition or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\textbf{LLMSurgeon}$, a framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated $\textit{soft}$ confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. For evaluation, we introduce $\textbf{LLMScan}$, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.
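
The inverse problem described in the abstract can be illustrated with a generic label-shift estimator. The sketch below is not the authors' implementation: it assumes a soft confusion matrix $C$ whose column $j$ is the mean classifier probability vector on text from true domain $j$, and a vector $q$ holding the mean classifier output on text generated by the target model. Under label shift, $q \approx C\pi$, so the mixture $\pi$ can be estimated by least squares constrained to the probability simplex. The function names (`project_to_simplex`, `estimate_mixture`), the projected-gradient solver, and the toy numbers are all hypothetical choices made for illustration.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    n = v.size
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u) - 1.0                  # cumulative sums minus target total
    ks = np.arange(1, n + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1] # last index kept in the support
    theta = css[rho] / (rho + 1)              # shift that makes the result sum to 1
    return np.maximum(v - theta, 0.0)

def estimate_mixture(C, q, steps=5000):
    """Minimize ||C @ pi - q||^2 over the simplex by projected gradient descent.

    C[i, j]: mean predicted probability of domain i on text from true domain j
             (hypothetical soft confusion matrix; columns sum to 1).
    q[i]:    mean predicted probability of domain i on target-model text.
    """
    k = C.shape[1]
    pi = np.full(k, 1.0 / k)                  # start from the uniform mixture
    lipschitz = 2.0 * np.linalg.norm(C, 2) ** 2
    for _ in range(steps):
        grad = 2.0 * C.T @ (C @ pi - q)
        pi = project_to_simplex(pi - grad / lipschitz)
    return pi

# Toy example with made-up numbers (three domains).
C = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.7]])
true_pi = np.array([0.6, 0.3, 0.1])
q = C @ true_pi            # what naive aggregation of classifier outputs would report
pi_hat = estimate_mixture(C, q)

print("naive aggregate:", np.round(q, 3))
print("corrected      :", np.round(pi_hat, 3))
```

In this toy setting, directly averaging classifier outputs returns $q$, which is biased toward the domains the classifier confuses, while the constrained solve recovers the mixture that generated it. Real confusion matrices are estimated from finite samples, so an exact recovery like this should not be expected in practice.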

