2605.30348v1 May 28, 2026 cs.CL

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

Xiaohan Zhao

Zhaoyi Li

Zhiqiang Shen

Xinyi Shang

Yaxin Luo

Jiacheng Cui

Jiacheng Liu

Xinyue Bi

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data composition or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\textbf{LLMSurgeon}$, a framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated $\textit{soft}$ confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. For evaluation, we introduce $\textbf{LLMScan}$, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.
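The label-shift correction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`soft_confusion_matrix`, `estimate_mixture`, `project_to_simplex`), the least-squares objective, and the projected-gradient solver are all assumptions chosen for illustration, and the confusion matrix in the demo is synthetic. The idea is that if $C_{ij}$ is the mean probability a domain classifier assigns to domain $i$ on text truly from domain $j$, and $q$ is the mean classifier output on text generated by the target LLM, then under label shift $q \approx C p$, so the mixture $p$ can be estimated by solving for $p$ on the probability simplex.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1}."""
    n = v.size
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, n + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def soft_confusion_matrix(probs, labels, num_domains):
    """C[i, j] = mean predicted probability of domain i on held-out
    text whose true domain is j (assumes every domain has samples)."""
    C = np.zeros((num_domains, num_domains))
    for j in range(num_domains):
        C[:, j] = probs[labels == j].mean(axis=0)
    return C

def estimate_mixture(C, q, steps=5000):
    """Hypothetical solver: minimize ||C p - q||^2 over the simplex
    with projected gradient descent."""
    k = C.shape[1]
    p = np.full(k, 1.0 / k)                   # start from the uniform mixture
    lr = 1.0 / (np.linalg.norm(C, 2) ** 2)    # 1 / (largest singular value)^2
    for _ in range(steps):
        grad = C.T @ (C @ p - q)              # gradient of 0.5 * ||C p - q||^2
        p = project_to_simplex(p - lr * grad)
    return p

# Synthetic demo (values are made up): three domains with some confusion.
C = np.array([[0.80, 0.15, 0.10],
              [0.10, 0.70, 0.20],
              [0.10, 0.15, 0.70]])
p_true = np.array([0.6, 0.3, 0.1])
q = C @ p_true                 # what naive averaging of classifier outputs would report
p_hat = estimate_mixture(C, q) # confusion-corrected estimate of the mixture
```

In this toy setting the naive aggregate `q` differs from `p_true` because of the off-diagonal confusion, while the constrained inverse step recovers the mixture. With real classifier outputs, `q` is noisy and `C` may be ill-conditioned, so the recovered mixture is only an estimate.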
