2604.22893v1 Apr 24, 2026 cs.LG

Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs

Minghui Xu

Qimeng Luo

Kun Li

Original Abstract

Traditional data valuation methods based on "row-count × quality coefficient" paradigms fail to capture the nuanced, nonlinear contributions that data makes to Large Language Model (LLM) capabilities. This paper presents a dynamic data valuation framework that transitions from static accounting to utility-based pricing. Our approach operates on three layers: (1) token-level information density metrics using Shannon entropy and Data Quality Scores; (2) empirical training gain measurement through influence functions, proxy model strategies, and Data Shapley values; and (3) cryptographic verifiability through hash-based commitments, Merkle trees, and a tamper-evident training ledger. We provide comprehensive experimental validation on three real domains (instruction following, mathematical reasoning, and code summarization), demonstrating that proxy-based empirical gain achieves near-perfect ranking alignment with realized utility, substantially outperforming row-count and token-count baselines. This framework enables a fair Data-as-a-Service economy where high-reasoning data is priced according to its actual contribution to model intelligence, while providing the transparency and auditability necessary for trustworthy data markets.
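
As a rough illustration of two building blocks named in the abstract, the sketch below computes the Shannon entropy of a token sequence (layer 1) and a Merkle root over dataset records (layer 3). It is not the authors' implementation: the abstract does not specify how tokens are produced, how entropy feeds into the Data Quality Score, or how the ledger is laid out. The function names `token_entropy` and `merkle_root`, the use of SHA-256, and the rule of duplicating the last node on odd-sized levels are all assumptions made for this example.

```python
import hashlib
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy, in bits per token, of the empirical token distribution.

    Hypothetical helper: the paper's exact density metric is not given here.
    """
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def merkle_root(records):
    """Hex Merkle root over a list of byte records using SHA-256.

    Assumption: when a level has an odd number of nodes, the last node is
    duplicated. Other conventions exist and the paper may use a different one.
    """
    if not records:
        raise ValueError("need at least one record")
    # Leaf level: hash every record.
    level = [hashlib.sha256(r).digest() for r in records]
    # Pairwise-hash upward until a single root remains.
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


if __name__ == "__main__":
    sample = "the proof uses induction on the length of the list".split()
    print(f"entropy: {token_entropy(sample):.3f} bits/token")
    print("commitment:", merkle_root([b"row-1", b"row-2", b"row-3"]))
```

Under these assumptions, a seller could publish the Merkle root before training, and any later change to a record would produce a different root. Whitespace splitting stands in for a real tokenizer here.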
