2605.06343v1 May 07, 2026 cs.AI

Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models

T. M. S. Filho

Citations: 16

h-index: 3

Alex O. Davies

Citations: 16

h-index: 3

Nirav Ajmeri

Citations: 912

h-index: 16

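Tabular foundation models are pre-trained on one of three types of dataset: curated datasets drawn from benchmark repositories, tables collected at scale from the web, and synthetic tables sampled from a parameterised generative prior. Although pre-training data plays a central role in model performance, little research has examined how these datasets relate to one another in distribution or how that affects downstream performance. This study analyses three representative datasets used to train tabular foundation models: T4 (web-scraped tables), TabFM (tables curated from Kaggle), and TabICL, the only widely used synthetic prior with publicly available parameters. Each dataset is characterised with aggregate features over whole tables, columns, and correlations, and the datasets are compared using discriminator AUC and k-NN coverage metrics. The results show that the TabICL synthetic prior occupies a narrow region within the distribution of real tables, and that optimising over more than 86,000 hyperparameter configurations does not close this gap. Curated and web-scraped datasets, by contrast, are largely interchangeable at the distributional level in feature space. Surprisingly, the distributional gap between synthetic pre-training data and real tables has no clearly detectable effect on performance, whether measured by feature-based proximity or by TabICL's own internal representations. This suggests that coverage of the real-data distribution is not the main driver of TabICL's generalisation.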

Original Abstract

Tabular foundation models are pre-trained on one of three classes of corpus: curated datasets drawn from benchmark repositories, tables harvested at scale from the web, or synthetic tables sampled from a parametric generative prior. Despite the centrality of pre-training data to model performance, little is known about how these corpora relate to one another in distribution, or about the impact this has on downstream performance. In this work we take three canonical, archetypal datasets used to train tabular foundation models: the T4 dataset represents web-scraped corpora, the TabFM dataset represents curated tables from Kaggle, and the TabICL dataset is the only widely used synthetic prior with publicly available parameters. We characterise each corpus using aggregate features over whole tables, columns and correlations, and compare them using discriminator AUCs and k-NN coverage metrics. We find that the TabICL synthetic prior occupies a narrow region of the space of real tables, that this mismatch cannot be closed by optimising prior hyper-parameters across more than 86 thousand configurations, and that curated and web-scraped corpora are broadly interchangeable on a distributional level in feature space. Surprisingly, the distributional gap between synthetic pre-training data and real tables has no clearly detectable effect on performance under either feature-based proximity measures or TabICL's own internal representations, suggesting that coverage of the real-data distribution is not the primary driver of TabICL's generalisation.
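
The two comparison tools named in the abstract, discriminator AUC and k-NN coverage, can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so everything below is an assumption: the discriminator is a leave-one-out k-NN classifier (the authors may use a different model), coverage is taken to be the share of real points whose k-nearest-neighbour ball contains at least one synthetic point, and the Gaussian toy vectors merely stand in for per-table aggregate features. All function names are hypothetical.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def auc(pos_scores, neg_scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def discriminator_auc(real, synth, k=5):
    """Assumed discriminator: leave-one-out k-NN vote for the 'synthetic' label.

    About 0.5 means the two sets are hard to tell apart; near 1.0 means
    they are easily separated.
    """
    pooled = [(v, 0) for v in real] + [(v, 1) for v in synth]
    scores = []
    for i, (v, _) in enumerate(pooled):
        others = [(dist(v, w), lab) for j, (w, lab) in enumerate(pooled) if j != i]
        others.sort(key=lambda t: t[0])
        # Score = share of the k nearest other points that are synthetic.
        scores.append(sum(lab for _, lab in others[:k]) / k)
    synth_scores = [s for s, (_, lab) in zip(scores, pooled) if lab == 1]
    real_scores = [s for s, (_, lab) in zip(scores, pooled) if lab == 0]
    return auc(synth_scores, real_scores)


def knn_coverage(real, synth, k=5):
    """Assumed coverage: share of real points with a synthetic point inside
    the ball whose radius is the distance to their k-th nearest real neighbour."""
    covered = 0
    for i, r in enumerate(real):
        radius = sorted(dist(r, q) for j, q in enumerate(real) if j != i)[k - 1]
        if any(dist(r, s) <= radius for s in synth):
            covered += 1
    return covered / len(real)


def sample(n, dim, scale, rng):
    """Toy stand-in for per-table feature vectors (isotropic Gaussian)."""
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
real = sample(200, 4, 1.0, rng)
narrow = sample(200, 4, 0.1, rng)   # synthetic set squeezed into a small region
matched = sample(200, 4, 1.0, rng)  # synthetic set with the same spread as real

for name, synth in [("narrow", narrow), ("matched", matched)]:
    print(name, discriminator_auc(real, synth), knn_coverage(real, synth))
```

In this toy setting, a synthetic set confined to a narrow region should be easy to discriminate from the real set and should cover few real points, while a matched set should do the opposite. That mirrors the kind of gap the abstract reports, but says nothing about its actual size.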

0 Citations

0 Influential

8 Altmetric

40.0 Score
