2604.11003v1 Apr 13, 2026 cs.AI

Sanity Checks for Agentic Data Science

Haoyang Huang (Citations: 21, h-index: 3)

Zachary T. Rewolinski (Citations: 9, h-index: 2)

Austin Zane (Citations: 15, h-index: 2)

C. Singh (Citations: 400, h-index: 11)

Chenglong Wang (Citations: 39, h-index: 3)

Bin Yu (Citations: 108, h-index: 3)

Jianfeng Gao (Citations: 150, h-index: 5)

Abstract

Agentic data science (ADS) pipelines have grown rapidly in both capability and adoption, with systems such as OpenAI Codex now able to directly analyze datasets and produce answers to statistical questions. However, these systems can reach falsely optimistic conclusions that are difficult for users to detect. To address this, we propose a pair of lightweight sanity checks grounded in the Predictability-Computability-Stability (PCS) framework for veridical data science. These checks use reasonable perturbations to screen whether an agent can reliably distinguish signal from noise, acting as a falsifiability constraint that can expose affirmative conclusions as unsupported. Together, the two checks characterize the trustworthiness of an ADS output, e.g. whether it has found stable signal, is responding to noise, or is sensitive to incidental aspects of the input. We validate the approach on synthetic data with controlled signal-to-noise ratios, confirming that the sanity checks track ground-truth signal strength. We then demonstrate the checks on 11 real-world datasets using OpenAI Codex, characterizing the trustworthiness of each conclusion and finding that in 6 of the datasets an affirmative conclusion is not well-supported, even though a single ADS run may support one. We further analyze failure modes of ADS systems and find that ADS self-reported confidence is poorly calibrated to the empirical stability of its conclusions.
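The abstract does not spell out what the two checks are, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes one check destroys the signal by shuffling the outcome (an affirmative answer should then be rare) and the other applies a reasonable perturbation by bootstrap-resampling rows (an affirmative answer should then persist). The names `toy_agent`, `sanity_checks`, `shuffle_outcome`, `bootstrap_rows`, and the 0.3 correlation threshold are all hypothetical, and `toy_agent` is a trivial stand-in for a real ADS run such as a call to OpenAI Codex.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def toy_agent(xs, ys, threshold=0.3):
    """Hypothetical stand-in for one ADS run: True means 'x is related to y'."""
    return abs(pearson(xs, ys)) > threshold

def shuffle_outcome(xs, ys, rng):
    """Assumed noise check: permuting y removes any real x-y relationship."""
    ys2 = ys[:]
    rng.shuffle(ys2)
    return xs, ys2

def bootstrap_rows(xs, ys, rng):
    """Assumed stability check: resample rows with replacement."""
    idx = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in idx], [ys[i] for i in idx]

def affirmative_rate(agent, xs, ys, perturb, runs, rng):
    """Fraction of perturbed runs in which the agent answers affirmatively."""
    hits = 0
    for _ in range(runs):
        pxs, pys = perturb(xs, ys, rng)
        hits += agent(pxs, pys)
    return hits / runs

def sanity_checks(agent, xs, ys, runs=200, seed=0):
    rng = random.Random(seed)
    return {
        "noise_rate": affirmative_rate(agent, xs, ys, shuffle_outcome, runs, rng),
        "stable_rate": affirmative_rate(agent, xs, ys, bootstrap_rows, runs, rng),
    }

def make_data(n, signal, seed):
    """Synthetic data y = signal * x + noise, so signal strength is controlled."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [signal * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

strong = sanity_checks(toy_agent, *make_data(200, 1.0, seed=1))
null = sanity_checks(toy_agent, *make_data(200, 0.0, seed=2))
print("strong signal:", strong)
print("no signal:    ", null)
```

Under these assumptions, a high `stable_rate` together with a low `noise_rate` would suggest a stable signal, a high `noise_rate` would suggest the agent is responding to noise, and an intermediate `stable_rate` would suggest sensitivity to incidental aspects of the input.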
