2603.18447v1 Mar 19, 2026 cs.DB

SODIUM: From Open Web Data to Queryable Databases

Daniel Kang (citations: 251, h-index: 7)
Chuxuan Hu (citations: 30, h-index: 3)
Philip Li (citations: 0, h-index: 0)
Maxwell Yang (citations: 0, h-index: 0)

Abstract

During research, domain experts often ask analytical questions whose answers require integrating data from a wide range of web sources. Thus, they must spend substantial effort searching, extracting, and organizing raw data before analysis can begin. We formalize this process as the SODIUM task, where we conceptualize open domains such as the web as latent databases that must be systematically instantiated to support downstream querying. Solving SODIUM requires (1) conducting in-depth and specialized exploration of the open web, which is further strengthened by (2) exploiting structural correlations for systematic information extraction and (3) integrating collected information into coherent, queryable database instances. To quantify the challenges in automating SODIUM, we construct SODIUM-Bench, a benchmark of 105 tasks derived from published academic papers across 6 domains, where systems are tasked with exploring the open web to collect and aggregate data from diverse sources into structured tables. Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy. To bridge this gap, we develop SODIUM-Agent, a multi-agent system composed of a web explorer and a cache manager. Powered by our proposed ATP-BFS algorithm and optimized through principled management of cached sources and navigation paths, SODIUM-Agent conducts deep and comprehensive web exploration and performs structurally coherent information extraction. SODIUM-Agent achieves 91.1% accuracy on SODIUM-Bench, outperforming the strongest baseline by approximately 2 times and the weakest by up to 73 times.
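The abstract describes a pipeline of exploring web sources, extracting records, and integrating them into a queryable database instance. The following is a minimal sketch of that general idea only, not the paper's ATP-BFS algorithm or SODIUM-Agent, whose details the abstract does not give. It assumes a hypothetical in-memory "web" (`TOY_WEB`), a plain breadth-first traversal, a simple dictionary standing in for a cache, and an invented `facts` table schema; every URL and value is made up for illustration.

```python
import sqlite3
from collections import deque

# Hypothetical toy "web": URL -> (outgoing links, records found on that page).
# All URLs and values are invented for illustration only.
TOY_WEB = {
    "root": (["stats/2020", "stats/2021"], []),
    "stats/2020": (["stats/2021"], [("2020", "A", 1.0)]),
    "stats/2021": ([], [("2021", "A", 2.0), ("2021", "B", 3.0)]),
}


def fetch(url, cache):
    """Return (links, records) for url, reading each page at most once."""
    if url not in cache:
        cache[url] = TOY_WEB.get(url, ([], []))
    return cache[url]


def explore(start, max_depth=2):
    """Plain breadth-first traversal that collects records from every page reached."""
    cache = {}
    seen = {start}
    rows = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        links, records = fetch(url, cache)
        rows.extend(records)
        if depth < max_depth:
            for nxt in links:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return rows


def build_database(rows):
    """Load the collected rows into an in-memory SQLite table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facts (year TEXT, entity TEXT, value REAL)")
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", rows)
    return conn


rows = explore("root")
conn = build_database(rows)
total = conn.execute(
    "SELECT SUM(value) FROM facts WHERE year = '2021'"
).fetchone()[0]
print(total)
```

Once the rows sit in a table, the analytical question becomes an ordinary SQL query, which is the sense in which the collected web data is "queryable". A real system would replace `TOY_WEB` with actual page retrieval and extraction, which this sketch does not attempt.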

