2602.11745v1 Feb 12, 2026 cs.AI

Text2GQL-Bench: A Text to Graph Query Language Benchmark [Experiment, Analysis & Benchmark]

Songlin Lyu (Citations: 4, h-index: 1)
Lujie Ban (Citations: 4, h-index: 1)
Yuyu Luo (Citations: 269, h-index: 9)
Yongchao Liu (Citations: 519, h-index: 4)
Tianqi Luo (Citations: 84, h-index: 5)
Jirong Liu (Citations: 16, h-index: 1)
Chenhao Ma (Citations: 65, h-index: 5)
Nan Tang (Citations: 150, h-index: 3)
Shipeng Qi (Citations: 8, h-index: 2)
Heng Lin (Citations: 14, h-index: 3)
Chuntao Hong (Citations: 533, h-index: 4)
Zihan Wu (Citations: 5, h-index: 1)

Original Abstract

Graph models are fundamental to data analysis in domains rich with complex relationships. Text-to-Graph-Query-Language (Text-to-GQL) systems act as translators, converting natural language into executable graph queries. This capability allows Large Language Models (LLMs) to directly analyze and manipulate graph data, positioning them as powerful agent infrastructure for Graph Database Management Systems (GDBMS). Despite recent progress, existing datasets are often limited in domain coverage, supported graph query languages, or evaluation scope. The advancement of Text-to-GQL systems is hindered by the lack of high-quality benchmark datasets and evaluation methods for systematically comparing model capabilities across different graph query languages and domains. In this work, we present Text2GQL-Bench, a unified Text-to-GQL benchmark designed to address these limitations. Text2GQL-Bench couples a multi-GQL dataset of 178,184 (Question, Query) pairs spanning 13 domains with a scalable construction framework that generates datasets in different domains, question abstraction levels, and GQLs with heterogeneous resources. To support comprehensive assessment, we introduce an evaluation method that goes beyond a single end-to-end metric by jointly reporting grammatical validity, similarity, semantic alignment, and execution accuracy. Our evaluation uncovers a stark dialect gap in ISO-GQL generation: even strong LLMs achieve at most 4% execution accuracy (EX) in zero-shot settings, and although a fixed 3-shot prompt raises accuracy to around 50%, grammatical validity remains below 70%. Moreover, a fine-tuned 8B open-weight model reaches 45.1% EX and 90.8% grammatical validity, demonstrating that most of the performance jump is unlocked by exposure to sufficient ISO-GQL examples.
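The abstract does not define its metrics, so the following is only a minimal sketch of how two of them, grammatical validity and execution accuracy (EX), might be aggregated over an evaluation run. It assumes the convention common in text-to-SQL work: EX is the fraction of predicted queries whose execution result matches the gold query's result, compared here as an unordered multiset of rows. The names `EvalRecord`, `same_result`, and `summarize` are hypothetical and not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Hashable, Optional, Sequence


@dataclass
class EvalRecord:
    """One evaluated (question, query) pair. All field names are hypothetical."""
    parses: bool                                   # predicted query passed a grammar check
    predicted_rows: Optional[Sequence[Hashable]]   # None if the query could not be executed
    gold_rows: Sequence[Hashable]                  # rows returned by the reference query


def same_result(pred: Optional[Sequence[Hashable]], gold: Sequence[Hashable]) -> bool:
    """Compare results as multisets of rows, ignoring row order (an assumption)."""
    if pred is None:
        return False
    return Counter(pred) == Counter(gold)


def summarize(records: Sequence[EvalRecord]) -> Dict[str, float]:
    """Report grammatical validity and execution accuracy as fractions in [0, 1]."""
    n = len(records)
    if n == 0:
        return {"grammatical_validity": 0.0, "execution_accuracy": 0.0}
    valid = sum(1 for r in records if r.parses)
    correct = sum(1 for r in records if same_result(r.predicted_rows, r.gold_rows))
    return {
        "grammatical_validity": valid / n,
        "execution_accuracy": correct / n,
    }


# Toy data, invented purely for illustration.
records = [
    EvalRecord(True, [("a", 1)], [("a", 1)]),        # valid and correct
    EvalRecord(True, [("b",)], [("c",)]),            # valid but wrong result
    EvalRecord(False, None, [("d",)]),               # does not parse, cannot execute
    EvalRecord(True, [(2,), (1,)], [(1,), (2,)]),    # correct up to row order
]
report = summarize(records)
```

Reporting the two numbers side by side is what lets a gap like the one in the abstract show up: a model can produce mostly well-formed queries while still returning the wrong results, or the reverse.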
