2604.05114v1 Apr 06, 2026 cs.CL

$π^2$: Structure-Originated Reasoning Data Improves Long-Context Reasoning Ability of Large Language Models

Pratibha Zunjare

Citations: 43

h-index: 2

Sha Li

Citations: 5

h-index: 1

Quyet V. Do

Hong Kong University of Science and Technology

Citations: 1,823

h-index: 6

Thinh Pham

Virginia Tech

Citations: 119

h-index: 5

N. Nguyên

Citations: 1

h-index: 1

Tu Vu

Citations: 12

h-index: 2

Abstract

We study a pipeline that curates reasoning data from initial structured data to improve long-context reasoning in large language models (LLMs). Our approach, $π^2$, constructs high-quality reasoning data through rigorous QA curation: 1) extracting and expanding tables from Wikipedia; 2) generating, from the collected tables and relevant context, realistic multi-hop analytical reasoning questions whose answers are automatically determined and verified through dual-path code execution; and 3) back-translating step-by-step structured reasoning traces as solutions to the QA pairs, given realistic web-search context. Supervised fine-tuning of gpt-oss-20b and Qwen3-4B-Instruct-2507 on $π^2$ yields consistent improvements across four long-context reasoning benchmarks and our own similarly constructed $π^2$-Bench, with average absolute accuracy gains of +4.3% and +2.7%, respectively. Notably, our dataset facilitates self-distillation: gpt-oss-20b improves its average performance by +4.4% when trained on its own reasoning traces, demonstrating the usefulness of $π^2$. Our code, data, and models are open-source at https://github.com/vt-pi-squared/pi-squared.
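The abstract does not spell out how dual-path code execution works, so the following is only a minimal sketch of one plausible reading: two independently written programs compute the answer from the same table, and the QA pair is kept only when both run and agree. The table contents, the question ("how large is the gap between the highest and lowest score?"), and the names `path_a`, `path_b`, `verify_dual_path`, and `TOY_TABLE` are all hypothetical and do not come from the paper or its repository.

```python
from typing import Callable, Optional, Union

Row = dict[str, Union[str, float]]
Answer = Union[str, float]


def path_a(rows: list[Row]) -> Answer:
    """Path A: gap between highest and lowest score via built-in max/min."""
    scores = [float(r["points"]) for r in rows]
    return max(scores) - min(scores)


def path_b(rows: list[Row]) -> Answer:
    """Path B: the same quantity, computed by sorting and differencing the ends."""
    ordered = sorted(rows, key=lambda r: float(r["points"]))
    return float(ordered[-1]["points"]) - float(ordered[0]["points"])


def verify_dual_path(
    rows: list[Row],
    first: Callable[[list[Row]], Answer],
    second: Callable[[list[Row]], Answer],
    tol: float = 1e-9,
) -> Optional[Answer]:
    """Return the agreed answer, or None if a path fails or the two disagree."""
    try:
        a, b = first(rows), second(rows)
    except Exception:
        # A crashing path means the answer cannot be verified (assumed policy).
        return None
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a if abs(a - b) <= tol else None
    return a if a == b else None


# Toy data, invented purely for illustration.
TOY_TABLE: list[Row] = [
    {"team": "A", "points": 10.0},
    {"team": "B", "points": 25.0},
    {"team": "C", "points": 17.0},
]

if __name__ == "__main__":
    print(verify_dual_path(TOY_TABLE, path_a, path_b))
```

Under this reading, a question whose two code paths disagree or crash would simply be discarded, which is one way an answer could be "automatically determined and verified" without human labeling.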

0 Citations

0 Influential

26.4657359028 Altmetric

132.3 Score
