2603.03790v1 Mar 04, 2026 cs.CL

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Wei Wen (Citations: 54, h-index: 4)
Qinsi Wang (Citations: 228, h-index: 8)
Jinghan Ke (Citations: 64, h-index: 4)
Yifei Wang (Citations: 34, h-index: 2)
Martin Kuo (Citations: 484, h-index: 7)
Zishan Shao (Citations: 31, h-index: 3)
Yueqian Lin (Citations: 386, h-index: 9)
Qiyang Qian (Citations: 13, h-index: 1)
Hancheng Ye (Citations: 78, h-index: 6)
Jinhee Kim (Citations: 28, h-index: 1)
Dongting Li (Citations: 18, h-index: 3)
Ting Jiang (Citations: 21, h-index: 3)
Chiyue Wei (Citations: 97, h-index: 6)
Helen Li (Citations: 28, h-index: 1)
Yiran Chen (Citations: 104, h-index: 6)

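When people tackle complex reading-comprehension tasks, they identify the key content, infer the relationships among those points, and structure the information to aid understanding and produce a response. Can large language models likewise use text structure to improve their text-processing performance? To explore this possibility, this work introduces Structure of Thought (SoT), a prompting technique that guides a model to explicitly generate an intermediate text structure. SoT consistently improved performance across eight tasks and three model families. Building on this insight, the work presents T2S-Bench, the first benchmark designed to evaluate and improve models' text-to-structure capability. T2S-Bench consists of 1,800 samples spanning 6 scientific domains and 32 structure types, and it was rigorously constructed to ensure accuracy, fairness, and quality. An evaluation of 45 mainstream models shows substantial room for improvement: average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model reaches only 58.1% node accuracy in end-to-end extraction. Moreover, on Qwen2.5-7B-Instruct, SoT alone yielded an average +5.7% improvement across eight diverse text-processing tasks, and additional fine-tuning on T2S-Bench raised this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. The dataset and evaluation code are available at https://t2s-bench.github.io/T2S-Bench-Page/.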
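To make the idea of SoT concrete, here is a minimal sketch of a prompt that asks a model to write an intermediate structure before answering. The paper's actual prompt wording is not given on this page, so the instruction text, the node/edge notation, and the function name `build_sot_prompt` are all illustrative assumptions, not the authors' template.

```python
def build_sot_prompt(passage: str, question: str) -> str:
    """Assemble a structure-first prompt (hypothetical wording, not the paper's)."""
    # Assumed instruction text: ask for nodes and edges first, then the answer.
    instructions = (
        "Read the passage and work in two steps.\n"
        "Step 1 - Structure: list the key points as nodes (N1, N2, ...) "
        "and the relationships between them as edges "
        "(for example, N1 -> N2: causes).\n"
        "Step 2 - Answer: use the structure from Step 1 to answer the question."
    )
    # End with a "Structure:" cue so the model begins with the intermediate structure.
    return (
        f"{instructions}\n\n"
        f"Passage:\n{passage}\n\n"
        f"Question:\n{question}\n\n"
        "Structure:\n"
    )


if __name__ == "__main__":
    print(build_sot_prompt(
        "Heating the sample raises its pressure, which opens the valve.",
        "What opens the valve?",
    ))
```

The returned string can be sent to any chat or completion model. The only design choice illustrated is ordering: the structure is requested before the answer, so the answer can be conditioned on it.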

Original Abstract

Think about how humans handle complex reading tasks: marking key points, inferring their relationships, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance its text-processing performance? To explore this question, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures, consistently boosting performance across eight tasks and three model families. Building upon this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve the text-to-structure capabilities of models. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation on 45 mainstream models reveals substantial room for improvement: the average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model achieves only 58.1% node accuracy in end-to-end extraction. Furthermore, on Qwen2.5-7B-Instruct, SoT alone yields an average +5.7% improvement across eight diverse text-processing tasks, and fine-tuning on T2S-Bench further increases this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. The dataset and evaluation code have been released at https://t2s-bench.github.io/T2S-Bench-Page/.
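The abstract reports "node accuracy" for end-to-end extraction but does not define it here. As a rough illustration only, the sketch below assumes it means the fraction of reference nodes that the model recovers after light text normalization. The functions `normalize` and `node_accuracy` are hypothetical, and the benchmark's released evaluation code may match nodes differently.

```python
from __future__ import annotations


def normalize(label: str) -> str:
    """Lowercase a node label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def node_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Share of gold nodes found among predicted nodes (assumed definition)."""
    gold_set = {normalize(g) for g in gold}
    if not gold_set:
        return 0.0  # Nothing to recover; avoid dividing by zero.
    pred_set = {normalize(p) for p in predicted}
    return len(gold_set & pred_set) / len(gold_set)


if __name__ == "__main__":
    gold = ["Encoder", "Decoder", "Attention Layer", "Loss"]
    predicted = ["encoder", "attention  layer", "Optimizer"]
    # Two of the four gold nodes are matched after normalization.
    print(node_accuracy(predicted, gold))  # 0.5
```

Under this assumed definition, extra predicted nodes such as "Optimizer" are not penalized. A stricter variant would also report precision over the predicted set.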

0 Citations

0 Influential

4.5 Altmetric

22.5 Score
