2605.05702v1 May 07, 2026 cs.AI

Intermediate Supervision for Self-Evolving Search Agents Using Knowledge-Graph Paths

Knowledge-Graph Paths as Intermediate Supervision for Self-Evolving Search Agents

Yan Gao

Citations: 44

h-index: 4

Yao Hu

Citations: 24

h-index: 3

Xiaochi Wei

Citations: 11

h-index: 2

Jun Liu

Citations: 154

h-index: 3

Yi Wu

Citations: 28

h-index: 4

Huyu Wu

Citations: 16

h-index: 2

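Self-evolving search agents generate and solve their own search tasks to reduce dependence on human-written training questions. This work builds on Search Self-Play (SSP), a representative Proposer-Solver framework that generates and answers questions through multi-step search and reasoning. In practice, however, SSP faces two problems. First, the Proposer builds questions from isolated answer entities with no relational context, so early self-play produces many invalid or unverifiable questions. Second, the Solver receives only a binary reward, so it cannot exploit the useful signal in partially correct search trajectories. To address both problems, the work reuses knowledge-graph paths as intermediate supervision for question construction and reward design. First, it grounds question construction in LLM-guided knowledge-graph subgraphs, giving the Proposer relational context. Second, it observes that constructing and solving a multi-hop question can involve overlapping intermediate entities: the factual bridges used to formulate a question can serve as approximate waypoints for answering it. Exploiting this overlap, the work introduces the Waypoint Coverage Reward (WCR), which gives an incorrect Solver trajectory partial credit according to how many entities on the question-construction path it covers, while correct answers keep the full reward. Across seven question-answering benchmarks and nine model configurations, the method improves the average score over standard SSP in every configuration, with notable gains on multi-hop QA tasks. The results suggest that knowledge-graph paths can be reused as lightweight intermediate supervision that provides relational guidance and process feedback without additional task-specific human annotation or manually labeled process steps.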
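The idea of grounding question construction in a knowledge-graph path can be illustrated with a small sketch. The summary describes LLM-guided subgraph selection; the code below swaps in a plain random walk as a stand-in, and the names `sample_path` and `render_context`, the adjacency-list format, and the toy graph are all assumptions made for illustration, not the paper's implementation.

```python
import random
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)
Graph = Dict[str, List[Tuple[str, str]]]  # head -> [(relation, tail), ...]


def sample_path(graph: Graph, start: str, hops: int, rng: random.Random) -> List[Triple]:
    """Random walk of up to `hops` edges that never revisits an entity.

    Hypothetical stand-in for LLM-guided subgraph selection.
    """
    path: List[Triple] = []
    visited = {start}
    current = start
    for _ in range(hops):
        candidates = [(r, t) for r, t in graph.get(current, []) if t not in visited]
        if not candidates:
            break  # dead end: return the shorter path
        relation, tail = rng.choice(candidates)
        path.append((current, relation, tail))
        visited.add(tail)
        current = tail
    return path


def render_context(path: List[Triple]) -> str:
    """Format the path as relational context that a Proposer prompt could include."""
    return "\n".join(f"{h} --{r}--> {t}" for h, r, t in path)


# Toy graph (illustrative data only).
toy_graph: Graph = {
    "Marie Curie": [("born_in", "Warsaw")],
    "Warsaw": [("capital_of", "Poland")],
}
path = sample_path(toy_graph, "Marie Curie", hops=2, rng=random.Random(0))
print(render_context(path))
```

In this sketch the final tail entity would play the role of the answer, and the earlier entities on the path are the factual bridges that a question can be written around.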

Original Abstract

Self-evolving search agents reduce reliance on human-written training questions by generating and solving their own search tasks. We build on Search Self-Play (SSP), a representative Proposer and Solver framework in which questions are generated and answered via multi-step search and reasoning. In practice, however, SSP faces two bottlenecks: the Proposer constructs questions from isolated answer entities without relational context, yielding many invalid or unverifiable questions in early self-play training, while the Solver receives only a binary outcome reward that discards useful signal from partially on-track search trajectories. We address both bottlenecks by reusing knowledge-graph paths as construction-derived intermediate supervision for both question construction and reward shaping. First, we ground question construction in LLM-guided knowledge-graph subgraphs, providing relational context for the Proposer. Second, we observe that constructing and solving a multi-hop question can involve overlapping intermediate entities: the factual bridges used to formulate the question may provide approximate waypoints for answering it. Exploiting this overlap, we introduce Waypoint Coverage Reward (WCR), which grants graded partial credit to incorrect Solver trajectories according to their coverage of entities on the construction path, while preserving full reward for correct answers. Across seven QA benchmarks and nine model configurations, our approach improves the average score over standard SSP in all configurations, including notable gains on multi-hop QA tasks. These results suggest that knowledge-graph paths can be reused as lightweight intermediate supervision, providing both relational guidance and process feedback without additional task-specific human annotations or manually labeled process steps.
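The reward described above can be sketched as a small function. The abstract states only that correct answers keep the full reward and that incorrect trajectories earn graded partial credit by covering entities on the construction path; the exact formula is not given here. The version below therefore rests on assumptions: coverage is measured as the fraction of waypoint entities whose names appear in the trajectory text (case-insensitive substring match), and a hypothetical `partial_cap` parameter keeps partial credit below the reward for a correct answer.

```python
from typing import Sequence


def waypoint_coverage(trajectory_text: str, waypoints: Sequence[str]) -> float:
    """Fraction of waypoint entities mentioned in the trajectory text.

    Assumption: a case-insensitive substring match counts as "covered".
    """
    if not waypoints:
        return 0.0
    text = trajectory_text.lower()
    hits = sum(1 for w in waypoints if w.lower() in text)
    return hits / len(waypoints)


def waypoint_coverage_reward(
    answer_correct: bool,
    trajectory_text: str,
    waypoints: Sequence[str],
    partial_cap: float = 0.5,  # hypothetical ceiling for partial credit
) -> float:
    """Full reward for a correct answer; otherwise coverage-scaled partial credit."""
    if answer_correct:
        return 1.0
    return partial_cap * waypoint_coverage(trajectory_text, waypoints)


# Illustrative call: an incorrect trajectory that touched 2 of 4 waypoints.
reward = waypoint_coverage_reward(
    answer_correct=False,
    trajectory_text="search: Warsaw ... search: Poland",
    waypoints=["Warsaw", "Poland", "Curie", "Sorbonne"],
)
print(reward)
```

Keeping `partial_cap` strictly below 1.0 means a trajectory that visits every waypoint but still answers incorrectly cannot match the reward of a correct answer, which is consistent with the stated goal of preserving the full reward for correct answers.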
