2604.12967v1 Apr 14, 2026 cs.AI

Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

Hayeon Lee

Citations: 330

h-index: 9

Cho-Jui Hsieh

Citations: 0

h-index: 0

Alexander Min Meta Superintelligence Labs

Citations: 0

h-index: 0

UCLA

Citations: 153

h-index: 1

Sohyun An

Citations: 60

h-index: 3

Shuibenyang Yuan

Citations: 0

h-index: 0

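Reinforcement learning (RL) has shown strong potential for optimizing search agents on complex information retrieval tasks. However, existing approaches rely mainly on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a framework inspired by cycle-consistency techniques that trains search agents without gold supervision. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, is a lossless encoding of the question's intent. A high-quality trajectory should therefore preserve the information needed to accurately reconstruct the original question, which induces a reward signal for policy optimization. However, a naive cycle-consistency objective is vulnerable to information leakage, because reconstruction can rely on superficial lexical cues rather than on the underlying search process. To reduce this effect, we apply information bottlenecks: excluding the final response and applying named entity recognition (NER) masking to search queries. These constraints force reconstruction to draw on both the retrieved observations and the structural scaffold, so that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to baselines trained with gold supervision and outperforms prior methods that do not use gold supervision. These results suggest that CCS offers a scalable paradigm for training search agents in settings where gold supervision is unavailable.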

Original Abstract

Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question's intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable training paradigm for training search agents in settings where gold supervision is unavailable.
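
The reward described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Step` and `Trajectory` types, the `mask_entities` helper (a string-matching stand-in for a real NER model), the `reconstruct` callable (a placeholder for whatever model rebuilds the question), and the token-level F1 similarity are all assumptions introduced here for illustration. The abstract does not specify how reconstruction quality is scored.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    query: str        # search query issued by the agent
    observation: str  # text returned by the retriever


@dataclass
class Trajectory:
    steps: List[Step]
    final_response: str


def mask_entities(text: str, entities: Sequence[str], token: str = "[ENT]") -> str:
    """Toy stand-in for NER masking: replace known entity strings with a placeholder."""
    # Mask longer entities first so a short entity cannot split a longer one.
    for ent in sorted(entities, key=len, reverse=True):
        text = re.sub(re.escape(ent), token, text, flags=re.IGNORECASE)
    return text


def bottlenecked_view(traj: Trajectory, entities: Sequence[str]) -> str:
    """Serialize a trajectory with both bottlenecks applied.

    Queries are entity-masked, observations are kept as-is, and the
    final response is deliberately left out.
    """
    lines = []
    for i, step in enumerate(traj.steps, start=1):
        lines.append(f"query {i}: {mask_entities(step.query, entities)}")
        lines.append(f"observation {i}: {step.observation}")
    return "\n".join(lines)


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, used here as an assumed similarity measure."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cycle_reward(
    question: str,
    traj: Trajectory,
    entities: Sequence[str],
    reconstruct: Callable[[str], str],
) -> float:
    """Score a trajectory by how well the question can be rebuilt from it."""
    view = bottlenecked_view(traj, entities)
    return token_f1(reconstruct(view), question)
```

In this sketch the reconstructor never sees the final response or the entity names inside the queries, so a high reward requires the retrieved observations to carry enough information to recover the question. In an RL loop, `cycle_reward` would be evaluated per sampled trajectory and passed to the policy optimizer in place of an answer-correctness reward.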
