2602.03358v1 Feb 03, 2026 cs.AI


GFlowPO: Generative Flow Network as a Language Model Prompt Optimizer

Sung Ju Hwang (Citations: 157, h-index: 6)
Haebeom Lee (Citations: 643, h-index: 12)
Junmo Cho (Citations: 40, h-index: 2)
Suhan Kim (Citations: 19, h-index: 2)
Sangjune An (Citations: 17, h-index: 1)
Minsu Kim (Citations: 297, h-index: 7)
Dong Bok Lee (Citations: 706, h-index: 9)
Heejun Lee (Citations: 11, h-index: 1)


Original Abstract

Finding effective prompts for language models (LMs) is critical yet notoriously difficult: the prompt space is combinatorially large, and rewards are sparse due to expensive target-LM evaluation. Yet existing RL-based prompt optimizers often rely on on-policy updates and a meta-prompt sampled from a fixed distribution, leading to poor sample efficiency. We propose GFlowPO, a probabilistic prompt optimization framework that casts prompt search as a posterior inference problem over latent prompts regularized by a meta-prompted reference-LM prior. In the first step, we fine-tune a lightweight prompt-LM with an off-policy Generative Flow Network (GFlowNet) objective, using a replay-based training policy that reuses past prompt evaluations to enable sample-efficient exploration. In the second step, we introduce Dynamic Memory Update (DMU), a training-free mechanism that updates the meta-prompt by injecting both (i) diverse prompts from a replay buffer and (ii) top-performing prompts from a small priority queue, thereby progressively concentrating the search process on high-reward regions. Across few-shot text classification, instruction induction benchmarks, and question answering tasks, GFlowPO consistently outperforms recent discrete prompt optimization baselines.
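
The abstract describes Dynamic Memory Update only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes a replay buffer that keeps every evaluated prompt and a small min-heap that keeps the top-scoring ones. The class name `PromptMemory`, its methods, the meta-prompt wording, and the uniform sampling used to pick "diverse" prompts are all hypothetical choices made for illustration.

```python
import heapq
import random
from dataclasses import dataclass, field


@dataclass
class PromptMemory:
    """Hypothetical memory: a replay buffer plus a small top-k priority queue."""

    top_k: int = 3
    replay: list = field(default_factory=list)  # every (reward, prompt) pair seen
    top: list = field(default_factory=list)     # min-heap of the k best pairs

    def add(self, prompt: str, reward: float) -> None:
        """Record one evaluated prompt in both memories."""
        item = (reward, prompt)
        self.replay.append(item)
        if len(self.top) < self.top_k:
            heapq.heappush(self.top, item)
        elif item > self.top[0]:
            # Evict the weakest of the current top-k prompts.
            heapq.heapreplace(self.top, item)

    def build_meta_prompt(self, task: str, n_diverse: int, rng: random.Random) -> str:
        """Assemble a meta-prompt from diverse and top-performing prompts.

        Assumption: "diverse" is approximated by uniform sampling from the
        replay buffer; the paper may select these prompts differently.
        """
        best = sorted(self.top, reverse=True)
        best_prompts = {prompt for _, prompt in best}
        pool = [item for item in self.replay if item[1] not in best_prompts]
        diverse = rng.sample(pool, min(n_diverse, len(pool)))

        lines = [task, "Previously tried instructions:"]
        lines += [f"- {p} (score {r:.2f})" for r, p in diverse]
        lines.append("Best instructions so far:")
        lines += [f"- {p} (score {r:.2f})" for r, p in best]
        lines.append("Write a new instruction:")
        return "\n".join(lines)
```

In use, each prompt scored by the target LM would be passed to `add`, and `build_meta_prompt` would be called periodically to refresh the context given to the prompt-LM. No gradient step is involved, which is consistent with the abstract's description of DMU as training-free.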

