2603.10473v1 Mar 11, 2026 cs.CL

Aligning Large Language Models with Searcher Preferences

Liyi Chen (Citations: 300, h-index: 7)
Qimeng Wang (Citations: 93, h-index: 4)
Chengqiang Lu (Citations: 96, h-index: 3)
Yi Wu (Citations: 31, h-index: 4)
Yao Hu (Citations: 22, h-index: 3)
Hui Xiong (Citations: 30, h-index: 1)
Yan Gao (Citations: 11, h-index: 2)
Peilun Zhou (Citations: 65, h-index: 4)
Wei Wu (Citations: 89, h-index: 5)

Original Abstract

The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.
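The abstract names two mechanisms without giving formulas: a gated aggregation of the score vector into a scalar training reward, and GRPO, which scores each sampled response relative to the others in its group. The sketch below is one plausible reading, not the authors' implementation. The dimension names, the threshold `GATE_THRESHOLD`, the penalty `GATE_PENALTY`, and the unweighted mean over behavior scores are all assumptions made for illustration; only the group-wise standardization of rewards follows the usual GRPO formulation.

```python
from statistics import mean, pstdev

# Hypothetical dimension names and constants; the abstract does not specify them.
BOTTOM_LINE = ("factual_grounding", "answer_quality", "format_compliance")
BEHAVIOR = ("retrieval_robustness", "user_need_alignment")
GATE_THRESHOLD = 0.5  # assumed pass mark for each bottom-line dimension
GATE_PENALTY = -1.0   # assumed reward when any bottom-line check fails


def gated_reward(scores: dict[str, float]) -> float:
    """Collapse a per-dimension score vector into one scalar reward.

    Bottom-line constraints act as gates: if any of them fails, the
    behavior scores are ignored and a fixed penalty is returned.
    """
    if any(scores[d] < GATE_THRESHOLD for d in BOTTOM_LINE):
        return GATE_PENALTY
    # Assumed aggregation: unweighted mean of the behavior objectives.
    return mean(scores[d] for d in BEHAVIOR)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical responses to the same query; the second fails a gate.
    group = [
        {"factual_grounding": 0.9, "answer_quality": 0.8, "format_compliance": 1.0,
         "retrieval_robustness": 0.7, "user_need_alignment": 0.9},
        {"factual_grounding": 0.2, "answer_quality": 0.9, "format_compliance": 1.0,
         "retrieval_robustness": 0.9, "user_need_alignment": 0.9},
        {"factual_grounding": 0.8, "answer_quality": 0.7, "format_compliance": 1.0,
         "retrieval_robustness": 0.5, "user_need_alignment": 0.6},
    ]
    rewards = [gated_reward(s) for s in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under these assumptions, a response that scores well on the behavior objectives but fails factual grounding still receives the penalty, which is the point of separating bottom-line constraints from optimization objectives.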
