2604.27747v1 Apr 30, 2026 cs.IR

Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation

Qingpeng Cai (Citations: 68, h-index: 4)
Peng Jiang (Citations: 40, h-index: 4)
Jiaju Chen (Citations: 27, h-index: 3)
Chongming Gao (Citations: 37, h-index: 3)
Chenxiao Fan (Citations: 45, h-index: 3)
Haoyan Liu (Citations: 16, h-index: 2)
Xiangnan He (Citations: 941, h-index: 18)

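Generative list-wise recommendation built on large language models (LLMs) has progressed quickly, but decoding is still sequential and therefore prone to high latency. Speculative decoding (SD) is an existing technique that speeds up inference without changing the target distribution: a small draft model proposes several next tokens at once, and the target LLM verifies them and accepts the longest matching prefix, so one round can skip multiple decoding steps. In generative recommendation, however, each item is represented by several semantic-ID tokens, often with separators between items. Current draft models typically treat all of these tokens the same way, which overlooks two practical facts: (i) the meaning of a token depends on its slot within the item, and (ii) uncertainty tends to grow as speculation depth increases. If these effects are not modeled, the speedup from SD can be limited. The paper introduces PAD-Rec (Position-Aware Drafting for generative Recommendation), a lightweight module that adds two complementary signals to the draft model. Item position embeddings explicitly encode each token's slot within its item, strengthening structural awareness. Step position embeddings encode the draft step, letting the model adapt to depth-dependent uncertainty and improve proposal quality. To combine these signals with the base features, the module uses simple gates: a learnable coefficient for item slots and a context-driven gate for draft steps. The module is trainable, integrates easily with standard draft models, and adds negligible inference overhead. Extensive experiments on four real-world datasets show up to 3.1x wall-clock speedup and an average wall-clock speedup gain of about 5% over strong SD baselines, while recommendation quality is largely preserved.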
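The draft-then-verify loop of speculative decoding can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes greedy decoding, and the names `speculative_round`, `generate`, `toy_target`, and `toy_draft` are hypothetical stand-ins for real models. A production system would verify all drafted tokens in a single batched forward pass of the target model; the loop below only mimics that logic.

```python
from typing import Callable, List, Sequence, Tuple

# A "model" here is any function mapping a token context to the next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_round(prefix: Sequence[int], draft_next: NextToken,
                      target_next: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1) The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks the proposal and keeps the longest matching prefix.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every drafted token matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prompt: Sequence[int], draft_next: NextToken, target_next: NextToken,
             k: int, max_new: int) -> Tuple[List[int], int]:
    """Decode max_new tokens; return the sequence and the number of rounds used."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_round(out, draft_next, target_next, k))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: Sequence[int]) -> int:
    tok = toy_target(ctx)
    # The draft agrees with the target except at every fifth context length.
    return tok if len(ctx) % 5 else (tok + 1) % 11


if __name__ == "__main__":
    tokens, rounds = generate([2], toy_draft, toy_target, k=3, max_new=12)
    print(tokens, rounds)
```

Because every accepted token is either confirmed or supplied by the target, the greedy output matches what the target alone would produce; the gain comes from needing fewer rounds than tokens whenever the draft is often right.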

Original Abstract

Large language model (LLM)-based generative list-wise recommendation has advanced rapidly, but decoding remains sequential and thus latency-prone. To accelerate inference without changing the target distribution, speculative decoding (SD) uses a small draft model to propose several next tokens at once and a target LLM to verify and accept the longest prefix, skipping multiple steps per round. In generative recommendation, however, each item is represented by multiple semantic-ID tokens, often with separators, and current drafts typically treat these tokens uniformly. This overlooks two practical facts: (i) a token's semantics depend on its within-item slot, and (ii) uncertainty tends to increase with speculation depth. Without modeling these effects, SD's speedups can be limited. We introduce PAD-Rec, Position-Aware Drafting for generative Recommendation, a lightweight module that augments the draft model with two complementary signals. Item position embeddings explicitly encode the within-item slot of each token, strengthening structural awareness. Step position embeddings encode the draft step, allowing the model to adapt to depth-dependent uncertainty and improve proposal quality. To harmonize these signals with base features, we add simple gates: a learnable coefficient for item slots and a context-driven gate for draft steps. The module is trainable, easy to integrate with standard draft models, and adds negligible inference overhead. Extensive experiments on four real-world datasets show up to 3.1x wall-clock speedup and about 5% average wall-clock speedup gain over strong SD baselines, while largely preserving recommendation quality.
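One way to picture the two position signals and their gates is the sketch below. It is an assumption-laden illustration, not the authors' architecture: the abstract does not give the fusion formula, so the additive form `h + alpha * item_emb + gate * step_emb`, the fixed number of tokens per item, the sigmoid gate, and the class name `PositionAwareFusion` are all hypothetical. Parameters are randomly initialized here; in a real system they would be learned with the draft model.

```python
import math
import random
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class PositionAwareFusion:
    """Adds item-slot and draft-step embeddings to a draft model's hidden state."""

    def __init__(self, dim: int, tokens_per_item: int, max_draft_steps: int,
                 seed: int = 0) -> None:
        rng = random.Random(seed)

        def table(rows: int) -> List[Vector]:
            return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]

        # Assumption: every item spans a fixed number of tokens (separator included).
        self.tokens_per_item = tokens_per_item
        self.item_emb = table(tokens_per_item)   # one vector per within-item slot
        self.step_emb = table(max_draft_steps)   # one vector per draft step
        self.alpha = 0.1                         # learnable coefficient for item slots
        self.gate_w = [rng.gauss(0.0, 0.02) for _ in range(dim)]  # context-driven gate
        self.gate_b = 0.0

    def fuse(self, hidden: Vector, token_index: int, draft_step: int) -> Vector:
        # Within-item slot of the token being drafted.
        slot = token_index % self.tokens_per_item
        # The step gate depends on the current hidden state (the "context").
        gate = sigmoid(sum(w * x for w, x in zip(self.gate_w, hidden)) + self.gate_b)
        item_vec = self.item_emb[slot]
        step_vec = self.step_emb[draft_step]
        return [x + self.alpha * i + gate * s
                for x, i, s in zip(hidden, item_vec, step_vec)]


if __name__ == "__main__":
    fusion = PositionAwareFusion(dim=4, tokens_per_item=4, max_draft_steps=3)
    h = [0.5, -1.0, 0.25, 2.0]
    print(fusion.fuse(h, token_index=5, draft_step=1))
```

The scalar coefficient keeps the item-slot signal cheap, while the gate computed from the hidden state lets the step signal vary with context, which is the distinction the abstract draws between the two gates.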

0 Citations

0 Influential

9 Altmetric

45.0 Score
