2603.29112v1 Mar 31, 2026 cs.AI

GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification

Xiangjun Fan (Citations: 98, h-index: 5)
Iordanis Fostiropoulos (Citations: 142, h-index: 6)
Muhammad Azhar (Citations: 9, h-index: 2)
Abdalaziz Sawwan (Citations: 31, h-index: 2)
Bo-Xuan Fang (Citations: 9, h-index: 2)
Jiayi Liu (Citations: 27, h-index: 2)
Hanchao Yu (Citations: 34, h-index: 2)
Qingxing Guo (Citations: 0, h-index: 0)
Fei Liu (Citations: 18, h-index: 2)
Yuchen Liu (Citations: 75, h-index: 4)
Jianyu Wang (Citations: 40, h-index: 3)

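This paper introduces GISTBench, a benchmark that evaluates how well large language models (LLMs) understand users from their interaction histories in recommendation systems. Unlike existing recommender-system benchmarks, which focus on item prediction accuracy, GISTBench evaluates how well LLMs extract and verify user interests from engagement data. We propose two new metric families. First, Interest Groundedness (IG) is decomposed into precision and recall components, which separately penalize hallucinated interest categories and reward coverage. Second, Interest Specificity (IS) assesses the distinctiveness of the verified user profiles predicted by the LLM. We release a synthetic dataset built on real user interactions from a global short-form video platform; it contains both implicit and explicit engagement signals along with rich textual descriptions. We validate the dataset's fidelity against user surveys and evaluate eight open-weight LLMs ranging from 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly a limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.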

Original Abstract

We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
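The abstract does not state the formulas behind Interest Groundedness, so the following is only a rough sketch of what a precision/recall decomposition over interest categories could look like. It assumes, purely for illustration, that both the LLM-predicted interests and the evidence-verified interests are plain sets of category labels; the function name `groundedness_precision_recall` and the example labels are hypothetical and not taken from the paper.

```python
def groundedness_precision_recall(predicted, verified):
    """Illustrative set-based precision/recall over interest labels.

    Assumption: interests are exact-match category strings. The paper's
    actual IG definition may differ (e.g. evidence-weighted matching).
    """
    predicted, verified = set(predicted), set(verified)
    grounded = predicted & verified  # predicted interests backed by evidence

    # Precision drops when the model predicts interests with no support
    # (hallucinated categories).
    precision = len(grounded) / len(predicted) if predicted else 0.0
    # Recall rises as more of the verified interests are covered.
    recall = len(grounded) / len(verified) if verified else 0.0
    return precision, recall


# Hypothetical example: one unsupported prediction, two missed interests.
predicted = ["cooking", "basketball", "astrology"]
verified = ["cooking", "basketball", "travel", "gardening"]
p, r = groundedness_precision_recall(predicted, verified)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case the unsupported "astrology" prediction lowers precision, while the missed "travel" and "gardening" interests lower recall, which mirrors the abstract's description of penalizing hallucination and rewarding coverage separately.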

0 Citations

0 Influential

3 Altmetric

15.0 Score
