2602.11165v1 Jan 17, 2026 cs.CL

Assessing LLM Reliability on Temporally Recent Open-Domain Questions

Aman Chadha

Vinija Jain

Amitava Das

Pushwitha Krishnappa

Tathagata Mukherjee

Abstract

Large Language Models (LLMs) are increasingly deployed for open-domain question answering, yet their alignment with human perspectives on temporally recent information remains underexplored. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a benchmark dataset of 15,000 recent Reddit questions from September 2025 paired with community-derived reference answers. We investigate how four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) respond to these questions, evaluating alignment using lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). Our central finding is a striking semantic-lexical paradox: all models achieve over 99% cosine similarity with references despite less than 8% BLEU-1 overlap, a 90+ percentage point gap indicating that models preserve meaning through extensive paraphrasing rather than lexical reproduction. MoverScore (51-53%) confirms this pattern, occupying an intermediate position that reflects the optimal transport cost of semantic alignment. Furthermore, model scale does not predict performance: Mistral-7B (7B parameters) outperforms GPT-OSS-20B (20B parameters) across all metrics. NLI analysis reveals that contradiction rates remain below 7%, suggesting models rarely generate content that directly conflicts with human consensus. These findings challenge the reliability of lexical metrics for evaluating abstractive generation and argue for multi-dimensional evaluation frameworks that capture semantic fidelity beyond surface-level text matching. The RECOM dataset is publicly available at https://anonymous.4open.science/r/recom-D4B0
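The gap between lexical and semantic scores described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' evaluation pipeline: it assumes whitespace tokenization, a single reference, and the standard BLEU brevity penalty, and the function names `bleu1` and `cosine` are hypothetical. In practice the cosine score would be computed over sentence embeddings from a model the abstract does not name; here the function simply accepts any two numeric vectors.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """BLEU-1 sketch: clipped unigram precision times the brevity penalty.

    Assumes lowercase whitespace tokenization and one reference answer.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    cand_counts = Counter(cand)
    # Each candidate token is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    precision = clipped / len(cand)
    # Candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the RECOM dataset.
    reference = "restart the router and update its firmware"
    candidate = "try rebooting your modem then install the latest software"
    print(f"BLEU-1: {bleu1(candidate, reference):.3f}")
```

A paraphrase such as the one above shares almost no tokens with its reference, so its BLEU-1 stays low even when an embedding-based cosine score would be high, which is the kind of divergence the abstract reports.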
