2604.04815v1 Apr 06, 2026 cs.CL

LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection

Changhong Jin

Shuhao Guan

Nan Yan

Yuke Mei

Mohand-Tahar Kechadi

Chen Xu

Yingjie Niu

University College Dublin

Liming Chen

Original Abstract

The rapid development of Large Language Models (LLMs) has transformed fake news detection and fact-checking tasks from simple classification to complex reasoning. However, evaluation frameworks have not kept pace. Current benchmarks are static, making them vulnerable to benchmark data contamination (BDC) and ineffective at assessing reasoning under temporal uncertainty. To address this, we introduce LiveFact, a continuously updated benchmark that simulates the real-world "fog of war" in misinformation detection. LiveFact uses dynamic, temporal evidence sets to evaluate models on their ability to reason with evolving, incomplete information rather than on memorized knowledge. We propose a dual-mode evaluation: Classification Mode for final verification and Inference Mode for evidence-based reasoning, along with a component to monitor BDC explicitly. Tests with 22 LLMs show that open-source Mixture-of-Experts models, such as Qwen3-235B-A22B, now match or outperform proprietary state-of-the-art systems. More importantly, our analysis finds a significant "reasoning gap." Capable models exhibit epistemic humility by recognizing unverifiable claims in early data slices, an aspect that traditional static benchmarks overlook. LiveFact sets a sustainable standard for evaluating robust, temporally aware AI verification.
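
The idea of evaluating against temporal evidence sets can be illustrated with a minimal sketch. This is not the LiveFact implementation: the `Evidence` type, the `evidence_slice` and `accuracy_by_slice` helpers, the slice names, and the label strings (including "unverifiable") are all hypothetical, chosen only to show how a model might be scored separately on early and late views of the same claim.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Evidence:
    """One dated piece of evidence (hypothetical structure)."""
    published: date
    text: str


def evidence_slice(evidence, cutoff):
    """Keep only evidence published on or before the cutoff date."""
    return [item for item in evidence if item.published <= cutoff]


def accuracy_by_slice(records):
    """Compute accuracy per time slice.

    records: iterable of (slice_name, gold_label, predicted_label) tuples.
    The gold label may differ between slices of the same claim, for
    example "unverifiable" early on and "false" once more evidence exists.
    """
    totals, hits = {}, {}
    for slice_name, gold, predicted in records:
        totals[slice_name] = totals.get(slice_name, 0) + 1
        hits[slice_name] = hits.get(slice_name, 0) + int(gold == predicted)
    return {name: hits[name] / totals[name] for name in totals}


if __name__ == "__main__":
    # Illustrative data only; not taken from the benchmark.
    evidence = [
        Evidence(date(2026, 1, 3), "Initial unconfirmed report."),
        Evidence(date(2026, 1, 5), "Official statement addressing the claim."),
    ]
    early_view = evidence_slice(evidence, date(2026, 1, 4))
    print(len(early_view), "item(s) visible in the early slice")

    records = [
        ("early", "unverifiable", "false"),
        ("early", "unverifiable", "unverifiable"),
        ("late", "false", "false"),
    ]
    print(accuracy_by_slice(records))
```

Under these assumptions, a model that commits to a verdict before the evidence supports one loses accuracy on the early slice, which is one way the "reasoning gap" described in the abstract could show up in a score.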
