arXiv:2601.04711v1 [cs.CL], Jan 08, 2026

DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs

A. Nguyen

Khanh Quoc Tran

Tin Van Huynh

University of Information Technology

Phuoc Nguyen

K. Nguyen

Cam-Tu Nguyen

Original Abstract

The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to hallucinate, that is, to generate fluent, plausible-sounding outputs that contradict or fabricate information. While hallucination detection has recently emerged as a priority in English-centric benchmarks, low-to-medium resource languages such as Vietnamese remain inadequately covered by standardized evaluation frameworks. This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese LLMs. We present the ViHallu dataset, comprising 10,000 annotated (context, prompt, response) triplets systematically partitioned into three categories: no hallucination, intrinsic hallucination, and extrinsic hallucination. The dataset incorporates three prompt types (factual, noisy, and adversarial) to stress-test model robustness. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to 32.83% for an encoder-only baseline, demonstrating that instruction-tuned LLMs with structured prompting and ensemble strategies substantially outperform generic architectures. However, the gap to perfect performance indicates that hallucination detection remains a challenging problem, particularly for intrinsic (contradiction-based) hallucinations. This work establishes a rigorous benchmark and explores a diverse range of detection methodologies, providing a foundation for future research into the trustworthiness and reliability of Vietnamese language AI systems.
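
The abstract reports results as macro-F1 over the three categories. As a rough illustration of how that metric behaves, the sketch below computes the unweighted mean of per-class F1 scores in plain Python. The label strings and the toy predictions are assumptions made for this example; they are not taken from the ViHallu data or from the challenge's official scorer.

```python
from collections import Counter

# Hypothetical label strings: the abstract names the three categories,
# but not the exact labels used in the dataset files.
LABELS = ("no", "intrinsic", "extrinsic")


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); a class with no gold and no
        # predicted examples is scored 0.0 in this sketch.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy data for illustration only.
gold = ["no", "no", "intrinsic", "intrinsic", "extrinsic", "extrinsic"]
pred = ["no", "no", "intrinsic", "no", "extrinsic", "intrinsic"]
print(round(macro_f1(gold, pred), 4))
```

Because every class contributes equally to the average, a system that does well on the "no hallucination" class but poorly on intrinsic hallucinations is penalized more than it would be under plain accuracy.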
