2601.22888v1 Jan 30, 2026 cs.CL

Should LLMs, $\textit{like}$, Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial

Steven Euijong Whang

Korea Advanced Institute of Science and Technology

Jio Oh

Dezhi Hong

Paul Vicinanza

Thomas Butler

Amani Namboori

Original Abstract

More than 80% of the 1.6 billion English speakers do not use Standard American English (SAE) and experience higher failure rates and stereotyped responses when interacting with LLMs as a result. Yet multi-dialectal performance remains underexplored. We introduce $\textbf{MDial}$, the first large-scale framework for generating multi-dialectal conversational data encompassing the three pillars of written dialect -- lexical (vocabulary), orthographic (spelling), and morphosyntactic (grammar) features -- for nine English dialects. Partnering with native linguists, we design an annotated and scalable rule-based LLM transformation to ensure precision. Our approach challenges the assumption that models should mirror users' morphosyntactic features, showing that up to 90% of the grammatical features of a dialect should not be reproduced by models. Independent evaluations confirm data quality, with annotators preferring MDial outputs over prior methods in 98% of pairwise comparisons for dialect naturalness. Using this pipeline, we construct the dialect-parallel $\textbf{MDialBench}$mark with 50k+ dialogs, resulting in 97k+ QA pairs, and evaluate 17 LLMs on dialect identification and response generation tasks. Even frontier models achieve under 70% accuracy, fail to reach 50% for Canadian English, and systematically misclassify non-SAE dialects as American or British. As dialect identification underpins natural language understanding, these errors risk cascading failures into downstream tasks.
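
As a rough illustration of the dialect identification evaluation described in the abstract, the sketch below computes per-dialect accuracy and tallies which wrong labels each dialect is confused with. It is not the authors' evaluation code: the function names, the label codes such as `en-CA`, and the toy data are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def per_dialect_accuracy(gold, predicted):
    """Return {dialect: accuracy} for parallel lists of gold and predicted labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {dialect: correct[dialect] / total[dialect] for dialect in total}


def confusions(gold, predicted):
    """Return {gold dialect: Counter of the wrong labels predicted for it}."""
    out = defaultdict(Counter)
    for g, p in zip(gold, predicted):
        if g != p:
            out[g][p] += 1
    return dict(out)


# Toy labels invented for illustration; these are not MDialBench data.
gold = ["en-CA", "en-CA", "en-IN", "en-IN", "en-US", "en-GB"]
pred = ["en-US", "en-CA", "en-GB", "en-IN", "en-US", "en-GB"]

acc = per_dialect_accuracy(gold, pred)
conf = confusions(gold, pred)

for dialect in sorted(acc):
    print(dialect, f"{acc[dialect]:.2f}", dict(conf.get(dialect, {})))
```

Reporting the confusion tally alongside accuracy is what would surface a pattern like the one the abstract reports, where non-SAE dialects are systematically labeled as American or British.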
