2602.22207v1 Feb 25, 2026 cs.CL

Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Martin T. Vechev

Hanna Yukhymenko

Anton Alexandrov

Abstract

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
