2604.18169v1 Apr 20, 2026 cs.CL

Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation

Ran Zhang, Steffen Eger, Arda Tezcan, Simone Paolo Ponzetto, Lieve Macken, Wei Zhao

Original Abstract

Large language models (LLMs) are increasingly used for creative tasks such as literary translation. Yet translational creativity remains underexplored and is rarely evaluated at scale, while source-text comprehension is typically studied in isolation, despite the fact that, in professional translation, comprehension and creativity are tightly intertwined. We address these gaps with a paired-task framework applied to literary excerpts from 11 books. Task 1 assesses source-text comprehension, and Task 2 evaluates translational creativity through Units of Creative Potential (UCPs), such as metaphors and wordplay. Using a scalable evaluation setup that combines expert human annotations with UCP-based automatic scoring, we benchmark 23 models and four creativity-oriented prompts. Our findings show that strong comprehension does not translate into human-level creativity: models often produce literal or contextually inappropriate renderings, with particularly large gaps for the more distant English-Chinese language pair. Creativity-oriented prompts yield only modest gains, and only one model, Mistral-Large, comes close to human-level creativity (0.167 vs. 0.246). Across all model-prompt combinations, only three exceed a creativity score of 0.1, while the rest remain at or near zero.
