2603.14501v1 Mar 15, 2026 cs.SE

CangjieBench: Evaluating LLM Performance on a Low-Resource General-Purpose Programming Language

CangjieBench: Benchmarking LLMs on a Low-Resource General-Purpose Programming Language

Junhang Cheng (Citations: 21, h-index: 3)

Fang Liu (Citations: 112, h-index: 7)

Jia Li, Tsinghua University (Citations: 25, h-index: 3)

Chengru Wu (Citations: 12, h-index: 2)

Nan Jiang (Citations: 168, h-index: 3)

Li Zhang (Citations: 37, h-index: 2)

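Large language models (LLMs) perform well on programming languages with abundant training data but struggle on low-resource ones. Prior work has focused mainly on domain-specific languages (DSLs), leaving general-purpose languages that suffer from data scarcity understudied. To close this gap, this work introduces CangjieBench, a contamination-free benchmark for Cangjie, a representative low-resource general-purpose language. The benchmark consists of 248 high-quality samples manually translated from HumanEval and ClassEval and covers both Text-to-Code and Code-to-Code tasks. A range of LLMs is systematically evaluated under four settings: Direct Generation, Syntax-Constrained Generation, Retrieval-Augmented Generation (RAG), and Agent. The experiments show that Direct Generation performs poorly, while Syntax-Constrained Generation offers the best balance between accuracy and computational cost. The Agent setting achieves the highest accuracy but consumes many tokens. In addition, Code-to-Code translation often underperforms Text-to-Code generation, which suggests that negative transfer can occur when models overfit to source-language patterns. The authors hope this work offers valuable insight into how well LLMs generalize to new and low-resource programming languages. Code and data are available at https://github.com/cjhCoder7/CangjieBench.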

Original Abstract

Large Language Models excel in high-resource programming languages but struggle with low-resource ones. Existing research related to low-resource programming languages primarily focuses on Domain-Specific Languages (DSLs), leaving general-purpose languages that suffer from data scarcity underexplored. To address this gap, we introduce CangjieBench, a contamination-free benchmark for Cangjie, a representative low-resource general-purpose language. The benchmark comprises 248 high-quality samples manually translated from HumanEval and ClassEval, covering both Text-to-Code and Code-to-Code tasks. We conduct a systematic evaluation of diverse LLMs under four settings: Direct Generation, Syntax-Constrained Generation, Retrieval-Augmented Generation (RAG), and Agent. Experiments reveal that Direct Generation performs poorly, whereas Syntax-Constrained Generation offers the best trade-off between accuracy and computational cost. The Agent setting achieves state-of-the-art accuracy but incurs high token consumption. Furthermore, we observe that Code-to-Code translation often underperforms Text-to-Code generation, suggesting a negative transfer phenomenon where models overfit to the source language patterns. We hope that our work will offer valuable insights into LLM generalization to unseen and low-resource programming languages. Our code and data are available at https://github.com/cjhCoder7/CangjieBench.
