2605.05267v1 May 06, 2026 cs.SE

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

Peiliang Cai (Citations: 47; h-index: 4)
Mingwei Liu (Citations: 185; h-index: 7)
Yanlin Wang (Citations: 651; h-index: 13)
Zibin Zheng (Citations: 1,144; h-index: 20)
Kaifeng He (Citations: 9; h-index: 2)
Xiaojun Zhang (Citations: 38; h-index: 3)
Chong Wang (Citations: 199; h-index: 5)
Kaifeng Huang (Citations: 544; h-index: 13)
Bihuan Chen (Citations: 3,776; h-index: 27)
Xin Peng (Citations: 451; h-index: 10)

Original Abstract

Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empirical evidence increasingly traces their root causes to imperfections within the training corpora. Yet, the specific mechanisms linking training data quality issues to generated code quality issues remain largely unmapped. This paper presents a systematic literature review of 114 primary studies to investigate how training data quality issues propagate into code generation. We establish a unified taxonomy that categorizes generated code quality issues across nine dimensions and training data quality issues into code and non-code attributes. Based on this taxonomy, we formalize a causal framework detailing 18 typical propagation mapping mechanisms. Furthermore, we synthesize state-of-the-art detection and mitigation techniques across the data, model, and generation lifecycles. The reviewed literature reveals a clear methodological shift: quality assurance is transitioning from reactive, heuristic-based post-generation filtering toward proactive, data-centric governance and closed-loop repair. Finally, we identify open challenges and outline research directions for developing reliable LLMs for code through integrated data curation and continuous evaluation. Our repository is available at https://github.com/SYSUSELab/From-Data-to-Code.
