2601.21115v1 Jan 28, 2026 cs.CL

Multi-task Code LLMs: Data Mix or Model Merge?

Boris Sobolev (Citations: 71, h-index: 1)

Mingzhi Zhu (Citations: 32, h-index: 3)

Rahul Krishna (Citations: 53, h-index: 4)

Raju Pavuluri (Citations: 346, h-index: 8)

Stacy Patterson (Citations: 48, h-index: 4)

Michele Merler (Citations: 132, h-index: 1)

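Recent research advocates deploying smaller, specialized code LLMs in agentic frameworks alongside frontier models, which has driven interest in efficient multi-task learning strategies that balance performance, constraints, and cost. This study compares two approaches to building small multi-task code LLMs: data mixing and model merging. Using two model families (Qwen Coder and DeepSeek Coder) at two sizes (2B and 7B parameters), the authors fine-tune for code generation and code summarization and evaluate on the HumanEval, MBPP, and CodeXGlue benchmarks. Model merging performs best overall at the larger scale across model families, retaining 96% of the specialized model's performance on code generation while preserving summarization capability. Notably, merged models sometimes outperform individually fine-tuned ones: the best configuration of the Qwen Coder 2.5 7B model reaches 92.7% Pass@1 on HumanEval, versus 90.9% for the model fine-tuned specifically for that task. At the smaller scale, data mixing is the preferred strategy. The authors also introduce a weight analysis technique to understand how different tasks affect model parameters and what that implies for merging strategies. The results suggest that careful merging and mixing strategies can combine task-specific capabilities without significant performance degradation, making them well suited to resource-constrained deployment scenarios.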
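The difference between the two strategies can be illustrated with a minimal Python sketch. It rests on several assumptions: the abstract does not say which merging algorithm or mixing ratio the authors use, so plain linear interpolation of parameters stands in for "model merging" and a simple shuffled pool stands in for "data mixing". The names `Weights`, `mix_datasets`, `merge_linear`, and `alpha` are hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical stand-in for a model checkpoint: parameter name -> flat values.
Weights = Dict[str, List[float]]


def mix_datasets(
    gen: Sequence[str], summ: Sequence[str], seed: int = 0
) -> List[Tuple[str, str]]:
    """Data mixing: pool both tasks' examples into one shuffled training set.

    A single model would then be fine-tuned once on the pooled data.
    """
    pooled = [("generation", ex) for ex in gen]
    pooled += [("summarization", ex) for ex in summ]
    random.Random(seed).shuffle(pooled)  # seeded for reproducibility
    return pooled


def merge_linear(w_gen: Weights, w_sum: Weights, alpha: float = 0.5) -> Weights:
    """Model merging: combine two separately fine-tuned checkpoints.

    Assumption: simple linear interpolation, not necessarily the paper's method.
    alpha = 1.0 returns the generation model, alpha = 0.0 the summarization model.
    """
    if w_gen.keys() != w_sum.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Weights = {}
    for name in w_gen:
        a, b = w_gen[name], w_sum[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]
    return merged
```

In this framing, data mixing costs one fine-tuning run on the combined data, while merging reuses two existing task-specific checkpoints and needs no further training, only a choice of `alpha`.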

Original Abstract

Recent research advocates deploying smaller, specialized code LLMs in agentic frameworks alongside frontier models, sparking interest in efficient strategies for multi-task learning that balance performance, constraints, and costs. We compare two approaches for creating small, multi-task code LLMs: data mixing versus model merging. We conduct extensive experiments across two model families (Qwen Coder and DeepSeek Coder) at two scales (2B and 7B parameters), fine-tuning them for code generation and code summarization tasks. Our evaluation on HumanEval, MBPP, and CodeXGlue benchmarks reveals that model merging achieves the best overall performance at larger scale across model families, retaining 96% of specialized model performance on code generation tasks while maintaining summarization capabilities. Notably, merged models can even surpass individually fine-tuned models, with our best configuration of Qwen Coder 2.5 7B model achieving 92.7% Pass@1 on HumanEval compared to 90.9% for its task-specific fine-tuned equivalent. At a smaller scale we find instead data mixing to be a preferred strategy. We further introduce a weight analysis technique to understand how different tasks affect model parameters and their implications for merging strategies. The results suggest that careful merging and mixing strategies can effectively combine task-specific capabilities without significant performance degradation, making them ideal for resource-constrained deployment scenarios.
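The headline numbers can be made concrete with a short Python sketch. It assumes the standard unbiased pass@k estimator commonly used with HumanEval, 1 - C(n-c, k) / C(n, k) for n samples of which c pass; the abstract does not state how the authors compute Pass@1, so this is an illustration and not their evaluation code. The helper `retention` is a hypothetical name for the ratio of a multi-task model's score to the specialist's score.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass the tests, k: budget.
    For k = 1 this reduces to the pass fraction c / n.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def retention(multi_task_score: float, specialist_score: float) -> float:
    """Fraction of the specialist's score kept by the multi-task model."""
    if specialist_score <= 0:
        raise ValueError("specialist score must be positive")
    return multi_task_score / specialist_score


# Figures quoted in the abstract for Qwen Coder 2.5 7B on HumanEval.
ratio = retention(92.7, 90.9)
```

With the quoted scores, `ratio` comes out slightly above 1 (about 1.02), which is the sense in which the merged model surpasses its task-specific counterpart; the 96% figure in the abstract is the same kind of ratio taken over the code generation tasks.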
