2605.07731v1 May 08, 2026 cs.CL


Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

Andrea Sassella


M. Carman


A. Chizzola


Tommaso Bianchi


L. Alessandrelli


Abstract

This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B-parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks and compared against comparably sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well as or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It also achieves the best performance among these models in the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well as or better than the other Italian models, with the exception of Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower values on BFCL and on some ARC and ITALIC settings. Finally, it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8B-Instruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B outperforms the Italian models under evaluation but scores below some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings indicate that the new model is a step forward for native Italian Large Language Models.
