2603.16397v1 Mar 17, 2026 cs.CL

Fanar 2.0: An Arabic-Based Generative AI Platform

Fanar 2.0: Arabic Generative AI Stack

Enes Altinisik

Citations: 244

h-index: 8

Masoomali Fatehkia

Citations: 495

h-index: 9

Majd Hawasly

Citations: 624

h-index: 13

H. Sencar

Citations: 4,594

h-index: 32

Ehsaneddin Asgari

Citations: 192

h-index: 7

Hamdy Mubarak

Citations: 5,153

h-index: 37

Kareem Darwish

Citations: 187

h-index: 4

Tasnim Mohiuddin

Citations: 394

h-index: 11

Ummar Abbas

Citations: 1

h-index: 1

M. S. Ahmad

Citations: 63

h-index: 1

M. Ahmad

Citations: 6

h-index: 2

Abdulaziz Yousuf Al-Homaid

Citations: 26

h-index: 1

Anas Al-Nuaimi

Citations: 334

h-index: 8

Sanjay Chawla

Citations: 6

h-index: 2

Shammur A. Chowdhury

Citations: 536

h-index: 12

Fahim Dalvi

Citations: 3,292

h-index: 29

Nadir Durrani

Qatar Computing Research Institute

Citations: 4,767

h-index: 37

Mohamed Elfeky

Citations: 0

h-index: 0

A. Elmagarmid

Citations: 38,347

h-index: 65

M. Eltabakh

Citations: 1,146

h-index: 16

Asim Ersoy

Citations: 3

h-index: 1

Mohammed Qusay Hashim

Citations: 0

h-index: 0

Mohamed Hefeeda

Citations: 37

h-index: 3

Mus'ab Husaini

Citations: 74

h-index: 3

Keivin Isufaj

Citations: 31

h-index: 2

Soon-gyo Jung

Citations: 4,105

h-index: 34

H. Lachemat

Citations: 2

h-index: 1

J. Lucas

Citations: 847

h-index: 7

Abubakr Mohamed

Citations: 86

h-index: 3

Basel Mousi

Citations: 218

h-index: 7

Ahmad Musleh

Citations: 68

h-index: 2

M. Ouzzani

Citations: 26,466

h-index: 44

Amin Sadeghi

Citations: 67

h-index: 2

Mohammed Shinoy

Citations: 311

h-index: 5

Omar Sinan

Citations: 61

h-index: 1

Yifan Zhang

Citations: 7

h-index: 1

Mohammad Mahdi Abootorabi

Citations: 57

h-index: 3

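This paper presents Fanar 2.0, the second generation of Qatar's Arabic-centric generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated at the Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University. Fanar 2.0 is a case of strong results under limited resources: Arabic accounts for only about 0.5% of web data despite having 400 million native speakers. By prioritizing data quality, targeted continual pre-training, and model merging, Fanar 2.0 achieves substantial gains within these constraints. The core model, Fanar-27B, was continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it improves substantially on benchmarks for Arabic knowledge (+9.1 points), language understanding (+7.3 points), dialect understanding (+3.5 points), and English capability (+7.6 points). Beyond the core LLM, Fanar 2.0 adds a range of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The Aura speech family gains an ASR model for long-form audio. The Oryx vision family provides Arabic-aware image and video understanding along with culturally grounded image generation. An agentic tool-calling framework supports multi-step workflows. Fanar-Sadiq uses a multi-agent architecture for Islamic content, and Fanar-Diwan generates classical Arabic poetry. Fanar-Shaheen provides LLM-powered bilingual translation, and a redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 shows that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.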

Original Abstract

We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.
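The abstract describes the orchestrator only at a high level. As a rough illustration of what "intent-aware routing" combined with "defense-in-depth safety validation" could look like, here is a minimal Python sketch. It is not the Fanar implementation: the keyword-based intent classifier, the blocklist check standing in for a moderation model, and all function names (`classify_intent`, `is_safe`, `orchestrate`, the handlers) are hypothetical and chosen only for this example.

```python
from typing import Callable, Dict

# Hypothetical handlers standing in for components named in the abstract.
def poetry_handler(prompt: str) -> str:
    return f"[Fanar-Diwan] poem for: {prompt}"

def translation_handler(prompt: str) -> str:
    return f"[FanarShaheen] translation of: {prompt}"

def general_handler(prompt: str) -> str:
    return f"[Fanar-27B] answer to: {prompt}"

# Toy stand-in for a learned moderation filter (assumption, not FanarGuard).
BLOCKLIST = {"forbidden"}
REFUSAL = "Request declined by safety filter."

def is_safe(text: str) -> bool:
    """Return True when no blocklisted term appears in the text."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKLIST)

def classify_intent(prompt: str) -> str:
    """Keyword-based intent detection; a real system would use a model."""
    lowered = prompt.lower()
    if "poem" in lowered:
        return "poetry"
    if "translate" in lowered:
        return "translation"
    return "general"

ROUTES: Dict[str, Callable[[str], str]] = {
    "poetry": poetry_handler,
    "translation": translation_handler,
    "general": general_handler,
}

def orchestrate(prompt: str) -> str:
    # Layer 1: validate the incoming prompt before any routing.
    if not is_safe(prompt):
        return REFUSAL
    # Route to a specialised component based on detected intent.
    response = ROUTES[classify_intent(prompt)](prompt)
    # Layer 2: validate the generated response before returning it.
    if not is_safe(response):
        return REFUSAL
    return response

if __name__ == "__main__":
    print(orchestrate("Write a poem about the sea"))
    print(orchestrate("Please translate this sentence"))
```

The point of the two checks is that a failure in one layer (for example, a prompt that looks harmless but elicits an unsafe response) can still be caught by the other, which is the general idea behind defense in depth.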

0 Citations

0 Influential

30 Altmetric

150.0 Score
