2604.19098v1 Apr 21, 2026 cs.CL

SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning

Sophia Ananiadou (Citations: 28; h-index: 3)

Preslav Nakov (Citations: 7,695; h-index: 48)

Rania Elbadry (Citations: 2; h-index: 1)

S. Lahlou (Citations: 1,605; h-index: 9)

Xueqing Peng (Citations: 512; h-index: 12)

Jimin Huang (Citations: 158; h-index: 6)

Zhuohan Xie (Citations: 263; h-index: 9)

Veselin Stoyanov (Citations: 12; h-index: 2)

Ahmed Heakl (Citations: 227; h-index: 7)

Sarfraz Ahmad (Citations: 14; h-index: 2)

Dani Bouch (Citations: 1; h-index: 1)

Momina Ahsan (Citations: 14; h-index: 2)

Muhra AlMahri (Citations: 8; h-index: 1)

Marwa Elsaid khalil (Citations: 0; h-index: 0)

Yuxia Wang (Citations: 907; h-index: 15)

Original Abstract

English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari'ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event-cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.

1 Citation

0 Influential

24 Altmetric

121.0 Score
