Research Datasets & Models

Over the past 15 years I have developed a wide range of language resources, tools, and models that support NLP research, including multilingual NLP,
with particular attention to low-resource languages and domain-specific applications in finance, medicine, and education.
All resources are hosted on GitHub and HuggingFace for stable versioning, citation, and long-term access.

For a broader overview of my Arabic NLP corpora and tools, visit my dedicated resources portal at ArabicNLP.uk.

Resource GitHub HuggingFace Paper(s)
AraFinNews – 212k Arabic Financial News
ArabJobs – Arabic Job Advertisements Corpus
Multilingual Corpus of World’s Constitutions (MCWC)
MCWC Constitutions fine-tuned translators [Marian MT Models] (ar↔en, ar↔es, en↔es)
Habibi – Arabic Song Lyrics Corpus
KALIMAT – Multipurpose Arabic Corpus
EASC – Essex Arabic Summaries Corpus
Arabic Management & Economic News (1200 articles)
Arabic Dialects Dataset (GULF, EGYPT, LEVANT, TUNISIA)
Multilingual Summarisation Corpus
FinAraT5 – Arabic Financial Summarisation Model
Vietnamese FreeTxt
Welsh Summarization Dataset
Welsh Thesaurus
Welsh FreeTxt Language Resources
CFIE-FRSE – Financial Report Structure Extractor
DARES – Dataset for Arabic Readability Estimation
OSMAN – Arabic Readability Metric and UN Parallel Corpus