Research Datasets & Models
Over the past 15 years I have developed a wide range of language resources, tools, and models that support research in NLP and multilingual NLP,
with particular attention to low-resource languages and domain-specific applications in finance, medicine, and education.
All resources are hosted on GitHub and HuggingFace for stable versioning, citation, and long-term access.
For a broader overview of my Arabic NLP resources, corpora, and tools, you can also visit my dedicated resources portal at ArabicNLP.uk.
| Resource | GitHub | HuggingFace | Paper(s) |
|---|---|---|---|
| AraFinNews – 212k Arabic Financial News | | | |
| ArabJobs – Arabic Job Advertisements Corpus | | | |
| Multilingual Corpus of World’s Constitutions (MCWC) | | | |
| MCWC Constitutions fine-tuned translators [Marian MT Models] (ar↔en, ar↔es, en↔es) | | | |
| Habibi – Arabic Song Lyrics Corpus | | | |
| KALIMAT – Multipurpose Arabic Corpus | | | |
| EASC – Essex Arabic Summaries Corpus | | | |
| Arabic Management & Economic News (1200 articles) | — | | |
| Arabic Dialects Dataset (GULF, EGYPT, LEVANT, TUNISIA) | — | | |
| Multilingual Summarisation Corpus | — | | |
| FinAraT5 – Arabic Financial Summarisation Model | — | | |
| Vietnamese FreeTxt | | — | — |
| Welsh Summarization Dataset | | — | |
| Welsh Thesaurus | | — | |
| Welsh FreeTxt Language Resources | | — | |
| CFIE-FRSE – Financial Report Structure Extractor | | — | |
| DARES: Dataset for Arabic Readability Estimation | | — | |
| OSMAN – Arabic Readability Metric and UN Parallel Corpus | | — | |