Research Datasets & Models

Over the past 15 years, I have developed a wide range of language resources, tools, and models that support research in monolingual and multilingual NLP,
with particular attention to low-resource languages and domain-specific applications in finance, medicine, and education.
All resources are hosted on GitHub and HuggingFace for stable versioning, citation, and long-term access.

For a broader overview of my Arabic NLP resources, corpora, and tools, you can also visit my dedicated resources portal at ArabicNLP.uk.

Resource	GitHub	HuggingFace	Paper(s)
Tarab: A Multi-Dialect Corpus of Arabic Lyrics and Poetry – 2.6M verses of lyrics and poems	—
FreeTxt-Vi: A Benchmarked Vietnamese–English Toolkit for Segmentation, Sentiment, and Summarisation		—
VietJobs: A Vietnamese Job Advertisement Dataset		—
AraFinNews – 212k Arabic Financial News
ArabJobs – Arabic Job Advertisements Corpus
Multilingual Corpus of World’s Constitutions (MCWC)
MCWC Constitutions fine-tuned translators [Marian MT Models] (ar↔en, ar↔es, en↔es)
Habibi – Arabic Song Lyrics Corpus
KALIMAT – Multipurpose Arabic Corpus
EASC – Essex Arabic Summaries Corpus
Arabic Management & Economic News (1200 articles)	—
Arabic Dialects Dataset (GULF, EGYPT, LEVANT, TUNISIA)	—
Multilingual Summarisation Corpus	—
FinAraT5 – Arabic Financial Summarisation Model	—
Vietnamese FreeTxt		—	—
Welsh Summarization Dataset		—
Welsh Thesaurus		—
Welsh FreeTxt Language Resources		—
CFIE-FRSE – Financial Report Structure Extractor		—
DARES: Dataset for Arabic Readability Estimation		—
OSMAN – Arabic Readability Metric and UN Parallel Corpus		—