The DiaCorpus Project

A collaborative project between the Data Science Institute (DSI) and the Israel Innovation Authority to create dialectal Arabic corpora: a first-of-its-kind Arabic text repository in the local (Israeli/Palestinian) dialect. The project is part of the National Natural Language Processing Program of Israel.

Goal

Development of large, comprehensive, annotated corpora in the Palestinian Arabic dialect, as part of the national NLP program.

Method

Recorded speech and textual data sources in dialectal Arabic will be used to create the corpora. Data will be collected from a variety of publicly available datasets and also generated at the Data Science Institute. The data will be cleansed and tagged for a variety of NLP tasks, such as sentiment analysis, named entity recognition (NER), coreference resolution, emotion recognition, and summarization. It will then be placed into broadly available, properly annotated corpora that can be easily accessed for NLP application development.
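To make the cleanse-and-tag step more concrete, here is a minimal Python sketch of what a single annotated corpus record might look like. It is an illustration only, not the project's actual schema or pipeline. The cleansing rules (removing tatweel and short-vowel diacritics, collapsing whitespace), the names `cleanse`, `EntitySpan`, and `CorpusRecord`, and the label values are all assumptions introduced for this example.

```python
import json
import re
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# Assumed cleansing rules (not specified by the project): drop tatweel
# (U+0640) and short-vowel diacritics (U+064B to U+0652), then collapse
# runs of whitespace into single spaces.
_TATWEEL_AND_DIACRITICS = re.compile("[\u0640\u064B-\u0652]")
_WHITESPACE = re.compile(r"\s+")


def cleanse(text: str) -> str:
    """Return a normalized copy of the raw text."""
    text = _TATWEEL_AND_DIACRITICS.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


@dataclass
class EntitySpan:
    start: int  # inclusive character offset into the cleansed text
    end: int    # exclusive character offset
    label: str  # hypothetical tag set, e.g. "PER", "LOC"


@dataclass
class CorpusRecord:
    text: str                        # cleansed text
    source: str                      # where the text came from
    sentiment: Optional[str] = None  # hypothetical label, e.g. "positive"
    entities: List[EntitySpan] = field(default_factory=list)

    def to_json(self) -> str:
        # ensure_ascii=False keeps the Arabic text readable in the output
        return json.dumps(asdict(self), ensure_ascii=False)


# Usage: cleanse a raw string, then attach annotations to the clean text.
raw = "  مرحب\u0640\u0640ا   يا  حيفا "
clean = cleanse(raw)
city = "حيفا"
start = clean.index(city)
record = CorpusRecord(
    text=clean,
    source="example",
    sentiment="positive",
    entities=[EntitySpan(start, start + len(city), "LOC")],
)
print(record.to_json())
```

Storing entity offsets against the cleansed text, rather than the raw text, keeps each annotation aligned with the string that is actually published in the corpus.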

The National NLP Program

Advances in machine learning have driven significant developments in Natural Language Processing (NLP), and a flurry of academic and industry research efforts is maturing into products that play a significant role in the day-to-day lives of people worldwide (e.g. digital personal assistants, advanced search engines, and automatic translation services). However, because the vast majority of NLP tools are developed for English, NLP applications in other languages usually lag behind in robustness, accuracy, and efficiency.

In Israel, the two national languages, Hebrew and Arabic, have also been affected by this disparity. Although some initial developments for Hebrew have recently started to emerge (e.g. speech-to-text), the two languages are still far behind English and other Indo-European languages.

The main challenges for NLP development in Hebrew and Arabic are the inherent semantic differences between English and the two morphologically rich Semitic languages, which make adapting existing tools more complicated and costly, and the scarcity of the large Hebrew and dialectal Arabic datasets needed to develop and train NLP models.

The National Natural Language Processing Program of Israel (NNLP-IL) is a national initiative to build infrastructure and to research and develop advanced NLP capabilities in Hebrew and dialectal Arabic. Once implemented, this infrastructure will facilitate the development of a variety of NLP applications, such as chatbots, prediction models, and language models.

Guiding Principles

Generic frameworks - allowing solutions to be fitted and customized to various applications, without focusing on specific use cases.
Open source (as much as possible) - everyone can take part, contribute, and use.
Breaking through the data barrier - creating tagged and untagged datasets and making them accessible to the general public.
Usability - distributing capabilities through user manuals, code repositories, and more.

The DSI has been selected to develop one of the modules in this national program: the creation of dialectal Arabic corpora and datasets.

Resources

For more information on the National Program for Natural Language Processing, see https://www.nationalplanil.ai/.