BERT is a pre-trained contextual language model that is available for several languages and has achieved high performance in many NLP tasks. It uses the surrounding text to help computers resolve the meaning of ambiguous words. BERT is based on a bidirectional multilayer transformer encoder that considers both the left and right context, so it dynamically generates semantic vectors according to the context in which a word appears. The architecture is based on multi-head attention, which allows it to capture global word dependencies. During pre-training, the model is trained on unlabeled data over several training tasks; it can then be used as a feature extractor or be fine-tuned.
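As a concrete illustration of the feature-extractor use, the following minimal Python sketch (assuming the Hugging Face transformers library and PyTorch, neither of which is prescribed by the text above) loads a pre-trained BERT checkpoint and reads out one contextual vector per token:

    import torch
    from transformers import AutoTokenizer, AutoModel

    # "bert-base-uncased" is used here only as a readily available example checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("BERT produces contextual vectors.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One vector per token; the same word receives different vectors in different contexts.
    token_vectors = outputs.last_hidden_state      # shape: (1, num_tokens, hidden_size)
    sentence_vector = token_vectors.mean(dim=1)    # simple pooled sentence representation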
The BERT model is fine-tuned using labeled data from a downstream task after being initialized with the pre-trained parameters. Even though the models start from the same pre-trained parameters, each downstream task ends up with its own fine-tuned model [30]. The pre-trained BERT model has proven to provide an efficient understanding of language, and it is also considered a suitable word embedding model for morphologically rich languages such as Arabic.
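A minimal sketch of this fine-tuning setup, again assuming the Hugging Face transformers library; the checkpoint name, texts, and labels are illustrative placeholders for a downstream classification task:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # A fresh classification head is placed on top of the pre-trained encoder.
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    texts = ["example positive sentence", "example negative sentence"]   # placeholder labeled data
    labels = torch.tensor([1, 0])
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    loss = model(**batch, labels=labels).loss   # cross-entropy over the task labels
    loss.backward()
    optimizer.step()                            # one fine-tuning step; repeat over the dataset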
Arabic BERT
Arabic BERT refers to pre-trained BERT-based language models for the Arabic language, all of which are inspired by the original BERT. Several such models exist, including the following.
· M-BERT
Google released multilingual BERT (M-BERT), which supports many languages, including Arabic, and performs well in most of them. M-BERT is similar to the original BERT but is trained on text from more than 100 languages. However, monolingual BERT models pre-trained for non-English languages have generally outperformed M-BERT.
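A brief sketch, assuming the Hugging Face transformers library, showing that one of Google's publicly released multilingual checkpoints handles Arabic with the same shared WordPiece vocabulary it uses for its other languages:

    from transformers import AutoTokenizer

    # "bert-base-multilingual-cased" is one of Google's published M-BERT checkpoints.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    print(tokenizer.tokenize("اللغة العربية"))   # "the Arabic language", split into shared sub-tokens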
· AraBERT
AraBERT is an Arabic version of BERT that is used in many Arabic NLP tasks. AraBERT is a pre-trained transformer-based model that uses the Masked Language Modelling (MLM) task and applies Farasa segmentation as a pre-processing step. AraBERT is publicly available on the internet. There are four versions of AraBERT (v0.1, v1, v0.2, and v2): AraBERT v0.1/v1 is the original version, while AraBERT v0.2/v2 is a larger version trained on more data with a larger vocabulary.
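A minimal sketch of using AraBERT with its Farasa-style segmentation, assuming the arabert package from the aub-mind/arabert repository (which wraps the segmenter), the Hugging Face transformers library, and the "aubmindlab/bert-base-arabertv2" checkpoint name published by the AraBERT authors:

    from arabert.preprocess import ArabertPreprocessor
    from transformers import AutoTokenizer, AutoModel

    model_name = "aubmindlab/bert-base-arabertv2"
    preprocessor = ArabertPreprocessor(model_name=model_name)    # applies Farasa segmentation for v1/v2 models

    text = "اللغة العربية جميلة"                                  # "the Arabic language is beautiful"
    segmented = preprocessor.preprocess(text)                     # morphemes are split before WordPiece tokenization

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    outputs = model(**tokenizer(segmented, return_tensors="pt"))  # contextual vectors for the segmented text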
· AraGPT2
AraGPT2 is a transformer model for the Arabic language built as a stacked transformer decoder. It comes in four variants (Base 135M, Medium 370M, Large 792M, and Mega 1.46B parameters) trained on publicly available Arabic corpora. This training corpus is a collection of five Arabic corpora: the El-Khair corpus, the OSCAR corpus, the Arabic Wikipedia dump, the OSIAN corpus, and news articles provided by the As-Safir newspaper. The model is able to produce high-quality Arabic text and is evaluated using the perplexity measure.
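A minimal sketch of both uses, assuming the Hugging Face transformers library and the publicly released "aubmindlab/aragpt2-base" checkpoint: generating Arabic text from a prompt, and computing perplexity as the exponential of the average token cross-entropy loss:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_name = "aubmindlab/aragpt2-base"          # smallest of the four variants
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("اللغة العربية", return_tensors="pt")   # Arabic prompt

    # Generate a continuation of the prompt.
    generated = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.95)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))

    # Perplexity of a text under the model: exponentiate the mean cross-entropy loss.
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print("perplexity:", torch.exp(loss).item())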
· AraELECTRA
AraELECTRA is a pre-training of the Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA) model on a large-scale Arabic corpus of 8.8 billion words. For tokenization, it uses the same WordPiece vocabulary as AraBERTv0.2, and it is pre-trained on the exact same dataset as AraGPT2.
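A minimal sketch of the ELECTRA-style discriminator, assuming the Hugging Face transformers library and the published "aubmindlab/araelectra-base-discriminator" checkpoint: instead of predicting masked words, the model scores every token as original or replaced:

    import torch
    from transformers import AutoTokenizer, ElectraForPreTraining

    model_name = "aubmindlab/araelectra-base-discriminator"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = ElectraForPreTraining.from_pretrained(model_name)

    inputs = tokenizer("اللغة العربية جميلة", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits              # one replacement score per token

    # A positive score means the discriminator judges the token to be a replacement.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    print(list(zip(tokens, (logits[0] > 0).long().tolist())))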
· ABioNER
Arabic Biomedical Named-Entity Recognition (ABioNER) is a BERT-based model inspired by the AraBERT and BioBERT models; it starts from AraBERT and is further pre-trained on Arabic medical literature collected from different sources. Biomedical BERT (BioBERT) is a pre-trained language representation model for the biomedical domain.
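A minimal sketch of the token-classification setup used for named-entity recognition, assuming the Hugging Face transformers library; the base checkpoint and the BIO tag set below are illustrative placeholders (ABioNER itself continues pre-training AraBERT on Arabic medical text before a fine-tuning step of this kind):

    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    labels = ["O", "B-DISEASE", "I-DISEASE", "B-DRUG", "I-DRUG"]   # hypothetical biomedical tag set
    base_checkpoint = "aubmindlab/bert-base-arabertv2"             # stand-in for the ABioNER weights

    tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(base_checkpoint, num_labels=len(labels))

    inputs = tokenizer("يعاني المريض من السكري", return_tensors="pt")   # "the patient suffers from diabetes"
    with torch.no_grad():
        logits = model(**inputs).logits                                 # (1, num_tokens, num_labels)

    # The classification head is untrained here; fine-tuning on labeled NER data makes it useful.
    predicted_tags = [labels[i] for i in logits.argmax(dim=-1)[0].tolist()]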
· ALBERT
A Lite BERT (ALBERT) is a deep-learning NLP model. It has two versions, each with four models: base, large, x-large, and xx-large. The large ALBERT model has around 18x fewer parameters than BERT-large, which it achieves through cross-layer parameter sharing and a factorized embedding parameterization that reduces the size of the embeddings: tokens are first embedded in a small embedding dimension, and a separate matrix then blows this vector up to the size of the hidden-layer vector, as sketched below.
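A small Python sketch of this factorized embedding parameterization, using the standard ALBERT-base sizes purely for illustration:

    import torch.nn as nn

    V, E, H = 30000, 128, 768                 # vocabulary size, embedding dim, hidden dim

    factorized = nn.Sequential(
        nn.Embedding(V, E),                   # V*E parameters
        nn.Linear(E, H, bias=False),          # E*H parameters: blows E up to the hidden size
    )
    full = nn.Embedding(V, H)                 # V*H parameters, as in the original BERT

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(factorized), "vs", count(full))   # roughly 3.9M vs 23M embedding parameters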
Arabic ALBERT is a pre-trained ALBERT for the Arabic language, trained on about 4.4 billion words from Arabic Wikipedia and the Arabic version of OSCAR. There are three versions of Arabic ALBERT: Base, Large, and X-Large.
References:
K. N. Elmadani, M. Elgezouli, and A. Showk, “BERT Fine-tuning For Arabic Text Summarization,” arXiv:2004.14135 [cs], Mar. 2020. Accessed: Oct. 14, 2021. [Online]. Available: http://arxiv.org/abs/2004.14135
H. Sun et al., “Knowledge Distillation from Bert in Pre-Training and Fine-Tuning for Polyphone Disambiguation,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec. 2019, pp. 168–175. doi: 10.1109/ASRU46091.2019.9003918.
W. C. Mann and S. A. Thompson, “Rhetorical Structure Theory: Toward a functional theory of text organization.” Accessed: Feb. 03, 2022. [Online]. Available: https://semanticsarchive.net/Archive/GMyNDBjO/RST%20towards%20a%20functional%20theory%20of%20text%20organization.pdf
AUB MIND Lab, “AraBERTv2 / AraGPT2 / AraELECTRA,” GitHub repository, 2021. Accessed: Nov. 12, 2021. [Online]. Available: https://github.com/aub-mind/arabert
W. Antoun, F. Baly, and H. Hajj, “AraGPT2: Pre-Trained Transformer for Arabic Language Generation,” arXiv:2012.15520 [cs], Mar. 2021. Accessed: Feb. 18, 2022. [Online]. Available: http://arxiv.org/abs/2012.15520
W. Antoun, F. Baly, and H. Hajj, “AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding,” arXiv:2012.15516 [cs], Mar. 2021. Accessed: Feb. 18, 2022. [Online]. Available: http://arxiv.org/abs/2012.15516
N. Boudjellal et al., “ABioNER: A BERT-Based Model for Arabic Biomedical Named-Entity Recognition,” Complexity, vol. 2021, pp. 1–6, Mar. 2021, doi: 10.1155/2021/6633213.
J. Lee et al., “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, p. btz682, Sep. 2019, doi: 10.1093/bioinformatics/btz682.
Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations,” arXiv:1909.11942 [cs], 2019. Accessed: Mar. 27, 2022. [Online]. Available: https://arxiv.org/abs/1909.11942
“Arabic-ALBERT,” KUIS AI, Aug. 26, 2020. Accessed: Mar. 27, 2022. [Online]. Available: https://ai.ku.edu.tr/arabic-albert/