Towards Authorship Attribution in Arabic Short-Microblog Text

Kamal Mansour Jambi, Imtiaz Hussain Khan*, Muazzam Ahmed Siddiqui, Salma Omar Alhaj

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

Authorship attribution is the task of identifying individuals by their writing styles without knowing their actual identities. It is a challenging task in natural language processing. Most work on authorship attribution has focused on English, whereas the problem is understudied for Arabic. Moreover, owing to the complex and distinct morphology of Arabic, techniques developed for English are not directly applicable to it. This paper explored the possibility of using state-of-the-art classifiers, namely Support Vector Machines (SVM), K-Nearest Neighbours (KNN) and Random Forest, to predict authorship in Arabic short-microblog text. We employed three commonly used types of linguistic features (character-based, lexical and syntactic) in an incremental manner and measured the accuracy of the selected classifiers. The results show that a systematic combination of linguistic features improves authorship classification. However, classification accuracy decreased as the number of authors increased. Overall, the SVM and Random Forest classifiers performed comparably and attained 65% accuracy, whereas KNN barely attained 35% accuracy. In addition, lexical features offered more discriminatory power than the character and syntactic features.
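
As a rough illustration of the incremental feature combination described in the abstract, the sketch below builds character-based features first and then adds lexical features, classifying with a simple nearest-neighbour vote over cosine similarity. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the toy posts in `TRAIN`, and the author labels are all hypothetical, syntactic features are omitted, and only a basic KNN is shown rather than the SVM and Random Forest classifiers the paper also evaluates. The placeholder posts are in English for readability; the same code operates on any Unicode string, including Arabic.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Character n-gram counts (a character-based feature set)."""
    return Counter(f"c:{text[i:i + n]}" for i in range(len(text) - n + 1))


def word_unigrams(text):
    """Whitespace-token counts (a simple lexical feature set)."""
    return Counter(f"w:{tok}" for tok in text.split())


def extract(text, extractors):
    """Merge the selected feature sets into one sparse count vector."""
    features = Counter()
    for fn in extractors:
        features.update(fn(text))
    return features


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_author(text, training, extractors, k=1):
    """Majority vote among the k training posts most similar to `text`."""
    query = extract(text, extractors)
    scored = sorted(
        ((cosine(query, extract(post, extractors)), author)
         for post, author in training),
        reverse=True,
    )
    votes = Counter(author for _, author in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy posts; real experiments would use labelled microblog data.
TRAIN = [
    ("good morning everyone, coffee first", "author_a"),
    ("coffee first, then the morning news", "author_a"),
    ("match tonight!! who is watching??", "author_b"),
    ("what a match!! unbelievable goal!!", "author_b"),
]

# Add feature sets incrementally: character-based only, then plus lexical.
for extractors in ([char_ngrams], [char_ngrams, word_unigrams]):
    names = " + ".join(fn.__name__ for fn in extractors)
    print(names, "->", predict_author("coffee first this morning", TRAIN, extractors))
```

Prefixing feature keys with `c:` and `w:` keeps the feature sets from colliding when they are merged, so adding a set only ever extends the vector. A fuller experiment would hold out test posts per author and tune hyperparameters by cross-validated grid search, as the keywords suggest.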

Original language: English
Pages (from-to): 128506-128520
Number of pages: 15
Journal: IEEE Access
Volume: 9
Publication status: Published - 13 Sept 2021
Externally published: Yes

Keywords

  • Arabic microblogs
  • Authorship attribution
  • classification
  • grid search CV
