Towards Authorship Attribution in Arabic Short-Microblog Text

Kamal Mansour Jambi, Imtiaz Hussain Khan*, Muazzam Ahmed Siddiqui, Salma Omar Alhaj

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

Authorship attribution is the task of identifying individuals by their writing style without knowing their actual identities. It is a challenging task in natural language processing. Most work on authorship attribution has focused on English, whereas the problem remains understudied for Arabic. Moreover, because of the complex and distinct morphology of Arabic, techniques developed for English are not directly applicable to it. This paper explores the possibility of using state-of-the-art classifiers, namely Support Vector Machines (SVM), K-Nearest Neighbours (KNN) and Random Forest, to predict authorship in Arabic short-microblog text. We employed three commonly used types of linguistic features (character-based, lexical and syntactic) in an incremental manner to evaluate the accuracy of the selected classifiers. The results show that a systematic combination of linguistic features improves authorship classification. However, an inverse correlation was observed between classification accuracy and the number of authors. Overall, the SVM and Random Forest classifiers performed comparably and attained 65% accuracy, whereas KNN barely reached 35% accuracy. In addition, lexical features offer more discriminatory power than character and syntactic features.
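The incremental use of feature sets described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes character bigrams and whitespace tokens as stand-ins for the character and lexical features, leaves out syntactic features (which would need a part-of-speech tagger), and swaps in a 1-nearest-neighbour rule over cosine similarity in place of the SVM, KNN and Random Forest models used in the paper. All function names and the toy posts are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Character-based features: counts of overlapping character n-grams."""
    return Counter("c:" + text[i:i + n] for i in range(len(text) - n + 1))


def lexical(text):
    """Lexical features: counts of whitespace-separated tokens."""
    return Counter("w:" + token for token in text.split())


def features(text, extractors):
    """Merge the counts from each extractor into one sparse feature vector."""
    vec = Counter()
    for extract in extractors:
        vec.update(extract(text))
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_author(text, training, extractors):
    """1-nearest-neighbour: return the author of the most similar training post."""
    query = features(text, extractors)
    best = max(training, key=lambda pair: cosine(query, features(pair[0], extractors)))
    return best[1]


# Toy, made-up posts for illustration only (not data from the paper).
training = [
    ("صباح الخير يا أصدقاء", "author_a"),
    ("صباح الخير للجميع", "author_a"),
    ("مباراة اليوم كانت رائعة", "author_b"),
    ("مباراة الأمس كانت مملة", "author_b"),
]

# Add feature sets one at a time, mirroring the incremental setup.
for extractors in ([lexical], [char_ngrams, lexical]):
    names = [extract.__name__ for extract in extractors]
    print(names, predict_author("مباراة الغد ستكون رائعة", training, extractors))
```

Passing a longer list of extractors to `features` is what makes the setup incremental: each run keeps the earlier feature set and adds one more, so any change in the predictions can be attributed to the newly added features.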

Original language: English
Pages (from-to): 128506-128520
Number of pages: 15
Journal: IEEE Access
Volume: 9
Digital Object Identifiers (DOIs)
Status: Published - 13 Sept 2021
Published externally: Yes
