TY - JOUR
T1 - Towards Authorship Attribution in Arabic Short-Microblog Text
AU - Jambi, Kamal Mansour
AU - Khan, Imtiaz Hussain
AU - Siddiqui, Muazzam Ahmed
AU - Alhaj, Salma Omar
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2021/9/13
Y1 - 2021/9/13
AB - Authorship attribution is the task of identifying individuals by their writing style without knowing their actual identities. It is a challenging task in natural language processing. Most work on authorship attribution has focused on English, whereas the problem is understudied in Arabic. Moreover, because of the complex and distinct morphology of Arabic, techniques developed for English are not directly applicable to it. This paper explores the use of three state-of-the-art classifiers, Support Vector Machines (SVM), K-Nearest Neighbours (KNN) and Random Forest, to predict authorship in Arabic short-microblog text. We employed three commonly used types of linguistic features (character-based, lexical and syntactic) in an incremental manner to assess the accuracy of the selected classifiers. The results show that a systematic combination of linguistic features improves authorship classification. However, an inverse correlation was observed between classification accuracy and the number of authors. Overall, the SVM and Random Forest classifiers performed comparably, attaining 65% accuracy, whereas KNN barely reached 35% accuracy. In addition, lexical features offer more discriminatory power than character and syntactic features.
KW - Arabic microblogs
KW - Authorship attribution
KW - classification
KW - grid search CV
UR - http://www.scopus.com/inward/record.url?scp=85115144699&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2021.3112624
DO - 10.1109/ACCESS.2021.3112624
M3 - Article
AN - SCOPUS:85115144699
SN - 2169-3536
VL - 9
SP - 128506
EP - 128520
JO - IEEE Access
JF - IEEE Access
ER -