TY - JOUR
T1 - Threatening language detection from Urdu data with deep sequential model
AU - Ullah, Ashraf
AU - Khan, Khair Ullah
AU - Khan, Aurangzeb
AU - Bakhsh, Sheikh Tahir
AU - Rahman, Atta Ur
AU - Akbar, Sajida
AU - Saqia, Bibi
A2 - Rana, Toqir
N1 - Publisher Copyright: © 2024 Ullah et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2024/6/6
Y1 - 2024/6/6
AB - Urdu is widely written and spoken on social media platforms such as Twitter, WhatsApp, Facebook, and YouTube. However, because Urdu Language Processing (ULP) libraries are lacking, identifying threats in Urdu textual and sequential data on social media is challenging. Urdu data therefore needs to be preprocessed as efficiently as English data, which requires stemming and data cleaning libraries for Urdu. Various lexical and machine learning-based techniques have been introduced in the literature, but all of them are limited by the unavailability of an online Urdu vocabulary. This research introduces an Urdu language vocabulary, including a stop words list and a stemming dictionary, to preprocess Urdu data as efficiently as English data. This preprocessing reduces the input size of Urdu sentences and removes redundant and noisy information. Finally, a deep sequential model based on Long Short-Term Memory (LSTM) units is trained, evaluated, and tested on the preprocessed data. The proposed methodology achieves an accuracy of 82%, which is higher than that of existing methods.
UR - http://www.scopus.com/inward/record.url?scp=85195398335&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0290915
DO - 10.1371/journal.pone.0290915
M3 - Article
C2 - 38843283
SN - 1932-6203
VL - 19
JO - PLoS ONE
JF - PLoS ONE
IS - 6
M1 - e0290915
ER -