TY - JOUR
T1 - Class association and attribute relevancy based imputation algorithm to reduce Twitter data for optimal sentiment analysis
AU - Bibi, Maryum
AU - Nadeem, Malik Sajjad Ahmed
AU - Khan, Imtiaz Hussain
AU - Shim, Seong O.
AU - Khan, Ishtiaq Rasool
AU - Naqvi, Uzma
AU - Aziz, Wajid
N1 - Publisher Copyright: © 2013 IEEE.
PY - 2019/9/18
Y1 - 2019/9/18
AB - Twitter sentiment analysis is a challenging task that involves various preprocessing steps including dimensionality reduction. Dimensionality reduction helps ensure low computational complexity and performance improvement during the classification process. In Twitter data, each tweet has feature values which may or may not reflect a person's response. Therefore, a large number of sparse data points are generated when tweets are represented as a feature matrix, eventually increasing computational overheads and error rates in Twitter sentiment analysis. This study proposes a novel preprocessing technique called class association and attribute relevancy based imputation algorithm (CAARIA) to reduce the Twitter data size. CAARIA achieves the dimensionality reduction goal by imputing those tweets that belong to the same class and also share useful information. The performance of two classifiers (Naïve Bayes and support vector machines) is evaluated on three Twitter datasets in terms of classification accuracy, measured as area under the curve, and time efficiency. CAARIA is also compared against two widely used feature selection (dimensionality reduction) techniques, information gain (IG) and Pearson's correlation (PC). The findings reveal that CAARIA outperforms IG and PC in terms of classification accuracy and time efficiency. These results suggest that CAARIA is a robust data preprocessing technique for the classification task.
KW - Classification
KW - Twitter sentiment analysis
KW - class association
KW - dimensionality reduction
KW - imputation
KW - machine learning
KW - preprocessing
UR - http://www.scopus.com/inward/record.url?scp=85077954002&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2019.2942112
DO - 10.1109/ACCESS.2019.2942112
M3 - Article
AN - SCOPUS:85077954002
SN - 2169-3536
VL - 7
SP - 136535
EP - 136544
JO - IEEE Access
JF - IEEE Access
M1 - 8843854
ER -