Comparative study on feature selection and ensemble methods for sentiment analysis classification
Date
2020-06-10
Authors
Zahir Mohammad Adnan Younis
زاهر "محمد عدنان" يونس
Publisher
Al-Quds University
Abstract
People use the Web and social media to express their opinions and comments on various
topics and posts, generating huge amounts of data. This creates the need to analyze
the large volume of text about a given subject and determine what people think of
it. Interest in this analysis is rising continuously in many fields, such as
politics, marketing, entertainment, and sports, as a way to identify people's opinions,
interests, preferences, and trends. Consequently, the analysis, classification, and clustering of
this text data has attracted the interest of a large
number of researchers and beneficiaries. This analysis of text content is known as
sentiment analysis.
Sentiment Analysis (SA) is a text-mining field that computationally treats and analyzes
the sentiments (opinions, thoughts, subjectivity, interests, preferences, etc.) expressed in available
text. SA aims to classify expressions in a text as a positive, negative, or neutral opinion
towards the subject of interest.
The main objective of this research is to carry out a comparative study of the accuracy and
performance of feature selection and ensemble methods for SA classification. The
comparison was carried out using different combinations of classification algorithms to
classify text as either positive or negative.
The comparison of the algorithms and methods showed that better
accuracy can be achieved depending on the feature selection method used (statistical,
wrapper, or embedded). It also showed which feature selection method
outperforms the others and is more suitable for the type of data and classification
algorithms. Furthermore, combined ensemble methods (Bagging, Boosting,
Stacking, and Vote) achieved higher accuracy than a single classifier.
Moreover, merging the feature subsets selected by embedded methods improved classification
accuracy. Finally, tuning the parameters of the feature selection methods improved
classification accuracy and reduced the time needed to select feature subsets.
In particular, the results showed that accuracy depends on the feature selection method,
the ensemble method, the number of selected features, the type of classifier, and the tuning parameters of
the algorithms used. An accuracy of up to 99.85% was achieved by merging the features of
two embedded methods and using the stacking ensemble method. An accuracy of
99.5% was achieved by tuning the parameters of the stacking method, and accuracy reached 99.95% and
100% by tuning the parameters of the SVMAttributeEval method using the statistical and machine
learning approaches, respectively. Furthermore, tuning the algorithms' parameters reduced the
time needed to select feature subsets.
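
As a rough illustration of two ideas named in the abstract, the sketch below scores terms with a chi-square statistic (one common statistical feature selection criterion) and combines classifier outputs by majority vote (the simplest form of a Vote ensemble). It is not the thesis's implementation: the toy corpus, the function names (chi2_scores, top_k, majority_vote), and the choice of chi-square are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy corpus for illustration only; not data from the thesis.
DOCS = [
    ("great movie loved it", "pos"),
    ("loved the great acting", "pos"),
    ("great fun", "pos"),
    ("terrible movie hated it", "neg"),
    ("hated the terrible plot", "neg"),
    ("terrible waste", "neg"),
]


def chi2_scores(docs):
    """Chi-square score of each term against a binary pos/neg label."""
    n = len(docs)
    n_pos = sum(1 for _, label in docs if label == "pos")
    tokenized = [(set(text.split()), label) for text, label in docs]
    vocab = set().union(*(tokens for tokens, _ in tokenized))
    scores = {}
    for term in vocab:
        a = sum(1 for t, y in tokenized if term in t and y == "pos")  # present, pos
        b = sum(1 for t, y in tokenized if term in t and y == "neg")  # present, neg
        c = n_pos - a            # absent, pos
        d = (n - n_pos) - b      # absent, neg
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


def majority_vote(predictions):
    """Combine the labels predicted by several classifiers by simple majority."""
    return Counter(predictions).most_common(1)[0][0]


scores = chi2_scores(DOCS)
selected = top_k(scores, 2)
```

On this toy corpus, terms that occur in only one class (such as "great" and "terrible") receive the highest scores, while terms spread evenly across both classes (such as "movie") score zero. In a real study, the selected terms would feed each base classifier, and majority_vote would combine their per-document predictions.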