| dc.description.abstract |
Sentiment analysis is an essential natural language processing technique for monitoring online discussions about brands, products, and services. Traditionally focused on monolingual data, it has expanded to include code-mixed texts, reflecting the growing use of multiple languages within single sentences on social media. This dissertation addresses the gap in sentiment analysis for code-mixed data by developing a Long Short-Term Memory (LSTM) classifier for Xitsonga-English comments extracted from YouTube music reviews. The research aims to design and implement a sentiment analysis model tailored to Xitsonga-English code-mixed texts and to evaluate its performance against traditional monolingual sentiment analysis methods. The objectives include collecting a substantial dataset of Xitsonga-English comments, determining their polarity, developing an LSTM classifier, and assessing its accuracy, precision, recall, and F1-score. For data collection, 1 998 Xitsonga-English comments were scraped from a Xitsonga YouTube channel and then cleaned and tokenized for analysis.
Sentiments were categorized into positive, negative, and neutral classes according to defined criteria, and sentiment dictionaries were developed for both Xitsonga and English. These lexicons were used to label the comments, producing the training data for the LSTM model. Additionally, a word embedding matrix was built using Word2Vec to capture semantic similarities between words. The LSTM classifier's architecture comprised embedding layers initialized with the pre-trained word embeddings, two LSTM layers for sequence processing, and a dense output layer for sentiment classification. Despite efforts to address overfitting through regularization and model adjustments, the final LSTM model did not perform as expected on the validation and test datasets, highlighting the difficulty of generalizing sentiment classification from the collected dataset. In response, a stacking classifier combining Random Forest, Support Vector Machine, Gradient Boosting, and Logistic Regression was developed and compared with the LSTM model. The stacking classifier generalized better to unseen data, suggesting that it is the more robust approach for sentiment analysis in code-mixed contexts.
The results highlight the challenges and potential solutions in developing robust sentiment analysis models for code-mixed languages, contributing valuable insights to the domain of natural language processing. |
en_US |