Developing a code-mixed sentiment analysis model for Xitsonga-English music review

Nkuna, Blessing

dc.contributor.advisor	Modipa, T. I.
dc.contributor.author	Nkuna, Blessing
dc.contributor.other	Ramalepe, P. S.
dc.date.accessioned	2026-03-06T12:23:43Z
dc.date.available	2026-03-06T12:23:43Z
dc.date.issued	2025
dc.identifier.uri	http://hdl.handle.net/10386/5363
dc.description	Thesis (M. Sc. (Computer Science)) -- University of Limpopo, 2025	en_US
dc.description.abstract	Sentiment analysis is an essential natural language processing technique for monitoring online discussions about brands, products, and services. Traditionally focused on monolingual data, sentiment analysis has now expanded to include code-mixed texts, reflecting the growing use of multiple languages within single sentences on social media. This dissertation addresses the gap in sentiment analysis for code-mixed data by developing a Long Short-Term Memory (LSTM) classifier for Xitsonga-English comments extracted from YouTube music reviews. The research aims to design and implement a sentiment analysis model tailored for Xitsonga-English code-mixed texts and to evaluate its performance against traditional monolingual sentiment analysis methods. The objectives include collecting a substantial dataset of Xitsonga-English comments, determining their polarity, developing an LSTM classifier, and assessing its accuracy, precision, recall, and F1-score. Data collection involved scraping 1 998 Xitsonga-English comments from a Xitsonga YouTube channel, then cleaning and tokenizing the comments for analysis. Sentiments were defined and categorized into positive, negative, and neutral classes based on specific criteria, with sentiment dictionaries developed for both Xitsonga and English. These lexicons were used to label the comments, producing the training data for the LSTM model. Additionally, a word embedding matrix was developed using Word2Vec to capture semantic similarities between words. The LSTM classifier's architecture included embedding layers initialized with pre-trained word embeddings, two LSTM layers for sequence processing, and a dense output layer for sentiment classification. Despite efforts to address overfitting through regularization and model adjustments, the final LSTM model did not perform as expected on the validation and test datasets, highlighting the difficulty of generalizing sentiment classification from the collected dataset.
To address this, a stacking classifier combining Random Forest, Support Vector Machine, Gradient Boosting, and Logistic Regression was developed and compared with the LSTM model. The stacking classifier showed better generalization on unseen data, indicating its robustness for sentiment analysis tasks in code-mixed contexts. The results highlight the challenges and potential solutions in developing robust sentiment analysis models for code-mixed languages, contributing valuable insights to the domain of natural language processing.	en_US
dc.format.extent	ix, 85 leaves	en_US
dc.language.iso	en	en_US
dc.relation.requires	PDF	en_US
dc.subject	Sentiment analysis	en_US
dc.subject	Code-mixed	en_US
dc.subject	Polarity	en_US
dc.subject	Annotated	en_US
dc.subject	Long short-term memory classifier	en_US
dc.subject	Stacking classifier	en_US
dc.subject.lcsh	Sentiment analysis	en_US
dc.subject.lcsh	Deep learning (Machine learning)	en_US
dc.subject.lcsh	Code switching (Linguistics)	en_US
dc.title	Developing a code-mixed sentiment analysis model for Xitsonga-English music review	en_US
dc.type	Thesis	en_US
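
The lexicon-based labelling step described in the abstract can be sketched as follows. This is a minimal illustration in Python, not the thesis's implementation: the lexicon entries, the function names `tokenize` and `label_comment`, and the sum-of-scores rule are all assumptions, since this record does not give the actual dictionaries or labelling criteria.

```python
import re

# Hypothetical mini-lexicons for illustration only; the dictionaries built in
# the thesis are not reproduced in this record. +1 = positive, -1 = negative.
XITSONGA_LEXICON = {"saseka": 1, "tsakisa": 1, "biha": -1}
ENGLISH_LEXICON = {"love": 1, "great": 1, "nice": 1, "boring": -1, "bad": -1}


def tokenize(comment):
    """Lowercase a comment and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", comment.lower())


def label_comment(comment):
    """Sum polarity scores from both lexicons and map the total to a class.

    Assumed rule: total > 0 is positive, total < 0 is negative, and
    anything else (including no lexicon hits) is neutral.
    """
    score = 0
    for token in tokenize(comment):
        score += XITSONGA_LEXICON.get(token, 0) + ENGLISH_LEXICON.get(token, 0)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Under these placeholder lexicons, `label_comment("This beat is boring")` returns `"negative"`. Labels produced this way could then serve as training targets for a classifier, which is the role the abstract assigns to the lexicons.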