dc.contributor.advisor |
Manamela, M. J. D. |
|
dc.contributor.author |
Moila, Mahlodi Mercy |
|
dc.contributor.other |
Modipa, T. I. |
|
dc.date.accessioned |
2025-10-20T10:44:40Z |
|
dc.date.available |
2025-10-20T10:44:40Z |
|
dc.date.issued |
2025 |
|
dc.identifier.uri |
http://hdl.handle.net/10386/5131 |
|
dc.description |
Thesis (M.Sc. (Computer Science)) -- University of Limpopo, 2025 |
en_US |
dc.description.abstract |
A transformer is a deep learning model that processes sequential input data using an encoder-decoder architecture. Transformers process the input in parallel while attending to each token at a time, applying an attention mechanism to every unit of text being processed. Transformer-based models are known to deliver state-of-the-art performance on natural language processing (NLP) tasks compared with recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) networks, which suffer from vanishing and exploding gradients during training. Although GPT-style transformer models have shown great success in text generation, few text generation systems have been developed using transformer-based models for under-resourced African languages, namely the Sepedi language. This research project aimed to develop a text generation model for the Sepedi language using transformer-based machine learning techniques. An LSTM-Sepedi attention-based model and a GPT-Sepedi transformer-based model were developed and trained on the National Centre for Human Language Technology (NCHLT) Sepedi text corpus, and the two models were compared on the text they generated. The GPT-Sepedi transformer-based model was used to generate text, which was then compared against the Sepedi language vocabulary to determine its validity; 61% of the words in the generated texts were found in the Sepedi vocabulary. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score was used to compare the model-generated text with human-written text. The ROUGE results indicate that the GPT-Sepedi transformer-based text generation model generated words that humans would write with 83% precision. Although the precision was high, the generated text was not comprehensible, as reflected in the recall of 0.05% and the F1-score of 0.1%. |
en_US |
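The abstract describes two evaluation steps: checking what fraction of generated tokens appear in a Sepedi vocabulary, and scoring generated text against human-written text with ROUGE. The Python sketch below (not code from the thesis) illustrates both under simple assumptions, using unigram (ROUGE-1) overlap; all names and inputs (generated, reference, sepedi_vocab) are hypothetical placeholders.

    # Minimal sketch of the two evaluation steps described in the abstract.
    # All inputs below are hypothetical placeholders, not the thesis data.
    from collections import Counter

    def vocab_validity(generated_tokens, sepedi_vocab):
        """Fraction of generated tokens found in the Sepedi vocabulary."""
        if not generated_tokens:
            return 0.0
        hits = sum(1 for tok in generated_tokens if tok in sepedi_vocab)
        return hits / len(generated_tokens)

    def rouge1(generated_tokens, reference_tokens):
        """Unigram ROUGE: precision, recall and F1 over token overlap."""
        overlap = sum((Counter(generated_tokens) & Counter(reference_tokens)).values())
        precision = overlap / max(len(generated_tokens), 1)
        recall = overlap / max(len(reference_tokens), 1)
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f1

    # Hypothetical usage:
    generated = "ke a leboga".split()
    reference = "ke a leboga kudu".split()
    sepedi_vocab = {"ke", "a", "leboga", "kudu"}
    print(vocab_validity(generated, sepedi_vocab))  # 1.0
    print(rouge1(generated, reference))             # (1.0, 0.75, ~0.857)

As in the abstract's findings, high precision with low recall arises when the few tokens a model emits do appear in human text, but the model fails to reproduce most of the reference content.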
dc.description.sponsorship |
NRF (National Research Foundation) |
en_US |
dc.format.extent |
xii, 84 leaves |
en_US |
dc.language.iso |
en |
en_US |
dc.relation.requires |
PDF |
en_US |
dc.subject |
Machine learning |
en_US |
dc.subject |
Transformer |
en_US |
dc.subject |
Sepedi |
en_US |
dc.subject |
Text generation |
en_US |
dc.subject |
GPT |
en_US |
dc.subject.lcsh |
Deep learning (Machine learning) |
en_US |
dc.subject.lcsh |
Northern Sotho language |
en_US |
dc.subject.lcsh |
Machine learning |
en_US |
dc.subject.lcsh |
Natural language generation (Computer science) |
en_US |
dc.title |
The development of a text generation model for Sepedi language using transformer-based machine-learning techniques |
en_US |
dc.type |
Thesis |
en_US |