Abstract:
This study centers on developing an automatic speech recognition (ASR) system for Tshivenda, one of South Africa's under-resourced languages, using the Connectionist Temporal Classification (CTC) framework and the NCHLT speech corpus.
The primary objective is to develop and evaluate an end-to-end (E2E) ASR system for Tshivenda using the CTC approach. This involves designing and training an ASR model on the NCHLT speech corpus, optimizing performance through hyperparameter tuning (e.g., learning rate and dropout rate), and evaluating accuracy with metrics such as word error rate (WER) and training loss. The research also identifies key challenges in recognizing Tshivenda speech and proposes improvements for future work in this area.
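To illustrate the CTC approach at the heart of the system, the sketch below shows the greedy (best-path) CTC decoding rule: collapse consecutive repeated labels, then remove the blank symbol. This is a minimal, generic illustration of CTC decoding, not the study's actual model code; the label indices and blank index are hypothetical.

```python
# Minimal sketch of greedy (best-path) CTC decoding.
# Assumes per-frame label indices were already obtained as the argmax
# over the network's per-frame output distribution; index 0 is the blank.

BLANK = 0

def ctc_collapse(frame_labels):
    """Apply the CTC decoding rule: collapse repeats, then drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev:          # collapse consecutive repeated labels
            if lab != BLANK:     # drop the blank symbol
                out.append(lab)
        prev = lab
    return out

# Hypothetical frame-wise predictions (1, 2, 3 stand for three characters):
print(ctc_collapse([1, 1, 0, 2, 2, 0, 0, 3]))  # → [1, 2, 3]
```

Note that the blank between two identical labels is what lets CTC emit genuine double characters: `[1, 0, 1]` decodes to `[1, 1]`, whereas `[1, 1]` collapses to `[1]`.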
However, several delimitations bound the scope of the study. First, the research relies on the NCHLT speech corpus, which, although valuable, has limited dialectal diversity and does not fully represent all regional variations of Tshivenda. Second, the model was trained primarily on clean speech data, so it does not extensively address noisy environments or spontaneous speech. Third, while the study focuses on a CTC-based deep learning model, it does not explore integrating external language models, such as transformer-based language models, which could further improve performance. Finally, owing to hardware limitations, the model was trained for only 30 epochs, which may have prevented it from reaching optimal performance and thus affected the accuracy of the final system.
The model's performance was assessed over 30 epochs using word error rate (WER), training loss, and validation loss. The best-performing model achieved a final WER of 0.3934, a notable advance for Tshivenda speech recognition.
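The WER metric reported above is the word-level Levenshtein (edit) distance between the reference transcript and the hypothesis, divided by the number of reference words. The following is a minimal sketch of that computation, not the evaluation code used in the study; the example sentences are hypothetical.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub,            # substitution (or match)
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of three reference words gives WER = 1/3:
print(wer("ndi a livhuwa", "ndi livhuwa"))  # → 0.333…
```

A final WER of 0.3934 therefore means that roughly 39% of the reference words required a substitution, insertion, or deletion to match the model's output.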
This research highlights the promise of deep learning models for building ASR systems for under-resourced languages, while also pointing out critical directions for future work: expanding the dataset, integrating language models, and improving the model's robustness to noisy conditions and spontaneous speech. These steps are essential for enhancing accuracy and practical usability. The study contributes to the broader mission of promoting language preservation and accessibility through technological innovation.