dc.contributor.advisor Manamela, M. J. D.
dc.contributor.author Sindana, Daniel
dc.contributor.other Modipa, T. I.
dc.date.accessioned 2021-07-29T08:50:55Z
dc.date.available 2021-07-29T08:50:55Z
dc.date.issued 2020
dc.identifier.uri http://hdl.handle.net/10386/3413
dc.description Thesis (M.Sc. (Computer Science)) -- University of Limpopo, 2020 en_US
dc.description.abstract Language modelling (LM) work for under-resourced languages that does not consider most linguistic information inherent in a language produces language models that inadequately represent the language, thereby leading to under-development of natural language processing tools and systems such as speech recognition systems. This study investigated the influence that the orthography (i.e., writing system) of a language has on the quality and/or robustness of the language models created for the text of that language. The unique conjunctive and disjunctive writing systems of isiNdebele (Ndebele) and Sepedi (Pedi) were studied. The text data from the LWAZI and NCHLT speech corpora were used to develop language models. The LM techniques that were implemented included: word-based n-gram LM, LM smoothing, LM linear interpolation, and higher-order n-gram LM. The toolkits used for development were the HTK LM, SRILM, and CMU-Cam SLM toolkits. From the findings of the study – on text preparation, data pooling and sizing, higher n-gram models, and interpolation of models – it is concluded that the orthography of the selected languages does have an effect on the quality of the language models created for their text. The following recommendations are made as part of LM development for the concerned languages. 1) Special preparation and normalisation of the text data before LM development – paying attention to within-sentence text markers and annotation tags that may incorrectly form part of sentences, word sequences, and n-gram contexts. 2) Enable interpolation during training. 3) Develop pentagram and hexagram language models for Pedi texts, and trigrams and quadrigrams for Ndebele texts. 4) Investigate efficient smoothing methods for the different languages, especially for different text sizes and different text domains. en_US
dc.description.sponsorship National Research Foundation (NRF); Telkom; University of Limpopo en_US
dc.format.extent x, 97 leaves en_US
dc.language.iso en en_US
dc.relation.requires PDF en_US
dc.subject Language modelling en_US
dc.subject Natural language processing en_US
dc.subject Automatic speech recognition en_US
dc.subject Under-resourced languages en_US
dc.subject.lcsh Robust control en_US
dc.subject.lcsh Automatic speech recognition en_US
dc.subject.lcsh Speech perception en_US
dc.title Development of robust language models for speech recognition of under-resourced language en_US
dc.type Thesis en_US

