Development of robust language models for speech recognition of under-resourced language

dc.contributor.advisor: Manamela, M. J. D.
dc.contributor.author: Sindana, Daniel
dc.contributor.other: Modipa, T. I.
dc.date.accessioned: 2021-07-29T08:50:55Z
dc.date.available: 2021-07-29T08:50:55Z
dc.date.issued: 2020
dc.description: Thesis (M.Sc. (Computer Science)) -- University of Limpopo, 2020 [en_US]
dc.description.abstract: Language modelling (LM) work for under-resourced languages that does not consider most of the linguistic information inherent in a language produces language models that inadequately represent the language, thereby leading to under-development of natural language processing tools and systems such as speech recognition systems. This study investigated the influence that the orthography (i.e., writing system) of a language has on the quality and/or robustness of the language models created for the text of that language. The unique conjunctive and disjunctive writing systems of isiNdebele (Ndebele) and Sepedi (Pedi) were studied. The text data from the LWAZI and NCHLT speech corpora were used to develop language models. The LM techniques that were implemented included word-based n-gram LM, LM smoothing, LM linear interpolation, and higher-order n-gram LM. The toolkits used for development were the HTK LM, SRILM, and CMU-Cam SLM toolkits. From the findings of the study (on text preparation, data pooling and sizing, higher-order n-gram models, and interpolation of models), it is concluded that the orthography of the selected languages does have an effect on the quality of the language models created for their text. The following recommendations are made as part of LM development for the concerned languages. 1) Apply special preparation and normalisation to the text data before LM development, paying attention to within-sentence text markers and annotation tags that may incorrectly form part of sentences, word sequences, and n-gram contexts. 2) Enable interpolation during training. 3) Develop pentagram and hexagram language models for Pedi texts, and trigrams and quadrigrams for Ndebele texts. 4) Investigate efficient smoothing methods for the different languages, especially for different text sizes and different text domains. [en_US]
dc.description.sponsorship: National Research Foundation (NRF) Telkom University of Limpopo [en_US]
dc.format.extent: x, 97 leaves [en_US]
dc.identifier.uri: http://hdl.handle.net/10386/3413
dc.language.iso: en [en_US]
dc.relation.requires: PDF [en_US]
dc.subject: Language modelling [en_US]
dc.subject: Natural language processing [en_US]
dc.subject: Automatic speech recognition [en_US]
dc.subject: Under-resourced languages [en_US]
dc.subject.lcsh: Robust control [en_US]
dc.subject.lcsh: Automatic speech recognition [en_US]
dc.subject.lcsh: Speech perception [en_US]
dc.title: Development of robust language models for speech recognition of under-resourced language [en_US]
dc.type: Thesis [en_US]

Files

Original bundle

Name: sindana_d_2020.pdf
Size: 1.44 MB
Format: Adobe Portable Document Format
Description: Thesis

License bundle

Name: license.txt
Size: 1.61 KB
Format: Item-specific license agreed upon to submission