Abstract:
Language modelling (LM) work for under-resourced languages that does not consider
most linguistic information inherent in a language produces language models that inadequately represent the language, thereby leading to under-development of natural
language processing tools and systems such as speech recognition systems. This
study investigated the influence that the orthography (i.e., writing system) of a language has on the quality and/or robustness of the language models created for the
text of that language. The conjunctive writing system of isiNdebele (Ndebele) and the disjunctive writing system of Sepedi (Pedi) were studied.
The text data from the LWAZI and NCHLT speech corpora were used to develop language models. The LM techniques that were implemented included word-based n-gram LM, LM smoothing, LM linear interpolation, and higher-order n-gram LM. The
toolkits used for development were the HTK LM, SRILM, and CMU-Cam SLM toolkits.
From the findings of the study on text preparation, data pooling and sizing,
higher-order n-gram models, and interpolation of models, it is concluded that the orthography of the selected languages does have an effect on the quality of the language models
created for their text. The following recommendations are made as part of LM development for the languages concerned. 1) Specially prepare and normalise the text data before LM development, paying attention to within-sentence text markers
and annotation tags that may incorrectly form part of sentences, word sequences, and
n-gram contexts. 2) Enable interpolation during training. 3) Develop pentagram and
hexagram language models for Pedi texts, and trigrams and quadrigrams for Ndebele
texts. 4) Investigate efficient smoothing methods for the different languages, especially
for different text sizes and different text domains.
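
Recommendations 1 and 2 can be illustrated with a minimal Python sketch. It is not the study's pipeline, which used the HTK LM, SRILM, and CMU-Cam SLM toolkits. The bracketed tag format, the helper names (`normalise`, `train`, `interp_prob`), the interpolation weight, and the toy sentences are all assumptions made for illustration. The sketch strips annotation tags before counting, so that they cannot enter n-gram contexts, and then linearly interpolates a bigram estimate with a unigram estimate.

```python
import re
from collections import Counter

# Assumed tag format: anything in square or angle brackets, e.g. [n] or <sil>.
TAG = re.compile(r"\[[^\]]*\]|<[^>]*>")

def normalise(line):
    """Lower-case a line and drop bracketed annotation tags."""
    return TAG.sub(" ", line).lower().split()

def train(sentences):
    """Collect unigram and bigram counts with sentence-boundary markers."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + normalise(s) + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def interp_prob(word, prev, uni, bi, lam=0.7):
    """P(word | prev) = lam * P_bigram + (1 - lam) * P_unigram."""
    # "<s>" is only ever a context, so it is excluded from the unigram mass.
    total = sum(uni.values()) - uni["<s>"]
    p_uni = uni[word] / total
    p_bi = bi[(prev, word)] / uni[prev] if uni[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

# Toy sentences; the trailing [n] stands in for a hypothetical noise tag.
uni, bi = train(["ke a leboga [n]", "ke a tseba"])
p = interp_prob("leboga", "a", uni, bi)
```

The unigram term keeps a bigram unseen in training from receiving zero probability, as long as the word itself occurred. In practice the weight `lam` would be tuned on held-out text rather than fixed.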