Automatic orthography standardisation for under-resourced languages

Thesis (MEng)--Stellenbosch University, 2024.

Bibliographic Details
Main Author: Barends, Umr
Other Authors: Niesler, Thomas
Format: Thesis
Language: English
Published: Stellenbosch : Stellenbosch University 2025
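Subjects: Bambara language--Orthography and spelling; Machine learning; Natural language processing (Computer science); Endangered languages; UCTD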
_version_ 1867613899340120064
access_status_str Open Access
author Barends, Umr
author2 Niesler, Thomas
author_browse Barends, Umr
Niesler, Thomas
author_facet Niesler, Thomas
Barends, Umr
author_sort Barends, Umr
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MEng)--Stellenbosch University, 2024.
format Thesis
id oai:scholar.sun.ac.za:10019.1/131568
institution Stellenbosch University (South Africa)
language English
last_indexed 2026-06-10T12:43:28.625Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2025
publishDateRange 2025
publishDateSort 2025
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/131568 Automatic orthography standardisation for under-resourced languages Barends, Umr Niesler, Thomas Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Bambara language--Orthography and spelling Machine learning Natural language processing (Computer science) Endangered languages UCTD Thesis (MEng)--Stellenbosch University, 2024.
ENGLISH ABSTRACT: This work addresses the normalization of the orthography of a severely under-resourced language, taking as a specific example the West African language known as Bambara. One aspect of the lack of resources for such languages is that spelling and orthographic conventions are not firmly established. This leads, for example, to variations in how speech is transcribed by mother-tongue speakers, which in turn leads to inconsistencies in the annotations found in a speech corpus. According to our investigation, there is no data available for the normalization of the Bambara language other than the very small corpus used in this work. To our knowledge, this is also the only corpus of transcribed Bambara speech. Normalizing Bambara spellings is important for systems such as automatic speech recognition (ASR) or text-to-speech, where more consistent spellings lead to better performance of such language-model-based systems. The baseline method, known as anagram hashing, uses word anagrams and word n-grams to perform the normalization. Such methods have been used by other researchers to normalize historical text to modern spellings. In addition, we determine the performance that can be achieved by applying three machine learning methods: softmax regression, LSTM and bi-LSTM. Our experiments indicate that the neural network models outperformed the anagram hashing algorithm on the task of normalizing the Bambara orthography. We also found that word-level models performed better than character-level models. Among the machine learning models, the softmax regression model performed best at normalizing the Bambara orthography. We conclude that automatic normalization of orthography using machine learning models can achieve performance superior to the current state of the art, but that the small size of the training set does not allow the recurrent architectures to surpass the performance of softmax regression.
AFRIKAANS SUMMARY: No summary available.
Masters 2025-01-27T09:48:00Z 2025-01-27T09:48:00Z 2024-12 Thesis https://scholar.sun.ac.za/handle/10019.1/131568 en Stellenbosch University xiii, 71 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle Bambara language--Orthography and spelling
Machine learning
Natural language processing (Computer science)
Endangered languages
UCTD
Barends, Umr
Automatic orthography standardisation for under-resourced languages
title Automatic orthography standardisation for under-resourced languages
title_full Automatic orthography standardisation for under-resourced languages
title_fullStr Automatic orthography standardisation for under-resourced languages
title_full_unstemmed Automatic orthography standardisation for under-resourced languages
title_short Automatic orthography standardisation for under-resourced languages
title_sort automatic orthography standardisation for under resourced languages
topic Bambara language--Orthography and spelling
Machine learning
Natural language processing (Computer science)
Endangered languages
UCTD
url https://scholar.sun.ac.za/handle/10019.1/131568
work_keys_str_mv AT barendsumr automaticorthographystandardisationforunderresourcedlanguages