Automatic orthography standardisation for under-resourced languages

Thesis (MEng)--Stellenbosch University, 2024.

Bibliographic Details
Main Author:	Barends, Umr
Other Authors:	Niesler, Thomas
Format:	Thesis
Language:	English
Published:	Stellenbosch : Stellenbosch University 2025
Subjects:	Bambara language > Orthography and spelling Machine learning Natural language processing (Computer science) Endangered languages UCTD

_version_	1867613899340120064
access_status_str	Open Access
author	Barends, Umr
author2	Niesler, Thomas
author_browse	Barends, Umr Niesler, Thomas
author_facet	Niesler, Thomas Barends, Umr
author_sort	Barends, Umr
collection	Thesis
dc_rights_str_mv	Stellenbosch University
description	Thesis (MEng)--Stellenbosch University, 2024.
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/131568
institution	Stellenbosch University (South Africa)
language	English
last_indexed	2026-06-10T12:43:28.625Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2025
publishDateRange	2025
publishDateSort	2025
publisher	Stellenbosch : Stellenbosch University
publisherStr	Stellenbosch : Stellenbosch University
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/131568 Automatic orthography standardisation for under-resourced languages Barends, Umr Niesler, Thomas Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Bambara language--Orthography and spelling Machine learning Natural language processing (Computer science) Endangered languages UCTD Thesis (MEng)--Stellenbosch University, 2024. ENGLISH ABSTRACT: This work addresses the normalization of the orthography of a severely under-resourced language, taking as a specific example the West African language known as Bambara. One aspect of the lack of resources for such languages is that spelling and orthographic conventions are not firmly established. This leads, for example, to variations in how speech is transcribed by mother-tongue speakers, which in turn leads to inconsistencies in the annotations found in a speech corpus. According to our investigation, there is no data available for the normalization of the Bambara language other than the very small corpus used in this work. To our knowledge, this is also the only corpus of transcribed Bambara speech. Normalizing Bambara spellings is important for systems such as ASR or text-to-speech, where more consistent spellings lead to better performance of such language-model-based systems. The baseline method, known as anagram hashing, uses word anagrams and word n-grams to perform the normalization. These methods have been used by other researchers to normalize historical text to modern spellings. In addition, we determine the performance that can be achieved by applying the machine learning methods softmax regression, LSTM and bi-LSTM. Our experiments indicate that the neural network models outperformed the anagram hashing algorithm on the task of normalizing the Bambara orthography. We also found that word-level models performed better than character-level models.
Among the machine learning models, the softmax regression model performed best at normalizing the Bambara orthography. We conclude that it is possible to perform automatic normalization of orthography using machine learning models that is superior to the current state-of-the-art, but that the small size of the training set does not allow the recurrent architecture to surpass the performance of softmax regression. AFRIKAANSE OPSOMMING: Geen opsomming beskikbaar. Masters 2025-01-27T09:48:00Z 2025-01-27T09:48:00Z 2024-12 Thesis https://scholar.sun.ac.za/handle/10019.1/131568 en Stellenbosch University xiii, 71 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle	Bambara language--Orthography and spelling Machine learning Natural language processing (Computer science) Endangered languages UCTD Barends, Umr Automatic orthography standardisation for under-resourced languages
title	Automatic orthography standardisation for under-resourced languages
title_full	Automatic orthography standardisation for under-resourced languages
title_fullStr	Automatic orthography standardisation for under-resourced languages
title_full_unstemmed	Automatic orthography standardisation for under-resourced languages
title_short	Automatic orthography standardisation for under-resourced languages
title_sort	automatic orthography standardisation for under resourced languages
topic	Bambara language--Orthography and spelling Machine learning Natural language processing (Computer science) Endangered languages UCTD
url	https://scholar.sun.ac.za/handle/10019.1/131568
work_keys_str_mv	AT barendsumr automaticorthographystandardisationforunderresourcedlanguages

Similar Items