Thesis (MEng)--Stellenbosch University, 2024.
| Main Author: | Barends, Umr |
|---|---|
| Other Authors: | Niesler, Thomas |
| Format: | Thesis |
| Language: | English |
| Published: | Stellenbosch : Stellenbosch University, 2025 |
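| Subjects: | Bambara language--Orthography and spelling; Machine learning; Natural language processing (Computer science); Endangered languages; UCTD |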
| _version_ | 1867613899340120064 |
|---|---|
| access_status_str | Open Access |
| author | Barends, Umr |
| author2 | Niesler, Thomas |
| author_browse | Barends, Umr Niesler, Thomas |
| author_facet | Niesler, Thomas Barends, Umr |
| author_sort | Barends, Umr |
| collection | Thesis |
| dc_rights_str_mv | Stellenbosch University |
| description | Thesis (MEng)--Stellenbosch University, 2024. |
| format | Thesis |
| id | oai:scholar.sun.ac.za:10019.1/131568 |
| institution | Stellenbosch University (South Africa) |
| language | English |
| last_indexed | 2026-06-10T12:43:28.625Z |
| license_str | Other — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository |
| publishDate | 2025 |
| publishDateRange | 2025 |
| publishDateSort | 2025 |
| publisher | Stellenbosch : Stellenbosch University |
| publisherStr | Stellenbosch : Stellenbosch University |
| record_format | dspace |
| source_str | SUNScholar — Stellenbosch University Repository |
| spelling | oai:scholar.sun.ac.za:10019.1/131568 Automatic orthography standardisation for under-resourced languages Barends, Umr Niesler, Thomas Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Bambara language--Orthography and spelling Machine learning Natural language processing (Computer science) Endangered languages UCTD Thesis (MEng)--Stellenbosch University, 2024. ENGLISH ABSTRACT: This work addresses the normalization of the orthography of a severely under-resourced language, taking as a specific example the West African language Bambara. One aspect of the lack of resources for such languages is that spelling and orthographic conventions are not firmly established. This leads, for example, to variations in how speech is transcribed by mother-tongue speakers, which in turn leads to inconsistencies in the annotations found in a speech corpus. According to our investigation, no data is available for the normalization of Bambara other than the very small corpus used in this work. To our knowledge, this is also the only corpus of transcribed Bambara speech. Normalizing Bambara spellings is important for systems such as ASR or text-to-speech, where more consistent spellings lead to better performance of such language-model-based systems. The baseline method, known as anagram hashing, uses word anagrams and word n-grams to perform the normalization. These methods have been used by other researchers to normalize historical text to modern spellings. In addition, we determine the performance that can be achieved by applying the machine learning methods softmax regression, LSTM and bi-LSTM. Our experiments indicate that the neural network models outperformed the anagram hashing algorithm on the task of normalizing the Bambara orthography. We also found that word-level models performed better than character-level models. Among the machine learning models, the softmax regression model performed best at normalizing the Bambara orthography. We conclude that automatic normalization of orthography using machine learning models can outperform the current state of the art, but that the small size of the training set does not allow the recurrent architectures to surpass the performance of softmax regression. AFRIKAANS ABSTRACT: No abstract available. Masters 2025-01-27T09:48:00Z 2025-01-27T09:48:00Z 2024-12 Thesis https://scholar.sun.ac.za/handle/10019.1/131568 en Stellenbosch University xiii, 71 pages : illustrations application/pdf Stellenbosch : Stellenbosch University |
| spellingShingle | Bambara language--Orthography and spelling Machine learning Natural language processing (Computer science) Endangered languages UCTD Barends, Umr Automatic orthography standardisation for under-resourced languages |
| title | Automatic orthography standardisation for under-resourced languages |
| title_full | Automatic orthography standardisation for under-resourced languages |
| title_fullStr | Automatic orthography standardisation for under-resourced languages |
| title_full_unstemmed | Automatic orthography standardisation for under-resourced languages |
| title_short | Automatic orthography standardisation for under-resourced languages |
| title_sort | automatic orthography standardisation for under resourced languages |
| topic | Bambara language--Orthography and spelling Machine learning Natural language processing (Computer science) Endangered languages UCTD |
| url | https://scholar.sun.ac.za/handle/10019.1/131568 |
| work_keys_str_mv | AT barendsumr automaticorthographystandardisationforunderresourcedlanguages |
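
The abstract names anagram hashing as the baseline but this record gives no implementation details. The following is a minimal sketch of the general idea only, not the thesis's method. It assumes a numeric key that sums each character's code point raised to the fifth power, so that all anagrams of a word share one key and single-character edits become simple additions and subtractions on that key. The function names, the `power` parameter, and the toy lexicon are hypothetical, and the word n-gram context mentioned in the abstract is not modelled here.

```python
from collections import defaultdict


def anagram_key(word: str, power: int = 5) -> int:
    """Order-independent numeric key: anagrams map to the same value.

    Assumption: code points raised to a fixed power and summed.
    """
    return sum(ord(ch) ** power for ch in word.lower())


def build_index(lexicon):
    """Map each anagram key to the set of lexicon words that share it."""
    index = defaultdict(set)
    for word in lexicon:
        index[anagram_key(word)].add(word)
    return index


def candidates(word, index, alphabet):
    """Lexicon words reachable by reordering the letters of `word`,
    or by one character insertion, deletion, or substitution."""
    key = anagram_key(word)
    letter_keys = [anagram_key(ch) for ch in alphabet]

    keys = {key}                              # pure anagrams
    for added in letter_keys:
        keys.add(key + added)                 # one insertion
    for ch in set(word.lower()):
        removed = anagram_key(ch)
        keys.add(key - removed)               # one deletion
        for added in letter_keys:
            keys.add(key - removed + added)   # one substitution

    found = set()
    for k in keys:
        found |= index.get(k, set())
    found.discard(word)
    return found


# Toy lexicon of made-up strings (not Bambara data).
index = build_index(["abc", "abd", "xyz"])
result = candidates("acb", index, alphabet="abcdxyz")
```

Because the key ignores character order, a lookup like this over-generates; a practical system would rank or filter the returned candidates, for example with edit distance or surrounding-word context.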