Thesis (MEng)--Stellenbosch University, 2023.
| Main Author: | Tredoux, Larissa |
|---|---|
| Other Authors: | Niesler, Thomas |
| Format: | Thesis |
| Language: | en_ZA |
| Published: | Stellenbosch : Stellenbosch University, 2023 |
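| Subjects: | Automatic speech recognition -- South Africa; Code switching (Linguistics) -- South Africa; Bantu languages -- Transcriptions -- South Africa; English language -- Transcriptions -- South Africa; Corpora (Linguistics) -- South Africa |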
| _version_ | 1867614046558093312 |
|---|---|
| access_status_str | Open Access |
| author | Tredoux, Larissa |
| author2 | Niesler, Thomas |
| author_browse | Niesler, Thomas Tredoux, Larissa |
| author_facet | Niesler, Thomas Tredoux, Larissa |
| author_sort | Tredoux, Larissa |
| collection | Thesis |
| dc_rights_str_mv | Stellenbosch University |
| description | Thesis (MEng)--Stellenbosch University, 2023. |
| format | Thesis |
| id | oai:scholar.sun.ac.za:10019.1/128940 |
| institution | Stellenbosch University (South Africa) |
| language | en_ZA en_ZA |
| last_indexed | 2026-06-10T12:45:48.703Z |
| license_str | Other — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository |
| publishDate | 2023 |
| publishDateRange | 2023 |
| publishDateSort | 2023 |
| publisher | Stellenbosch : Stellenbosch University |
| publisherStr | Stellenbosch : Stellenbosch University |
| record_format | dspace |
| source_str | SUNScholar — Stellenbosch University Repository |
| spelling | oai:scholar.sun.ac.za:10019.1/128940 End-to-end automatic speech recognition of code-switched speech Tredoux, Larissa Niesler, Thomas Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Automatic speech recognition -- South Africa Code switching (Linguistics) -- South Africa Bantu languages -- Transcriptions -- South Africa English language -- Transcriptions -- South Africa Corpora (Linguistics) -- South Africa Thesis (MEng)--Stellenbosch University, 2023. ENGLISH ABSTRACT: Automatic speech recognition (ASR) of code-switched speech is a relevant problem in many multilingual societies and has heretofore been accomplished by hybrid HMM-DNN architectures. However, developing these models requires a large amount of time and linguistic expertise. End-to-end ASR, though typically requiring thousands of hours of training data, is conceptually and linguistically simpler to implement. We therefore aim to present an end-to-end ASR system that is capable of producing transcriptions for code-switched speech in Bantu languages and English from the South African Corpus of Multilingual Code-switched Soap Opera Speech (Soap Opera corpus). This corpus consists of 21 hours of speech in four Bantu languages and English, whereas thousands of hours are typically required to train end-to-end ASR systems. To compensate for our low-resource scenario, we fine-tune and evaluate XLSR-53 wav2vec 2.0 models that have been pre-trained on 56 000 hours of data from 53 different languages. Fine-tuning and evaluation are performed on data from the Soap Opera corpus. We also pre-train and fine-tune wav2vec 2.0 models on the Soap Opera corpus, and incorporate data from the NCHLT corpus during pre-training in later experiments. Additionally, we reduce the size of the wav2vec 2.0 model in order to minimise the potential for overfitting on our small training set. 
The inclusion of more in-domain data and reduced model sizes both improve ASR results. However, XLSR-53 models fine-tuned on data from the Soap Opera corpus perform the best among our models. When a 5-gram language model is incorporated during decoding, the XLSR-53 models are able to achieve a word error rate of 44.1% and a character error rate of 18.2% on the five-lingual test set of the Soap Opera corpus. This word error rate compares well with results achieved by hybrid HMM-DNN models. We conclude that the wide variety and very large quantity of data used to train the XLSR-53 models is beneficial for ASR of code-switched speech. More generally, we conclude that it is desirable if not essential to pre-train end-to-end models on large quantities of data in order to achieve end-to-end ASR results that can compete well with hybrid HMM-DNN models. AFRIKAANSE OPSOMMING: Outomaties spraakherkenning (OSH) van kodewissel spraak is ’n relevante probleem in baie veeltalige samelewings. Tot dusver is dit bewerkstellig deur van hibriede HMM-DNN argitekture gebruik te maak. Dit verg egter groot hoeveelhede tyd en taalvaardighede om te ontwikkel. Begin-tot-end OSH, hoewel dit tipies duisende ure se opleidingsdata benodig, is konsepsueel en taalgewys eenvoudiger om te implementeer. Dus beoog ons om ’n begin-tot-end OSH stelsel, met die vermoë om transkripsies te lewer vir kodewissel spraak in Bantoetale en Engels, voor te stel. Die korpus wat gebruik is, is die South African Corpus of Multilingual Code-switched Speech (Sepie Korpus), en bevat slegs 21 uur se spraak in Bantoetale asook Engels, terwyl dit tipies duisende ure verg om begin-tot-end OSH op te lei. Om te kompenseer vir hierdie lae-hulpbron scenario, sal ons XLSR-53 wav2vec 2.0 modelle verfyn en evalueer wat vooraf opgelei is met 56 000 ure se data bestaande uit 53 verskillende tale. Die verfyning en evaluasie sal gedoen word met die Sepie Korpus. 
Verder sal wav2vec 2.0 modelle ook vooraf opgelei en verfyn word met die Sepie Korpus. Data van die NCHLT korpus word ook geïnkorporeer gedurende die voorafopleiding in latere eksperimente. Die grootte van die wav2vec 2.0 model word ook verklein om oor-passing te voorkom met die beperkte oplei datastel. Die insluit van in-domein data, tesame met die verkleinde model, het gelei tot verbeterde OSH resultate. Die XLSR-53 modelle, verfyn met die Sepie Korpus data het egter die beste resultate behaal van al ons modelle. Wanneer ’n 5-gram taalmodel geïnkorporeer word gedurende dekodering, is die XLSR-53 modelle in staat om ’n 44.1% woord-fout koers, en ’n karakter-fout koers van 18.2% te behaal met die vyf-taal Sepie Korpus toets datastel. Hierdie woord-fout koers vergelyk goed met die resultate behaal deur hibriede HMM-DNN modelle. Ons kom dus tot die gevolgtrekking dat die gebruik van die wye verskeidenheid en groot hoeveelhede data wat gebruik is in die voorafopleiding van die XLSR-53 modelle, voordelig is vir die OSH van kodewissel spraak. Verder, en meer algemeen, is ons gevolgtrekking ook dat dit wenslik, indien nie essensieel nie, is om modelle vooraf op te lei met groot hoeveelhede data ten einde begin-tot-end OSH resultate te behaal wat kan vergelyk met dié van hibriede HMM-DNN modelle. Masters 2023-11-21T09:39:21Z 2024-01-08T16:34:54Z 2023-11-21T09:39:21Z 2024-01-08T16:34:54Z 2023-12 Thesis https://scholar.sun.ac.za/handle/10019.1/128940 en_ZA en_ZA Stellenbosch University xvii, 132 pages : illustrations application/pdf Stellenbosch : Stellenbosch University |
| spellingShingle | Automatic speech recognition -- South Africa Code switching (Linguistics) -- South Africa Bantu languages -- Transcriptions -- South Africa English language -- Transcriptions -- South Africa Corpora (Linguistics) -- South Africa Tredoux, Larissa End-to-end automatic speech recognition of code-switched speech |
| title | End-to-end automatic speech recognition of code-switched speech |
| title_full | End-to-end automatic speech recognition of code-switched speech |
| title_fullStr | End-to-end automatic speech recognition of code-switched speech |
| title_full_unstemmed | End-to-end automatic speech recognition of code-switched speech |
| title_short | End-to-end automatic speech recognition of code-switched speech |
| title_sort | end to end automatic speech recognition of code switched speech |
| topic | Automatic speech recognition -- South Africa Code switching (Linguistics) -- South Africa Bantu languages -- Transcriptions -- South Africa English language -- Transcriptions -- South Africa Corpora (Linguistics) -- South Africa |
| url | https://scholar.sun.ac.za/handle/10019.1/128940 |
| work_keys_str_mv | AT tredouxlarissa endtoendautomaticspeechrecognitionofcodeswitchedspeech |