End-to-end automatic speech recognition of code-switched speech

Thesis (MEng)--Stellenbosch University, 2023.

Bibliographic Details
Main Author:	Tredoux, Larissa
Other Authors:	Niesler, Thomas
Format:	Thesis
Language:	en_ZA
Published:	Stellenbosch : Stellenbosch University 2023
Subjects:	Automatic speech recognition > South Africa; Code switching (Linguistics) > South Africa; Bantu languages > Transcriptions > South Africa; English language > Transcriptions > South Africa; Corpora (Linguistics) > South Africa

_version_	1867614046558093312
access_status_str	Open Access
author	Tredoux, Larissa
author2	Niesler, Thomas
author_browse	Niesler, Thomas Tredoux, Larissa
author_facet	Niesler, Thomas Tredoux, Larissa
author_sort	Tredoux, Larissa
collection	Thesis
dc_rights_str_mv	Stellenbosch University
description	Thesis (MEng)--Stellenbosch University, 2023.
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/128940
institution	Stellenbosch University (South Africa)
language	en_ZA en_ZA
last_indexed	2026-06-10T12:45:48.703Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2023
publishDateRange	2023
publishDateSort	2023
publisher	Stellenbosch : Stellenbosch University
publisherStr	Stellenbosch : Stellenbosch University
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/128940 End-to-end automatic speech recognition of code-switched speech Tredoux, Larissa Niesler, Thomas Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Automatic speech recognition -- South Africa Code switching (Linguistics) -- South Africa Bantu languages -- Transcriptions -- South Africa English language -- Transcriptions -- South Africa Corpora (Linguistics) -- South Africa Thesis (MEng)--Stellenbosch University, 2023. ENGLISH ABSTRACT: Automatic speech recognition (ASR) of code-switched speech is a relevant problem in many multilingual societies and has heretofore been accomplished by hybrid HMM-DNN architectures. However, developing these models requires a large amount of time and linguistic expertise. End-to-end ASR, though typically requiring thousands of hours of training data, is conceptually and linguistically simpler to implement. We therefore aim to present an end-to-end ASR system that is capable of producing transcriptions for code-switched speech in Bantu languages and English from the South African Corpus of Multilingual Code-switched Soap Opera Speech (Soap Opera corpus). This corpus consists of 21 hours of speech in four Bantu languages and English, whereas thousands of hours are typically required to train end-to-end ASR systems. To compensate for our low-resource scenario, we fine-tune and evaluate XLSR-53 wav2vec 2.0 models that have been pre-trained on 56 000 hours of data from 53 different languages. Fine-tuning and evaluation are performed on data from the Soap Opera corpus. We also pre-train and fine-tune wav2vec 2.0 models on the Soap Opera corpus, and incorporate data from the NCHLT corpus during pre-training in later experiments. Additionally, we reduce the size of the wav2vec 2.0 model in order to minimise the potential for overfitting on our small training set. The inclusion of more in-domain data and reduced model sizes both improve ASR results. 
However, XLSR-53 models fine-tuned on data from the Soap Opera corpus perform the best among our models. When a 5-gram language model is incorporated during decoding, the XLSR-53 models are able to achieve a word error rate of 44.1% and a character error rate of 18.2% on the five-lingual test set of the Soap Opera corpus. This word error rate compares well with results achieved by hybrid HMM-DNN models. We conclude that the wide variety and very large quantity of data used to train the XLSR-53 models is beneficial for ASR of code-switched speech. More generally, we conclude that it is desirable if not essential to pre-train end-to-end models on large quantities of data in order to achieve end-to-end ASR results that can compete well with hybrid HMM-DNN models. AFRIKAANSE OPSOMMING: Outomaties spraakherkenning (OSH) van kodewissel spraak is ’n relevante probleem in baie veeltalige samelewings. Tot dusver is dit bewerkstellig deur van hibriede HMM-DNN argitekture gebruik te maak. Dit verg egter groot hoeveelhede tyd en taalvaardighede om te ontwikkel. Begin-tot-end OSH, hoewel dit tipies duisende ure se opleidingsdata benodig, is konsepsueel en taalgewys eenvoudiger om te implementeer. Dus beoog ons om ’n begin-tot-end OSH stelsel, met die vermoë om transkripsies te lewer vir kodewissel spraak in Bantoetale en Engels, voor te stel. Die korpus wat gebruik is, is die South African Corpus of Multilingual Code-switched Speech (Sepie Korpus), en bevat slegs 21 uur se spraak in Bantoetale asook Engels, terwyl dit tipies duisende ure verg om begin-tot-end OSH op te lei. Om te kompenseer vir hierdie lae-hulpbron scenario, sal ons XLSR-53 wav2vec 2.0 modelle verfyn en evalueer wat vooraf opgelei is met 56 000 ure se data bestaande uit 53 verskillende tale. Die verfyning en evaluasie sal gedoen word met die Sepie Korpus. Verder sal wav2vec 2.0 modelle ook vooraf opgelei en verfyn word met die Sepie Korpus. 
Data van die NCHLT korpus word ook geïnkorporeer gedurende die voorafopleiding in latere eksperimente. Die grootte van die wav2vec 2.0 model word ook verklein om oor-passing te voorkom met die beperkte oplei datastel. Die insluit van in-domein data, tesame met die verkleinde model, het gelei tot verbeterde OSH resultate. Die XLSR-53 modelle, verfyn met die Sepie Korpus data het egter die beste resultate behaal van al ons modelle. Wanneer ’n 5-gram taalmodel geïnkorporeer word gedurende dekodering, is die XLSR-53 modelle in staat om ’n 44.1% woord-fout koers, en ’n karakter-fout koers van 18.2% te behaal met die vyf-taal Sepie Korpus toets datastel. Hierdie woord-fout koers vergelyk goed met die resultate behaal deur hibriede HMM-DNN modelle. Ons kom dus tot die gevolgtrekking dat die gebruik van die wye verskeidenheid en groot hoeveelhede data wat gebruik is in die voorafopleiding van die XLSR-53 modelle, voordelig is vir die OSH van kodewissel spraak. Verder, en meer algemeen, is ons gevolgtrekking ook dat dit wenslik, indien nie essensieel nie, is om modelle vooraf op te lei met groot hoeveelhede data ten einde begin-tot-end OSH resultate te behaal wat kan vergelyk met dié van hibriede HMM-DNN modelle. Masters 2023-11-21T09:39:21Z 2024-01-08T16:34:54Z 2023-11-21T09:39:21Z 2024-01-08T16:34:54Z 2023-12 Thesis https://scholar.sun.ac.za/handle/10019.1/128940 en_ZA en_ZA Stellenbosch University xvii, 132 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle	Automatic speech recognition -- South Africa Code switching (Linguistics) -- South Africa Bantu languages -- Transcriptions -- South Africa English language -- Transcriptions -- South Africa Corpora (Linguistics) -- South Africa Tredoux, Larissa End-to-end automatic speech recognition of code-switched speech
title	End-to-end automatic speech recognition of code-switched speech
title_full	End-to-end automatic speech recognition of code-switched speech
title_fullStr	End-to-end automatic speech recognition of code-switched speech
title_full_unstemmed	End-to-end automatic speech recognition of code-switched speech
title_short	End-to-end automatic speech recognition of code-switched speech
title_sort	end to end automatic speech recognition of code switched speech
topic	Automatic speech recognition -- South Africa Code switching (Linguistics) -- South Africa Bantu languages -- Transcriptions -- South Africa English language -- Transcriptions -- South Africa Corpora (Linguistics) -- South Africa
url	https://scholar.sun.ac.za/handle/10019.1/128940
work_keys_str_mv	AT tredouxlarissa endtoendautomaticspeechrecognitionofcodeswitchedspeech
