| Main Author: | Houston, Charles |
|---|---|
| Other Authors: | Britz, Stefan S |
| Format: | Thesis |
| Language: | English |
| Published: | Department of Statistical Sciences, 2023 |
| Subjects: | Statistical Sciences |
| _version_ | 1867613249859485696 |
|---|---|
| access_status_str | Open Access |
| author | Houston, Charles |
| author2 | Britz, Stefan S |
| author_browse | Britz, Stefan S Houston, Charles |
| author_facet | Britz, Stefan S Houston, Charles |
| author_sort | Houston, Charles |
| collection | Thesis |
| description | Despite recent improvements in speaker-independent automatic speech recognition (ASR), the performance of large-scale speech recognition systems is still significantly worse on dysarthric speech than on standard speech. Both the inherent noise of dysarthric speech and the lack of large datasets add to the difficulty of solving this problem. This thesis explores different approaches to improving the performance of Deep Learning ASR systems on dysarthric speech. The primary goal was to find out whether a model trained on thousands of hours of standard speech could successfully be fine-tuned to dysarthric speech. Deep Speech – an open-source Deep Learning based speech recognition system developed by Mozilla – was used as the baseline model. The UASpeech dataset, composed of utterances from 15 speakers with cerebral palsy, was used as the source of dysarthric speech. In addition to investigating fine-tuning, layer freezing, data augmentation and re-initialization were also investigated. Data augmentation took the form of time and frequency masking, while layer freezing consisted of fixing the first three feature extraction layers of Deep Speech during fine-tuning. Re-initialization was achieved by randomly initializing the weights of Deep Speech and training from scratch. A separate encoder-decoder recurrent neural network consisting of far fewer parameters was also trained from scratch. The Deep Speech acoustic model obtained a word error rate (WER) of 141.53% on the UASpeech test set of commands, digits, the radio alphabet, common words, and uncommon words. Once fine-tuned to dysarthric speech, a WER of 70.30% was achieved, thus demonstrating the ability of fine-tuning to improve upon the performance of a model initially trained on standard speech. While fine-tuning led to a substantial improvement in performance, the benefit of data augmentation was far more subtle, improving on the fine-tuned model by a mere 1.31%. Freezing the first three layers of Deep Speech and fine-tuning the remaining layers was slightly detrimental, increasing the WER by 0.89%. Finally, both re-initialization of Deep Speech's weights and the encoder-decoder model generated highly inaccurate predictions. The best performing model was Deep Speech fine-tuned to augmented dysarthric speech, which achieved a WER of 60.72% with the inclusion of a language model. |
| format | Thesis |
| id | oai:open.uct.ac.za:11427/37267 |
| institution | University of Cape Town (South Africa) |
| language | eng |
| last_indexed | 2026-06-10T12:33:08.525Z |
| license_str | Not specified — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository |
| publishDate | 2023 |
| publishDateRange | 2023 |
| publishDateSort | 2023 |
| publisher | Department of Statistical Sciences |
| publisherStr | Department of Statistical Sciences |
| record_format | dspace |
| source_str | UCTD — University of Cape Town Open Access Repository |
| spelling | oai:open.uct.ac.za:11427/37267 Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech Houston, Charles Britz, Stefan S Durbach, Ian Statistical Sciences Despite recent improvements in speaker-independent automatic speech recognition (ASR), the performance of large-scale speech recognition systems is still significantly worse on dysarthric speech than on standard speech. Both the inherent noise of dysarthric speech and the lack of large datasets add to the difficulty of solving this problem. This thesis explores different approaches to improving the performance of Deep Learning ASR systems on dysarthric speech. The primary goal was to find out whether a model trained on thousands of hours of standard speech could successfully be fine-tuned to dysarthric speech. Deep Speech – an open-source Deep Learning based speech recognition system developed by Mozilla – was used as the baseline model. The UASpeech dataset, composed of utterances from 15 speakers with cerebral palsy, was used as the source of dysarthric speech. In addition to investigating fine-tuning, layer freezing, data augmentation and re-initialization were also investigated. Data augmentation took the form of time and frequency masking, while layer freezing consisted of fixing the first three feature extraction layers of Deep Speech during fine-tuning. Re-initialization was achieved by randomly initializing the weights of Deep Speech and training from scratch. A separate encoder-decoder recurrent neural network consisting of far fewer parameters was also trained from scratch. The Deep Speech acoustic model obtained a word error rate (WER) of 141.53% on the UASpeech test set of commands, digits, the radio alphabet, common words, and uncommon words. Once fine-tuned to dysarthric speech, a WER of 70.30% was achieved, thus demonstrating the ability of fine-tuning to improve upon the performance of a model initially trained on standard speech. While fine-tuning led to a substantial improvement in performance, the benefit of data augmentation was far more subtle, improving on the fine-tuned model by a mere 1.31%. Freezing the first three layers of Deep Speech and fine-tuning the remaining layers was slightly detrimental, increasing the WER by 0.89%. Finally, both re-initialization of Deep Speech's weights and the encoder-decoder model generated highly inaccurate predictions. The best performing model was Deep Speech fine-tuned to augmented dysarthric speech, which achieved a WER of 60.72% with the inclusion of a language model. 2023-03-06T10:16:35Z 2023-03-06T10:16:35Z 2022 2023-02-20T12:56:38Z Master Thesis Masters MSc http://hdl.handle.net/11427/37267 eng application/pdf Department of Statistical Sciences Faculty of Science |
| spellingShingle | Statistical Sciences Houston, Charles Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech |
| thesis_degree_str | Master's |
| title | Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech |
| title_full | Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech |
| title_fullStr | Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech |
| title_full_unstemmed | Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech |
| title_short | Adapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech |
| title_sort | adapting large scale speaker independent automatic speech recognition to dysarthric speech |
| topic | Statistical Sciences |
| url | http://hdl.handle.net/11427/37267 |
| work_keys_str_mv | AT houstoncharles adaptinglargescalespeakerindependentautomaticspeechrecognitiontodysarthricspeech |
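
The description reports a baseline word error rate (WER) above 100%, which can look like a mistake at first glance. The sketch below shows one way this can happen, assuming the conventional definition of WER as word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The record does not state which formula or tooling the thesis used, so the function name `word_error_rate` and the example utterances are purely illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: this is the conventional WER definition, not necessarily
    the exact scoring procedure used in the thesis.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical single-word prompt: the recognizer outputs two wrong words,
# giving one substitution plus one insertion against one reference word.
print(word_error_rate("yes", "he is"))
```

Because insertions count as errors but do not lengthen the reference, a recognizer that outputs several words for a one-word prompt can score well above 100%. The UASpeech test material listed in the description (commands, digits, the radio alphabet, and individual words) consists of short prompts, so this effect is plausible there, though the record itself does not give the cause.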