Disentangled representations in speech processing applications

Thesis (PhD)--Stellenbosch University, 2024.

Bibliographic Details
Main Author: Baas, Matthew
Other Authors: Kamper, Herman
Format: Thesis
Language: English
Published: Stellenbosch : Stellenbosch University 2025
Subjects: Speech processing systems; Neural networks (Computer science); Automatic speech recognition; UCTD
_version_ 1867613865612673024
access_status_str Open Access
author Baas, Matthew
author2 Kamper, Herman
author_browse Baas, Matthew
Kamper, Herman
author_facet Kamper, Herman
Baas, Matthew
author_sort Baas, Matthew
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (PhD)--Stellenbosch University, 2024.
format Thesis
id oai:scholar.sun.ac.za:10019.1/131559
institution Stellenbosch University (South Africa)
language English
last_indexed 2026-06-10T12:42:55.322Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2025
publishDateRange 2025
publishDateSort 2025
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/131559 Disentangled representations in speech processing applications Baas, Matthew Kamper, Herman Stellenbosch University. Faculty of Engineering. Dept. of Electrical & Electronic Engineering. Speech processing systems Neural networks (Computer science) Automatic speech recognition UCTD Thesis (PhD)--Stellenbosch University, 2024. ENGLISH ABSTRACT: A central goal in systems that produce speech is to easily control high-level characteristics of the speech while retaining naturalness. If we had such a system, it would enable a range of speech processing applications, from more realistic speech assistants, to assistive applications for those with speech disfluencies. There is, however, a tension: most of the best existing speech processing methods rely on the explicit disentanglement of speech, but we know that humans process speech as a purely continuous signal without explicit factorizations. The explicit nature of the former means that they typically identify a handful of meaningful characteristics of speech beforehand and then carefully design systems to measure, disentangle, and modify these characteristics, recombining them to produce output speech. For example, speech synthesis systems might design modules to model speaker identity, prosody, and emotion separately from one another. This has a key limitation: explicitly identifying the discrete set of aspects that comprises speech is a contested and open-ended task, with some aspects intrinsically tied to one another, prohibiting explicit demarcation (e.g. phonemic identity is tied to timing in certain languages). We observe the recent progress in other domains whereby injecting knowledge from domain experts when designing a model is inferior to more general methods that make fewer assumptions about the data and task. Instead, better performance is obtained when the disentanglement is learnt from the data rather than externally imposed. 
This leads to our primary aim: to investigate how we might bridge this tension between explicit and implicit disentanglement in speech processing. Our main claim is that continuous methods that implicitly learn the various aspects of speech can yield improved generalization compared to discrete methods with explicit demarcations of speech characteristics. However, we also observe that discrete methods offer easier training and scaling over purely continuous ones. This thesis is divided into four parts. The first part investigates disentanglement in unconditional speech synthesis. We propose a new generative adversarial network (GAN) based approach for unconditional speech synthesis that generates speech purely from a continuous latent space without explicit conditioning. We introduce new techniques to optimize our model for learning a disentangled latent space whereby linear subspaces correspond to meaningful characteristics of speech. In experiments in a constrained setting of limited-vocabulary speech, we confirm that the learnt latent space is more disentangled than existing methods. And, critically, we demonstrate a key benefit of learning a disentangled, continuous latent space by showing that our GAN can perform several speech processing tasks unseen during training. Specifically, we show that – with simple linear operations in the latent space – we can perform voice conversion, speech enhancement, speaker verification, and keyword classification, in some cases to a similar level to task-specific baselines. This investigation shows that learning good continuous latent spaces enables generalization to unseen tasks. The second part uses the insights of the first investigation to develop a model for a hard, practical task: any-to-any voice conversion. The result is k-nearest neighbors voice conversion (kNN-VC), a method which uses the linearly disentangled nature of features produced by existing speech representation models to perform voice conversion. 
As the name suggests, it is a method whereby provided speech is converted to sound like a desired target speaker by simply replacing each feature from the source with the mean of the k-nearest neighbors from the target. Despite its simple nature, compared to top performing existing methods, kNN-VC achieves a new state-of-the-art for voice conversion. Not using discrete speaker labels enables the model to interpolate between voices, perform inference on unseen languages, and even be adapted to sample new speakers from a text prompt. The third part of this thesis attempts to tackle the tension from the other side: can we give discrete methods similar benefits to continuous methods? We investigate this by attempting to incorporate disentangled continuous units for a task that necessitates discrete outputs: speech recognition. Specifically, we propose the first discrete diffusion model for speech recognition. Using the disentangled features from the prior investigation as conditioning, we iteratively refine a multinomial distribution over characters until we arrive at a final coherent transcript. We demonstrate comparable performance to existing state-of-the-art contrastive models on the LibriSpeech speech recognition benchmark. Compared to the dynamic programming algorithms necessary to decode from contrastive models, the output produced by our discrete diffusion model is readily interpretable. It also allows for the extensions afforded by various diffusion decoding techniques. This shows that adapting continuous domain methods (denoising diffusion) and disentangled continuous features for discrete domains yields certain benefits. In the fourth part, we demonstrate the practical usefulness of the knowledge obtained from the first three parts by applying the lessons and methods to improve several existing speech processing tasks. 
Concretely, we demonstrate how voice conversion applied to unseen languages can be used to improve speech recognition in very low-resource settings. We also investigate how voice conversion can aid those with speech disfluencies by correcting stuttered speech, and test its generalization limits by investigating human-to-instrument conversions. In summary, this thesis shows that the learnt disentanglements provided by continuous speech processing models allow for simpler generalization and control, but that discrete/explicit disentanglement methods still retain benefits in terms of ease of training and scalability. The question remains open as to what the best way is to combine the strengths of both approaches – or if it is even truly possible. AFRIKAANS ABSTRACT (English translation): Light electric vehicles (LEVs) have gained prominence as an attractive mode of transport across diverse driving conditions. The development of these vehicles is driven by the pursuit of improved efficiency, reduced volume, and cost-effectiveness. This growing interest in LEVs has spurred the exploration of various propulsion options to meet these evolving requirements. This dissertation is devoted to the investigation and design of a PM synchronous motor tailored specifically to light electric vehicle applications, and to ensuring ease of integration with the other components of the drivetrain. The integral components of the proposed integrated motor drive include the magnetic gear, electric motor, and electric drive system, with the electric motor serving as the heart of the system. Many kinds of design choices are available for the electric motor, and difficulties are generally encountered in design optimization for traction applications. The focus of this dissertation therefore revolves around these design choices and the correct implementation of an optimization method to find a cost-effective design.
Furthermore, research has shown that certain methods exist for achieving a high-speed range in traction motors. While existing research emphasizes high-speed traction motor applications using V-shaped rotor designs and intricate control strategies, this work advocates a more cost-effective solution using an SPM motor capable of reaching optimal speeds for the specific application. A thorough exploration of various topologies for light electric vehicles, including rotor choices and winding layouts, is undertaken and systematically compared. The findings show that a cost-effective SPM motor with a distributed winding and SPM rotor choice can achieve an extended high-speed range without sacrificing cost-effectiveness. This design choice is investigated further through unique optimizations that use finite element analysis and a drive-cycle approach. A new flux mapping method is developed, leading to the identification of elongated stator teeth as a strategic improvement. This design modification improves synchronous inductance, which in turn improves field-weakening capability and thermal efficiency. The culmination of this work highlights the effectiveness of the optimization strategy, demonstrating how extended field weakening can be achieved by enlarging the slot-tooth region. This research contributes valuable insights into achieving a balance between cost-effectiveness and a high-performance design, and sheds light on the potential of surface-mounted PMs for electric motors to be implemented in light electric vehicles. Doctoral 2025-01-27T08:09:58Z 2025-01-27T08:09:58Z 2024-12 Thesis https://scholar.sun.ac.za/handle/10019.1/131559 en Stellenbosch University xxiii, 127 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle Speech processing systems
Neural networks (Computer science)
Automatic speech recognition
UCTD
Baas, Matthew
Disentangled representations in speech processing applications
title Disentangled representations in speech processing applications
title_full Disentangled representations in speech processing applications
title_fullStr Disentangled representations in speech processing applications
title_full_unstemmed Disentangled representations in speech processing applications
title_short Disentangled representations in speech processing applications
title_sort disentangled representations in speech processing applications
topic Speech processing systems
Neural networks (Computer science)
Automatic speech recognition
UCTD
url https://scholar.sun.ac.za/handle/10019.1/131559
work_keys_str_mv AT baasmatthew disentangledrepresentationsinspeechprocessingapplications