Acoustic unit discovery with zero-resource applications

Thesis (PhD)--Stellenbosch University, 2024.

Bibliographic Details
Main Author:	Van Niekerk, Benjamin Lipman
Other Authors:	Kamper, H.
Format:	Thesis
Published:	Stellenbosch : Stellenbosch University 2025
Subjects:	Automatic speech recognition Phonetics, Acoustic Computer-assisted instruction UCTD

_version_	1867613901973094400
access_status_str	Open Access
author	Van Niekerk, Benjamin Lipman
author2	Kamper, H.
author_browse	Kamper, H. Van Niekerk, Benjamin Lipman
author_facet	Kamper, H. Van Niekerk, Benjamin Lipman
author_sort	Van Niekerk, Benjamin Lipman
collection	Thesis
dc_rights_str_mv	Stellenbosch University
description	Thesis (PhD)--Stellenbosch University, 2024.
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/131957
institution	Stellenbosch University (South Africa)
last_indexed	2026-06-10T12:43:30.888Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2025
publishDateRange	2025
publishDateSort	2025
publisher	Stellenbosch : Stellenbosch University
publisherStr	Stellenbosch : Stellenbosch University
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/131957 Acoustic unit discovery with zero-resource applications Van Niekerk, Benjamin Lipman Kamper, H. Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Automatic speech recognition Phonetics, Acoustic Computer-assisted instruction UCTD Thesis (PhD)--Stellenbosch University, 2024. ENGLISH ABSTRACT: The goal of acoustic unit discovery is to learn discrete, phone-like representations of speech without labeled data. The main challenge is the enormous variability of spoken language. How do we identify phonetic categories despite differences in context, intonation, speaking rate, and speaker identity? Recent progress on this problem has centered around self-supervised learning. However, there is still a gap between phonetic categories and the discovered acoustic units. To bridge this divide, Part I of this thesis explores methods to improve speaker invariance, reduce bitrates, and increase unit consistency. First, we propose two lightweight models that use vector quantization to learn discrete representations of speech. The first is an autoencoder, which converts speech to a sequence of acoustic units before trying to reconstruct the input waveform. The second model combines vector quantization with contrastive predictive coding. The idea is to learn representations of speech by identifying future acoustic units from a set of negative examples. In phone discrimination tests, our models achieve the best results on the ZeroSpeech 2020 Challenge. Next, we investigate how several self-supervised methods organize speaker and phonetic information. We find that the per-utterance average of the features captures speaker identity. Based on this analysis, we describe a normalization step to remove speaker details. This simple addition improves acoustic unit discovery and spoken-language modeling on the ZeroSpeech 2021 Challenge.
Finally, we introduce two methods to divide speech into longer, phone-like segments: a greedy merging algorithm and a dynamic programming method that solves a dual problem with a duration penalty. We examine the trade-off between compression and performance, demonstrating competitive results at substantially lower bitrates than previous approaches. In Part II of this thesis, we test these improvements on downstream zero-resource tasks. In particular, we build on the discovered acoustic units to tackle spoken-term discovery, unsupervised voice conversion, and rhythm modeling. For spoken-term discovery, we find recurring words and phrases by applying pattern matching to acoustic units. On the ZeroSpeech 2017 Challenge, we outperform previous methods based on dynamic time warping. To improve voice conversion, we propose soft speech units, which represent a distribution over discrete assignments. By modeling uncertainty, soft units retain more content information, improving the naturalness and intelligibility of converted speech. Finally, we model the duration of acoustic units to approximate speaking rate and rhythm. By combining rhythm modeling and voice conversion, we improve speaker similarity and overall quality. Our results show that acoustic unit discovery has wide-ranging benefits for zero-resource speech processing. AFRIKAANSE OPSOMMING: Geen opsomming beskikbaar. 2025-05-02T06:47:18Z 2025-05-02T06:47:18Z 2024-12 Thesis https://scholar.sun.ac.za/handle/10019.1/131957 Stellenbosch University xvi, 120 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle	Automatic speech recognition Phonetics, Acoustic Computer-assisted instruction UCTD Van Niekerk, Benjamin Lipman Acoustic unit discovery with zero-resource applications
title	Acoustic unit discovery with zero-resource applications
title_full	Acoustic unit discovery with zero-resource applications
title_fullStr	Acoustic unit discovery with zero-resource applications
title_full_unstemmed	Acoustic unit discovery with zero-resource applications
title_short	Acoustic unit discovery with zero-resource applications
title_sort	acoustic unit discovery with zero resource applications
topic	Automatic speech recognition Phonetics, Acoustic Computer-assisted instruction UCTD
url	https://scholar.sun.ac.za/handle/10019.1/131957
work_keys_str_mv	AT vanniekerkbenjaminlipman acousticunitdiscoverywithzeroresourceapplications
