Acoustic unit discovery with zero-resource applications

Thesis (PhD)--Stellenbosch University, 2024.

Bibliographic Details
Main Author: Van Niekerk, Benjamin Lipman
Other Authors: Kamper, H.
Format: Thesis
Published: Stellenbosch : Stellenbosch University 2025
Subjects: Automatic speech recognition; Phonetics, Acoustic; Computer-assisted instruction; UCTD
_version_ 1867613901973094400
access_status_str Open Access
author Van Niekerk, Benjamin Lipman
author2 Kamper, H.
author_browse Kamper, H.
Van Niekerk, Benjamin Lipman
author_facet Kamper, H.
Van Niekerk, Benjamin Lipman
author_sort Van Niekerk, Benjamin Lipman
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (PhD)--Stellenbosch University, 2024.
format Thesis
id oai:scholar.sun.ac.za:10019.1/131957
institution Stellenbosch University (South Africa)
last_indexed 2026-06-10T12:43:30.888Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2025
publishDateRange 2025
publishDateSort 2025
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/131957 Acoustic unit discovery with zero-resource applications Van Niekerk, Benjamin Lipman Kamper, H. Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Automatic speech recognition Phonetics, Acoustic Computer-assisted instruction UCTD Thesis (PhD)--Stellenbosch University, 2024. ENGLISH ABSTRACT: The goal of acoustic unit discovery is to learn discrete, phone-like representations of speech without labeled data. The main challenge is the enormous variability of spoken language. How do we identify phonetic categories despite differences in context, intonation, speaking rate, and speaker identity? Recent progress on this problem has centered around self-supervised learning. However, there is still a gap between phonetic categories and the discovered acoustic units. To bridge this divide, Part I of this thesis explores methods to improve speaker invariance, reduce bitrates, and increase unit consistency. First, we propose two lightweight models that use vector quantization to learn discrete representations of speech. The first is an autoencoder, which converts speech to a sequence of acoustic units before trying to reconstruct the input waveform. The second model combines vector quantization with contrastive predictive coding. The idea is to learn representations of speech by identifying future acoustic units from a set of negative examples. In phone discrimination tests, our models achieve the best results on the ZeroSpeech 2020 Challenge. Next, we investigate how several self-supervised methods organize speaker and phonetic information. We find that the per-utterance average of the features captures speaker identity. Based on this analysis, we describe a normalization step to remove speaker details. This simple addition improves acoustic unit discovery and spoken-language modeling on the ZeroSpeech 2021 Challenge.
Finally, we introduce two methods to divide speech into longer, phone-like segments: a greedy merging algorithm and a dynamic programming method that solves a dual problem with a duration penalty. We examine the trade-off between compression and performance, demonstrating competitive results at substantially lower bitrates than previous approaches. In Part II of this thesis, we test these improvements on downstream zero-resource tasks. In particular, we build on the discovered acoustic units to tackle spoken-term discovery, unsupervised voice conversion, and rhythm modeling. For spoken-term discovery, we find recurring words and phrases by applying pattern matching to acoustic units. On the ZeroSpeech 2017 Challenge, we outperform previous methods based on dynamic time warping. To improve voice conversion, we propose soft speech units, which represent a distribution over discrete assignments. By modeling uncertainty, soft units retain more content information, improving the naturalness and intelligibility of converted speech. Finally, we model the duration of acoustic units to approximate speaking rate and rhythm. By combining rhythm modeling and voice conversion, we improve speaker similarity and overall quality. Our results show that acoustic unit discovery has wide-ranging benefits for zero-resource speech processing. AFRIKAANS SUMMARY: No summary available. 2025-05-02T06:47:18Z 2025-05-02T06:47:18Z 2024-12 Thesis https://scholar.sun.ac.za/handle/10019.1/131957 Stellenbosch University xvi, 120 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle Automatic speech recognition
Phonetics, Acoustic
Computer-assisted instruction
UCTD
Van Niekerk, Benjamin Lipman
Acoustic unit discovery with zero-resource applications
title Acoustic unit discovery with zero-resource applications
title_full Acoustic unit discovery with zero-resource applications
title_fullStr Acoustic unit discovery with zero-resource applications
title_full_unstemmed Acoustic unit discovery with zero-resource applications
title_short Acoustic unit discovery with zero-resource applications
title_sort acoustic unit discovery with zero resource applications
topic Automatic speech recognition
Phonetics, Acoustic
Computer-assisted instruction
UCTD
url https://scholar.sun.ac.za/handle/10019.1/131957
work_keys_str_mv AT vanniekerkbenjaminlipman acousticunitdiscoverywithzeroresourceapplications