Thesis (PhD)--Stellenbosch University, 2024.
| Main Author: | Van Niekerk, Benjamin Lipman |
|---|---|
| Other Authors: | Kamper, H. |
| Format: | Thesis |
| Published: | Stellenbosch : Stellenbosch University, 2025 |
| Subjects: | Automatic speech recognition; Phonetics, Acoustic; Computer-assisted instruction; UCTD |
| _version_ | 1867613901973094400 |
|---|---|
| access_status_str | Open Access |
| author | Van Niekerk, Benjamin Lipman |
| author2 | Kamper, H. |
| author_browse | Kamper, H. Van Niekerk, Benjamin Lipman |
| author_facet | Kamper, H. Van Niekerk, Benjamin Lipman |
| author_sort | Van Niekerk, Benjamin Lipman |
| collection | Thesis |
| dc_rights_str_mv | Stellenbosch University |
| description | Thesis (PhD)--Stellenbosch University, 2024. |
| format | Thesis |
| id | oai:scholar.sun.ac.za:10019.1/131957 |
| institution | Stellenbosch University (South Africa) |
| last_indexed | 2026-06-10T12:43:30.888Z |
| license_str | Other — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository |
| publishDate | 2025 |
| publishDateRange | 2025 |
| publishDateSort | 2025 |
| publisher | Stellenbosch : Stellenbosch University |
| publisherStr | Stellenbosch : Stellenbosch University |
| record_format | dspace |
| source_str | SUNScholar — Stellenbosch University Repository |
| spelling | oai:scholar.sun.ac.za:10019.1/131957 Acoustic unit discovery with zero-resource applications Van Niekerk, Benjamin Lipman Kamper, H. Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Automatic speech recognition Phonetics, Acoustic Computer-assisted instruction UCTD Thesis (PhD)--Stellenbosch University, 2024. ENGLISH ABSTRACT: The goal of acoustic unit discovery is to learn discrete, phone-like representations of speech without labeled data. The main challenge is the enormous variability of spoken language. How do we identify phonetic categories despite differences in context, intonation, speaking rate, and speaker identity? Recent progress on this problem has centered around self-supervised learning. However, there is still a gap between phonetic categories and the discovered acoustic units. To bridge this divide, Part I of this thesis explores methods to improve speaker invariance, reduce bitrates, and increase unit consistency. First, we propose two lightweight models that use vector quantization to learn discrete representations of speech. The first is an autoencoder, which converts speech to a sequence of acoustic units before trying to reconstruct the input waveform. The second model combines vector quantization with contrastive predictive coding. The idea is to learn representations of speech by identifying future acoustic units from a set of negative examples. In phone discrimination tests, our models achieve the best results on the ZeroSpeech 2020 Challenge. Next, we investigate how several self-supervised methods organize speaker and phonetic information. We find that the per-utterance average of the features captures speaker identity. Based on this analysis, we describe a normalization step to remove speaker details. This simple addition improves acoustic unit discovery and spoken-language modeling on the ZeroSpeech 2021 Challenge. Finally, we introduce two methods to divide speech into longer, phone-like segments: a greedy merging algorithm and a dynamic programming method that solves a dual problem with a duration penalty. We examine the trade-off between compression and performance, demonstrating competitive results at substantially lower bitrates than previous approaches. In Part II of this thesis, we test these improvements on downstream zero-resource tasks. In particular, we build on the discovered acoustic units to tackle spoken-term discovery, unsupervised voice conversion, and rhythm modeling. For spoken-term discovery, we find recurring words and phrases by applying pattern matching to acoustic units. On the ZeroSpeech 2017 Challenge, we outperform previous methods based on dynamic time warping. To improve voice conversion, we propose soft speech units, which represent a distribution over discrete assignments. By modeling uncertainty, soft units retain more content information, improving the naturalness and intelligibility of converted speech. Finally, we model the duration of acoustic units to approximate speaking rate and rhythm. By combining rhythm modeling and voice conversion, we improve speaker similarity and overall quality. Our results show that acoustic unit discovery has wide-ranging benefits for zero-resource speech processing. AFRIKAANS SUMMARY: No summary available. 2025-05-02T06:47:18Z 2025-05-02T06:47:18Z 2024-12 Thesis https://scholar.sun.ac.za/handle/10019.1/131957 Stellenbosch University xvi, 120 pages : illustrations application/pdf Stellenbosch : Stellenbosch University |
| spellingShingle | Automatic speech recognition Phonetics, Acoustic Computer-assisted instruction UCTD Van Niekerk, Benjamin Lipman Acoustic unit discovery with zero-resource applications |
| title | Acoustic unit discovery with zero-resource applications |
| title_full | Acoustic unit discovery with zero-resource applications |
| title_fullStr | Acoustic unit discovery with zero-resource applications |
| title_full_unstemmed | Acoustic unit discovery with zero-resource applications |
| title_short | Acoustic unit discovery with zero-resource applications |
| title_sort | acoustic unit discovery with zero resource applications |
| topic | Automatic speech recognition Phonetics, Acoustic Computer-assisted instruction UCTD |
| url | https://scholar.sun.ac.za/handle/10019.1/131957 |
| work_keys_str_mv | AT vanniekerkbenjaminlipman acousticunitdiscoverywithzeroresourceapplications |