Spoken language modeling with discrete units

Visser, N. F. 2025. Spoken language modeling with discrete units. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/2d364ad0-2a7b-4a35-b6c6-bc0db74702ed

Bibliographic Details
Main Author: Visser, Nicolaas Floris
Other Authors: Kamper, Herman
Format: Thesis
Language: English
Published: Stellenbosch : Stellenbosch University 2025
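Subjects: Speech processing systems; Modeling languages (Computer science); Natural language processing (Computer science); UCTD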
_version_ 1867613800036827136
access_status_str Open Access
author Visser, Nicolaas Floris
author2 Kamper, Herman
collection Thesis
dc_rights_str_mv Stellenbosch University
description Visser, N. F. 2025. Spoken language modeling with discrete units. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/2d364ad0-2a7b-4a35-b6c6-bc0db74702ed
format Thesis
id oai:scholar.sun.ac.za:10019.1/132325
institution Stellenbosch University (South Africa)
language English
last_indexed 2026-06-10T12:41:53.663Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2025
publisher Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/132325 Spoken language modeling with discrete units Visser, Nicolaas Floris Kamper, Herman Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Speech processing systems Modeling languages (Computer science) Natural language processing (Computer science) UCTD Visser, N. F. 2025. Spoken language modeling with discrete units. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/2d364ad0-2a7b-4a35-b6c6-bc0db74702ed Thesis (MEng)--Stellenbosch University, 2025.
ENGLISH ABSTRACT: Spoken language modeling aims to understand language directly from raw audio. Without relying on text, it opens new possibilities for developing speech systems in languages lacking extensive text corpora or transcribed speech. Despite recent advancements driven by the Zero Resource Speech Challenge, a significant performance gap remains between speech-based language models and their text-based counterparts when trained with equivalent data.
This thesis addresses two primary goals to advance spoken language modeling using discrete units. Current speech units, obtained by clustering features from self-supervised learning models, are framewise units at around 20 ms intervals, whereas real phones have varying, typically longer, durations. Therefore, first, we investigate the role of duration-penalized units, which are vector-quantized representations with durations more aligned with phones. We assess the impact of these units on the performance and training efficiency of spoken language models. Our experiments demonstrate that while duration-penalized units reduce sequence lengths and enhance training efficiency, they do not yet significantly improve language modeling performance. This suggests that longer unit duration alone is insufficient to bridge the gap.
Second, we diagnose factors within discrete units contributing to the gap. Through systematic experiments simulating various speech properties – including phoneme class confusion and duration variability – we find that phoneme class confusion is the primary factor limiting performance. This confusion hinders language models from effectively capturing linguistic structures inherent in speech. We also find that duration modeling remains important for bridging the gap in representing longer-range patterns, such as syntax. To address phoneme confusion, we adapt the language model to use hierarchical tokens derived from unsupervised clustering. By providing additional phoneme family context, we improve the model’s ability to cope with variability in speech data, showing modest improvements in language modeling tasks.
Along the way, we develop a scalable version of duration-penalized dynamic programming by reformulating it as a weighted finite-state transducer, significantly reducing computational complexity. This enables efficient processing of large speech datasets and facilitates the modeling of long-range dependencies in spoken language models.
In summary, this thesis contributes to advancing spoken language modeling by identifying key limitations and proposing methods to address them. Our findings suggest that future research should focus on effectively handling phoneme confusion – potentially through the use of soft units and adapted language modeling architectures – to further close the performance gap between speech-based and text-based language models.
AFRIKAANS SUMMARY: Some language models aim to understand language using only audio recordings. Because they do not depend on text, such models can help build speech systems for languages that have few text resources available. Despite all the progress in this field, driven mainly by the Zero Resource Speech Challenge, there is still a large gap between the language ability of these audio-only models and that of text models.
This thesis addresses two primary goals. Speech is usually broken up into small, discrete units for processing. Current systems obtain these units from models trained to summarize speech in vectors at a constant interval such as 20 ms. These vectors are then discretized to form the speech units. Phones, the speech units that linguists use, are however much longer than current speech units. Therefore, we first investigate the role of larger units. Specifically, we look at duration-penalized speech units. These units have a longer duration, closer to the duration of phones. We investigate the impact of these larger units on models that learn language from audio alone. Our experiments show that larger speech units do not yet improve the language ability of the models, but that they do make the learning process more efficient. This finding indicates that merely lengthening the duration of the speech units is not enough and that other factors must be causing the speech models to underperform.
Second, we search for the specific cause of the performance gap between discrete speech units and phonemes. We identify properties of the discrete speech units that may hinder the learning process. We then artificially add these properties to pure phoneme units and observe their effect. We find that phoneme confusion in the units is the biggest culprit. The confusion prevents the language models from effectively learning the structures in the language. We also find that the duration of the units still plays an important role when the models must learn longer-term structures such as syntax. In the process, we develop a scalable version of duration-penalized dynamic programming by reformulating it as a finite-state automaton. This algorithm allows us to efficiently encode large speech datasets with speech units.
In conclusion, we contribute to the advancement of language models on audio by identifying the most important limitations in current models and by proposing methods to address them. Our findings indicate that future research should focus on effectively reducing phoneme confusion – possibly by using probabilistic units and adapting the architecture of language models accordingly. By focusing on this, we can hopefully bring the quality of these language models closer to that of text models.
Masters 2025-06-03T13:59:57Z 2025-06-03T13:59:57Z 2025-03 Thesis https://scholar.sun.ac.za/handle/10019.1/132325 en Stellenbosch University x, 101 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
title Spoken language modeling with discrete units
topic Speech processing systems
Modeling languages (Computer science)
Natural language processing (Computer science)
UCTD
url https://scholar.sun.ac.za/handle/10019.1/132325