Visually grounded speech models for low-resource languages and cognitive modelling

Thesis (PhD)--Stellenbosch University, 2024.

Bibliographic Details
Main Author: Nortje, Aletta Susanna Elizabeth
Other Authors: Kamper, Herman
Format: Thesis
Published: Stellenbosch : Stellenbosch University 2025
Subjects:
access_status_str Open Access
author Nortje, Aletta Susanna Elizabeth
author2 Kamper, Herman
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (PhD)--Stellenbosch University, 2024.
format Thesis
id oai:scholar.sun.ac.za:10019.1/131867
institution Stellenbosch University (South Africa)
last_indexed 2026-06-10T12:40:54.381Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2025
publisher Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/131867 Visually grounded speech models for low-resource languages and cognitive modelling Nortje, Aletta Susanna Elizabeth Kamper, Herman Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Speech processing systems Language acquisition -- Computer simulation Human information processing Cognition -- Computer simulation UCTD Thesis (PhD)--Stellenbosch University, 2024. ENGLISH ABSTRACT: Visually grounded speech (VGS) models learn from unlabelled speech paired with images. Such models can be valuable for developing speech applications for low-resource languages that lack transcribed data, and for understanding how humans acquire language, since children learn speech from multimodal cues. This dissertation makes contributions to both of these areas. In the first part of this dissertation, we consider two research questions about using VGS models in low-resource language applications. The first research question asks: can we get a VGS model that can detect and localise a keyword depicted by an image within speech? For this, we propose a new task called visually prompted keyword localisation (VPKL). Here, a keyword query depicted by an image should be detected in spoken utterances, and a detected query should be localised within the utterance. To do VPKL, we modify a common VGS modelling approach using an acoustic and a vision network connected with a multimodal attention mechanism. On an artificial low-resource language, English, we find that a model trained on pairs obtained from an ideal tagger outperforms a previous visual Bag-of-Words (BOW) model that locates written keywords in spoken utterances. Using an actual visual tagger instead results in lower scores than this written keyword baseline. To do VPKL for a real low-resource language, we first consider few-shot learning and then return to this problem. In the second research question, we ask if we can get a VGS model to learn words using only a few word-image pairs. 
We use an architecture similar to the VPKL model’s and combine it with a few-shot learning approach that can learn new classes from fewer natural word-image pairs. Using the few given word-image example pairs, we mine new unsupervised word-image training pairs from large unlabelled speech and image sets. Our approach outperforms an existing VGS few-shot model when the number of examples per class is small. We therefore apply this approach to an actual low-resource language, Yorùbá. The Yorùbá few-shot model outperforms its English variant. Building on this few-shot progress, we return to the VPKL approach and propose a simpler model similar to our previous VPKL model. Here we assume we have access to a dataset consisting of spoken utterances paired with descriptive images. To mine speech-image training pairs for a keyword, we use a few spoken word examples of the keyword and compare them to the utterances in the dataset’s speech-image pairs. On an English VPKL task, we find that this approach outperforms both our previous approach and the visual BOW model that detects textual keywords in speech. We therefore apply this approach to Yorùbá. Since the speech system in the pair mining scheme uses a model trained on English, the precision of the few-shot Yorùbá localisation model is low. However, the ground truth Yorùbá model outperforms the textual keyword localisation model applied to Yorùbá by large margins. In the second part of this dissertation, we ask two more research questions regarding the use of VGS models in computational cognitive studies. Our third research question considers whether a VGS model exhibits the mutual exclusivity (ME) bias, which is a word learning constraint used by children. This bias states that a novel word belongs to an unknown object instead of a familiar one. 
To test this, we use our few-shot object and word learning model and generate a speech-image dataset containing spoken English word and image examples for a set of familiar and novel classes. The model is trained on the word-image pairs for the familiar classes. The model is then prompted with novel English spoken words and asked whether the words belong to unknown or familiar objects. All variants of the model exhibit the ME bias. A model that uses both self-supervised audio and vision initialisations has the strongest ME bias. This makes sense from a cognitive perspective since children are exposed to spoken language and visual stimuli in their surroundings when they begin using the ME bias. Various cognitive ME studies have considered the effect that factors such as multilingualism have on the ME bias. Since this effect has not yet been studied computationally, our fourth research question asks how multilingualism affects the ME bias exhibited by our VGS model. We extend the English ME dataset’s training set to contain spoken Dutch and French words for the familiar classes. We train a trilingual English-Dutch-French model and two bilingual models: an English-Dutch model and an English-French model. These multilingual models are compared to the monolingual English model of the previous research question. We find that the monolingual model has a weaker ME bias than the multilingual models. This trend is opposite to the trend seen in children: monolingual children have a stronger ME bias than multilingual children. This study is preliminary and requires further investigation. In summary, we find that VGS models can be used to develop low-resource applications by using only a small set of ground truth examples. We also find that VGS models can be used to computationally study the ME bias observed in children. Further investigation is required into the effect of multilingualism on the bias in VGS models and into how it compares to the effect in children. 
We believe this dissertation demonstrates how valuable VGS models can be and will encourage research in this field to build inclusive speech technology and contribute to understanding human language learning. AFRIKAANS SUMMARY: No summary available. Doctoral 2025-04-04T07:29:34Z 2025-04-04T07:29:34Z 2024-12 Thesis https://scholar.sun.ac.za/handle/10019.1/131867 Stellenbosch University xiv, 133 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
title Visually grounded speech models for low-resource languages and cognitive modelling
topic Speech processing systems
Language acquisition -- Computer simulation
Human information processing
Cognition -- Computer simulation
UCTD
url https://scholar.sun.ac.za/handle/10019.1/131867