Visually grounded keyword detection and localisation for low-resource languages

Thesis (PhD)--Stellenbosch University, 2023.

Bibliographic Details
Main Author: Olaleye, Kayode
Other Authors: Kamper, Herman
Format: Thesis
Language: en_ZA
Published: Stellenbosch : Stellenbosch University 2023
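Subjects: Low-resource languages; Directional hearing; Yoruba language; Neural computers; Speech processing systems; Keyword searching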
_version_ 1867613895138476032
access_status_str Open Access
author Olaleye, Kayode
author2 Kamper, Herman
author_browse Kamper, Herman
Olaleye, Kayode
author_facet Kamper, Herman
Olaleye, Kayode
author_sort Olaleye, Kayode
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (PhD)--Stellenbosch University, 2023.
format Thesis
id oai:scholar.sun.ac.za:10019.1/127180
institution Stellenbosch University (South Africa)
language en_ZA
last_indexed 2026-06-10T12:43:24.214Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2023
publishDateRange 2023
publishDateSort 2023
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/127180 Visually grounded keyword detection and localisation for low-resource languages Olaleye, Kayode Kamper, Herman Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Visually Grounded Keyword Detection and Localisation for Low-Resource Languages Low-resource languages Directional hearing Yoruba language Neural computers Speech processing systems Keyword searching Thesis (PhD)--Stellenbosch University, 2023. ENGLISH ABSTRACT: Visually grounded speech (VGS) models are trained on images paired with unlabelled spoken captions. Such models could be used to build speech systems in settings where it is impossible to get transcribed data, e.g. for documenting unwritten languages. We investigate keyword localisation in speech—finding where in an utterance a given written keyword occurs—using VGS models trained in a real low-resource setting. Existing VGS studies fall short in two areas. Firstly, previous work has shown that VGS models can be used for tasks such as cross-modal retrieval, keyword detection and keyword spotting, but keyword localisation has not been explored. Secondly, most previous VGS studies use datasets where images are paired with speech in English (or another well-resourced language). English is therefore often used as a proxy for a low-resource language, making it difficult to accurately assess their performance in a real low-resource setting. Based on this, we address the following two overarching research questions: (i) Is keyword localisation possible with VGS models? (ii) In a real low-resource setting, can we do visually grounded keyword localisation cross-lingually? To address the first question, we augment and extend existing VGS models with the ability to not only detect, but also localise written keywords. 
For this research question, we constrain ourselves to the artificial low-resource setting where English VGS data is used, allowing us to compare and directly extend previous work. We use as starting point an existing methodology for training VGS models to detect keywords in speech: training images are tagged with soft textual labels using an existing offline image tagger, and these tags are then used as targets to train a speech network. I.e., the model receives a noisy target for whether words occur in an utterance, but not where or in which order. We extend this model using four localisation methods. Input masking masks the input signal at different locations and measures the difference in the output unit for a particular keyword. Attention localisation requires an attention layer that pools features over the temporal axis; we use the attention weights as localisation scores. Grad-CAM is a saliency-based method that can be applied to any convolutional neural network to determine which parts of the network input most contribute to a particular output decision. The score aggregation method uses a particular type of pooling so that the output score can be regarded as an aggregation of local scores; these can be used to select the most likely temporal location for a query keyword. In an oracle localisation test (where the model is told that a keyword is present in an utterance and then asked where it occurs), the masked-based localisation method achieves an accuracy of 57.0%, outperforming all the other approaches, with the attention-based method coming second with 46.0%. To tackle the second research question (cross-lingual keyword localisation in a real low-resource setting), we start by collecting and releasing a new VGS dataset. The Yorùbá Flickr Audio Caption Corpus (YFACC) dataset contains spoken captions for 6k Flickr images produced by a single speaker in Yorùbá: a real low-resource language spoken in Nigeria. 
Using this data, we consider the problem of cross-lingual keyword detection and localisation: given an English text query, we detect whether the query occurs in Yorùbá speech, and if it is detected, we localise where in the utterance the query occurs. To build this VGS system, images are automatically tagged with English visual labels serving as targets for an attention-based model that takes Yorùbá speech as input. Then we apply the attention-based localisation method to do cross-lingual keyword detection and localisation for the first time in a real low-resource setting. The cross-lingual model obtains a precision of 16.0% in actual keyword localisation which involves first detecting whether a keyword occurs before doing localisation. Although this result is modest when viewed in isolation, this is a model trained without any parallel English-Yorùbá data or any transcriptions. We find that the performance can be improved by initialising the cross-lingual model from a model pretrained on the English image–speech dataset, giving a result of 22.8%. In answering the two main research questions, we make the following concrete contributions: (1) We propose a new VGS model for keyword detection and keyword spotting using attention, and carry out a thorough comparison to existing VGS-based methods. (2) VGS models are extended with four localisation methods. (3) We present a detailed quantitative and qualitative analysis revealing the limits of the models above, showcasing their success and failure modes. We observe good localisation matches for some of the 67 keywords in the system’s vocabulary (black, pool, soccer, tree), while others are confused with semantically related words: ocean → surfer; ball → soccer; swimming → pool. (4) We release a new multimodal, multilingual dataset which enables VGS modelling in a real low-resource setting, resembling a language documentation scenario. 
The dataset extends the Flickr8k image–text dataset to include Yorùbá spoken captions. (5) We introduce a system for cross-lingual keyword detection and keyword localisation in a real low-resource setting using our new Yorùbá speech–image dataset. (6) We provide a comprehensive analysis of the cross-lingual VGS model. We observe that there are keywords with good performance, such as brown (búráùn; 100.0% precision), bike (kẹ̀kẹ́; 94.1%) and grass (koríko; 90.9%). But there are many others on which the model struggles due to poor visual grounding and confusion between semantically related concepts. In summary, we show that VGS models can be used for a limited form of keyword localisation in a real low-resource setting. We hope that our new dataset and new findings will stimulate more research in the use of VGS models for real low-resource languages. AFRIKAANSE OPSOMMING: Geen opsomming beskikbaar. Doctoral 2023-03-03T11:51:18Z 2023-05-18T07:08:27Z 2023-03-03T11:51:18Z 2023-05-18T07:08:27Z 2023-03 Thesis http://hdl.handle.net/10019.1/127180 en_ZA en_ZA Stellenbosch University xv, 108 pages : illustrations. application/pdf Stellenbosch : Stellenbosch University
spellingShingle Visually Grounded Keyword Detection and Localisation for Low-Resource Languages
Low-resource languages
Directional hearing
Yoruba language
Neural computers
Speech processing systems
Keyword searching
Olaleye, Kayode
Visually grounded keyword detection and localisation for low-resource languages
title Visually grounded keyword detection and localisation for low-resource languages
title_full Visually grounded keyword detection and localisation for low-resource languages
title_fullStr Visually grounded keyword detection and localisation for low-resource languages
title_full_unstemmed Visually grounded keyword detection and localisation for low-resource languages
title_short Visually grounded keyword detection and localisation for low-resource languages
title_sort visually grounded keyword detection and localisation for low resource languages
topic Visually Grounded Keyword Detection and Localisation for Low-Resource Languages
Low-resource languages
Directional hearing
Yoruba language
Neural computers
Speech processing systems
Keyword searching
url http://hdl.handle.net/10019.1/127180
work_keys_str_mv AT olaleyekayode visuallygroundedkeyworddetectionandlocalisationforlowresourcelanguages