Full Text Available

Multimodal one-shot learning of speech and images

Thesis (MEng)--Stellenbosch University, 2020.

Bibliographic Details
Main Author:	Eloff, Ryan
Other Authors:	Kamper, M. J.
Format:	Thesis
Language:	English
Published:	Stellenbosch : Stellenbosch University 2020
Subjects:	Multimodal machine learning; One-shot learning; Computer vision; Speech synthesis; UCTD; Audio-visual machine learning

_version_	1867613782986981376
access_status_str	Open Access
author	Eloff, Ryan
author2	Kamper, M. J.
author_browse	Eloff, Ryan Kamper, M. J.
author_facet	Kamper, M. J. Eloff, Ryan
author_sort	Eloff, Ryan
collection	Thesis
dc_rights_str_mv	Stellenbosch University
description	Thesis (MEng)--Stellenbosch University, 2020.
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/108185
institution	Stellenbosch University (South Africa)
language	English
last_indexed	2026-06-10T12:41:37.777Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2020
publishDateRange	2020
publishDateSort	2020
publisher	Stellenbosch : Stellenbosch University
publisherStr	Stellenbosch : Stellenbosch University
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/108185 Multimodal one-shot learning of speech and images Eloff, Ryan Kamper, M. J. Engelbrecht, H. A. Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Multimodal machine learning One-shot learning Computer vision Speech synthesis UCTD Audio-visual machine learning Thesis (MEng)--Stellenbosch University, 2020. ENGLISH ABSTRACT: Humans learn to perform tasks such as language understanding and visual perception, remarkably, without any annotations and from limited amounts of weakly supervised co-occurring sensory information. Meanwhile, state-of-the-art machine learning models, which aim to challenge these human learning abilities, require large amounts of labelled training data to enable successful generalisation. Multimodal one-shot learning is an effort towards closing this gap on human intelligence, whereby we propose benchmark tasks for machine learning systems investigating whether they are capable of performing cross-modal matching from limited weakly supervised data. Specifically, we consider spoken word learning with co-occurring visual context in a one-shot setting, where an agent must learn novel concepts (words and object categories) from a single joint audio-visual example. In this thesis, we make the following contributions: (i) we propose and formalise multimodal one-shot learning of speech and images; (ii) we develop two cross-modal matching benchmark datasets for evaluation, the first containing spoken digits paired with handwritten digits, and the second containing complex natural images paired with spoken words; and (iii) we investigate a number of models within two frameworks, one extending unimodal models to the multimodal case, and the other learning joint audio-visual models. Finally, we show that jointly modelling spoken words paired with images enables a novel multimodal gradient update within a meta-learning algorithm for fast adaptation to novel concepts. 
This model outperforms our other approaches on our most difficult benchmark with a cross-modal matching accuracy of 40.3% for 10-way 5-shot learning. Although we show that there is room for significant improvement, the goal of this work is to encourage further development on this challenging task. We hope to achieve this by defining a standard problem setting with tasks which may be used to benchmark other approaches. AFRIKAANS SUMMARY: Humans have the remarkable ability to learn language and visual concepts without annotated training data by making use of weak supervision in the form of parallel sensory input. Meanwhile, the best supervised machine learning models require massive annotated datasets to generalise to new inputs. Multimodal one-shot machine learning is an attempt to bridge the gap between the abilities of machine learning models. Here we propose a number of standard tests to determine whether new machine learning systems are able to perform cross-modal matching from only a few examples with limited supervision. More specifically, we investigate how spoken words that occur with corresponding visual concepts can be learned jointly in a one-shot setting where a machine must learn new concepts (word and object categories) from a single joint audio-visual example. We make the following contributions: (i) we formalise multimodal one-shot machine learning from speech and images; (ii) we develop two datasets that serve as benchmarks for evaluating cross-modal matching: the first dataset consists of spoken digits with accompanying handwritten digits, and the second consists of more complex photos with isolated words; and (iii) we investigate several machine learning models in two settings: one where unimodal models are extended to the multimodal case, and the other where audio-visual models are trained jointly. 
Finally, we investigate the joint learning of spoken words with accompanying visual concepts using a meta-learning algorithm. This model performs best in our most difficult test setting, with a cross-modal matching accuracy of 40.3% for 10-way 5-shot learning. We hope that by formally defining this problem and making standard tests available, we will encourage further research in this new and challenging field. Masters 2020-02-24T12:36:37Z 2020-04-28T12:24:07Z 2020-02-24T12:36:37Z 2020-04-28T12:24:07Z 2020-04 Thesis http://hdl.handle.net/10019.1/108185 en Stellenbosch University xi, 81 leaves : illustrations (some color) application/pdf Stellenbosch : Stellenbosch University
spellingShingle	Multimodal machine learning One-shot learning Computer vision Speech synthesis UCTD Audio-visual machine learning Eloff, Ryan Multimodal one-shot learning of speech and images
title	Multimodal one-shot learning of speech and images
title_full	Multimodal one-shot learning of speech and images
title_fullStr	Multimodal one-shot learning of speech and images
title_full_unstemmed	Multimodal one-shot learning of speech and images
title_short	Multimodal one-shot learning of speech and images
title_sort	multimodal one shot learning of speech and images
topic	Multimodal machine learning; One-shot learning; Computer vision; Speech synthesis; UCTD; Audio-visual machine learning
url	http://hdl.handle.net/10019.1/108185
work_keys_str_mv	AT eloffryan multimodaloneshotlearningofspeechandimages

Similar Items