Full Text Available

Direct and indirect multimodal few-shot learning of speech and images

Thesis (MEng)--Stellenbosch University, 2020.

Bibliographic Details
Main Author:	Nortje, Leanne
Other Authors:	Kamper, Herman
Format:	Thesis
Language:	en_ZA
Published:	Stellenbosch : Stellenbosch University 2020
Subjects:	Few-shot; UCTD; Multimodal user interfaces (Computer systems); Speech-vision system; Transfer learning; Metric system > Juvenile literature; Speech processing systems

_version_	1867613849237061632
access_status_str	Open Access
author	Nortje, Leanne
author2	Kamper, Herman
author_browse	Kamper, Herman Nortje, Leanne
author_facet	Kamper, Herman Nortje, Leanne
author_sort	Nortje, Leanne
collection	Thesis
dc_rights_str_mv	Stellenbosch University
description	Thesis (MEng)--Stellenbosch University, 2020.
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/109314
institution	Stellenbosch University (South Africa)
language	en_ZA
last_indexed	2026-06-10T12:42:40.195Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2020
publishDateRange	2020
publishDateSort	2020
publisher	Stellenbosch : Stellenbosch University
publisherStr	Stellenbosch : Stellenbosch University
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/109314 Direct and indirect multimodal few-shot learning of speech and images Nortje, Leanne Kamper, Herman Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Few-shot UCTD Multimodal user interfaces (Computer systems) Speech-vision system Transfer learning Metric system -- Juvenile literature Speech processing systems Thesis (MEng)--Stellenbosch University, 2020. ENGLISH ABSTRACT: Children have the ability to learn new words and corresponding visual objects from only a few word-object example pairs. This raises the question of whether we can find multimodal speech-vision systems which can learn as rapidly from only a few example pairs. Imagine an agent like a household robot is shown an image along with a spoken word describing the object in the image, e.g. teddy, monkey and dog. After observing a single paired example per class, it is shown a new set of unseen pictures, and asked to pick the “teddy”. This problem is referred to as multimodal one-shot matching. If more than one paired speech-image example is given per concept type, it is called multimodal few-shot matching. In both cases, the set of initial paired examples is referred to as the support set. This thesis makes two core contributions. Firstly, we compare unsupervised learning to transfer learning for an indirect multimodal few-shot matching approach on a dataset of paired isolated spoken and visual digits. Transfer learning (which was used in a previous study) involves training models on labelled background data not containing any of the few-shot classes; it is conceivable that children use previously gained knowledge to learn new concepts. It is also conceivable that prior to seeing the few-shot pairs, a household robot or child would be exposed to unlabelled in-domain data from its environment; we therefore consider unsupervised learning for this problem which we are also the first to do.
In unsupervised learning, models are trained on unlabelled in-domain data. From all our experiments, we find that transfer learning outperforms unsupervised learning. Indirect models (which were used in our first contribution) consist of two separate unimodal networks with the support set acting as a pivot between the modalities. In contrast, a direct model would learn a single multimodal space in which representations from the two modalities can be directly compared. We propose two new direct multimodal networks: a multimodal triplet network (MTriplet) which combines two triplet losses, and a multimodal correspondence autoencoder (MCAE) which combines two correspondence autoencoders (CAEs). Both these models require paired speech-image examples for training. Since the support set is not sufficient for this purpose, we propose a new pair mining approach in which pairs are constructed automatically from unlabelled in-domain data using the support set as a pivot. This pair mining approach combines unsupervised and transfer learning, since we use transfer learned unimodal classifiers to extract representations for the unlabelled in-domain data. We show that these direct models consistently outperform the indirect models, with the MTriplet as the top performer. These direct few-shot models show potential towards finding systems that learn from little labelled data while being capable of rapidly connecting data from different modalities.
AFRIKAANS SUMMARY: Children have the ability to learn new words and corresponding visual objects from only a few audiovisual example pairs. This raises the question of whether we can obtain multimodal audiovisual systems that learn as quickly from a few example pairs. Imagine that an agent such as a household robot is given an image together with a spoken word describing the object in the image, e.g. teddy bear, monkey and dog. After a single example pair per class has been observed, the agent is asked to pick the “teddy bear” from a new set of images. This problem is referred to as multimodal one-shot matching. If more than one audiovisual example pair is given for each concept type, it is called multimodal few-shot matching. In both cases we refer to the set of original example pairs as the support set. This thesis makes two core contributions. Firstly, we compare unsupervised learning with transfer learning for an indirect multimodal few-shot matching approach on a dataset of corresponding images and isolated spoken digits. Transfer learning (which was used in a previous study) involves training models on background data that does not contain any of the few-shot classes; this is motivated by the fact that children use previously acquired knowledge to learn new concepts. Before the household robot or child sees the few-shot pairs, it is also possible that he or she is exposed to unannotated in-domain data from the environment. We therefore consider unsupervised learning for the problem and are the first to do so. In unsupervised learning, models are trained on unannotated in-domain data. Based on all our experiments, we find that transfer learning performs better than unsupervised learning. Indirect models (which were used in our first contribution) consist of two separate unimodal networks, with the support set serving as a pivot between the modalities. Instead, a direct model learns a single multimodal space in which representations from two modalities can be compared directly. We propose two new direct models: a multimodal triplet model (VMDrieling) which combines two triplet cost functions, and a multimodal correspondence autoencoder (VMOE) which combines two autoencoders (OEs). All the models require paired audiovisual examples during training. Since the support set is not sufficient for this, we propose a new mining scheme in which pairs are constructed automatically from unannotated in-domain data, with the support set used as a pivot. This mining scheme combines transfer learning and unsupervised learning, since we use unimodal classifiers trained with transfer learning to obtain representations for unannotated in-domain data. We show that these direct models consistently perform better than the indirect models, with the VMDrieling as the best performer. These direct models show potential for finding systems that learn from little annotated data while at the same time being able to connect data from different modalities. Masters 2020-11-24T13:12:43Z 2021-01-31T19:44:19Z 2020-11-24T13:12:43Z 2021-01-31T19:44:19Z 2020-12 Thesis http://hdl.handle.net/10019.1/109314 en_ZA Stellenbosch University 138 pages application/pdf Stellenbosch : Stellenbosch University
spellingShingle	Few-shot UCTD Multimodal user interfaces (Computer systems) Speech-vision system Transfer learning Metric system -- Juvenile literature Speech processing systems Nortje, Leanne Direct and indirect multimodal few-shot learning of speech and images
title	Direct and indirect multimodal few-shot learning of speech and images
title_full	Direct and indirect multimodal few-shot learning of speech and images
title_fullStr	Direct and indirect multimodal few-shot learning of speech and images
title_full_unstemmed	Direct and indirect multimodal few-shot learning of speech and images
title_short	Direct and indirect multimodal few-shot learning of speech and images
title_sort	direct and indirect multimodal few shot learning of speech and images
topic	Few-shot UCTD Multimodal user interfaces (Computer systems) Speech-vision system Transfer learning Metric system -- Juvenile literature Speech processing systems
url	http://hdl.handle.net/10019.1/109314
work_keys_str_mv	AT nortjeleanne directandindirectmultimodalfewshotlearningofspeechandimages

Similar Items