Direct and indirect multimodal few-shot learning of speech and images

Thesis (MEng)--Stellenbosch University, 2020.

Bibliographic Details
Main Author: Nortje, Leanne
Other Authors: Kamper, Herman
Format: Thesis
Language: en_ZA
Published: Stellenbosch : Stellenbosch University, 2020
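Subjects: Few-shot; UCTD; Multimodal user interfaces (Computer systems); Speech-vision system; Transfer learning; Metric system -- Juvenile literature; Speech processing systems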
_version_ 1867613849237061632
access_status_str Open Access
author Nortje, Leanne
author2 Kamper, Herman
author_browse Kamper, Herman
Nortje, Leanne
author_facet Kamper, Herman
Nortje, Leanne
author_sort Nortje, Leanne
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MEng)--Stellenbosch University, 2020.
format Thesis
id oai:scholar.sun.ac.za:10019.1/109314
institution Stellenbosch University (South Africa)
language en_ZA
last_indexed 2026-06-10T12:42:40.195Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2020
publishDateRange 2020
publishDateSort 2020
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/109314 Direct and indirect multimodal few-shot learning of speech and images Nortje, Leanne Kamper, Herman Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Few-shot UCTD Multimodal user interfaces (Computer systems) Speech-vision system Transfer learning Metric system -- Juvenile literature Speech processing systems Thesis (MEng)--Stellenbosch University, 2020. ENGLISH ABSTRACT: Children have the ability to learn new words and corresponding visual objects from only a few word-object example pairs. This raises the question of whether we can find multimodal speech-vision systems which can learn as rapidly from only a few example pairs. Imagine an agent like a household robot is shown an image along with a spoken word describing the object in the image, e.g. teddy, monkey and dog. After observing a single paired example per class, it is shown a new set of unseen pictures, and asked to pick the “teddy”. This problem is referred to as multimodal one-shot matching. If more than one paired speech-image example is given per concept type, it is called multimodal few-shot matching. In both cases, the set of initial paired examples is referred to as the support set. This thesis makes two core contributions. Firstly, we compare unsupervised learning to transfer learning for an indirect multimodal few-shot matching approach on a dataset of paired isolated spoken and visual digits. Transfer learning (which was used in a previous study) involves training models on labelled background data not containing any of the few-shot classes; it is conceivable that children use previously gained knowledge to learn new concepts. It is also conceivable that prior to seeing the few-shot pairs, a household robot or child would be exposed to unlabelled in-domain data from its environment; we therefore consider unsupervised learning for this problem, which we are also the first to do.
In unsupervised learning, models are trained on unlabelled in-domain data. From all our experiments, we find that transfer learning outperforms unsupervised learning. Indirect models (which were used in our first contribution) consist of two separate unimodal networks with the support set acting as a pivot between the modalities. In contrast, a direct model would learn a single multimodal space in which representations from the two modalities can be directly compared. We propose two new direct multimodal networks: a multimodal triplet network (MTriplet) which combines two triplet losses, and a multimodal correspondence autoencoder (MCAE) which combines two correspondence autoencoders (CAEs). Both these models require paired speech-image examples for training. Since the support set is not sufficient for this purpose, we propose a new pair mining approach in which pairs are constructed automatically from unlabelled in-domain data using the support set as a pivot. This pair mining approach combines unsupervised and transfer learning, since we use transfer learned unimodal classifiers to extract representations for the unlabelled in-domain data. We show that these direct models consistently outperform the indirect models, with the MTriplet as the top performer. These direct few-shot models show potential towards finding systems that learn from little labelled data while being capable of rapidly connecting data from different modalities. AFRIKAANSE OPSOMMING (Afrikaans summary, translated): Children have the ability to learn new words and corresponding visual objects from only a few audiovisual example pairs. This raises the question of whether we can obtain multimodal audiovisual systems that can learn as quickly from a few example pairs.
Imagine that an agent such as a household robot is given an image together with a spoken word describing the object in the image, e.g. teddy bear, monkey and dog. After a single example pair per class has been observed, the agent is asked to pick the “teddy bear” from a new set of images. This problem is referred to as multimodal one-shot matching. If more than one audiovisual example pair is given for each concept type, it is called multimodal few-shot matching. In both cases we refer to the set of original example pairs as the support set. This thesis makes two core contributions. Firstly, we compare unsupervised learning with transfer learning for an indirect multimodal few-shot matching approach on a dataset of corresponding images and isolated spoken digits. Transfer learning (which was used in a previous study) involves training models on background data that does not contain any of the few-shot classes; this is motivated by the fact that children use previously acquired knowledge to learn new concepts. Before the household robot or child sees the few-shot pairs, it is also possible that he/she is exposed to in-domain data without annotations from the environment. We therefore consider unsupervised learning for the problem and are the first to do so. In unsupervised learning, models are trained on in-domain data without annotations. Based on all our experiments, we find that transfer learning performs better than unsupervised learning. Indirect models (which were used in our first contribution) consist of two separate unimodal networks, with the support set serving as a pivot between the modalities. Instead, a direct model learns a single multimodal space in which representations from two modalities can be compared directly.
We propose two new direct models: a multimodal triplet model (VMDrieling), which combines two triplet loss functions, and a multimodal correspondence autoencoder (VMOE), which combines two autoencoders (OEs). All the models require paired audiovisual examples during training. Since the support set is not sufficient for this, we propose a new mining scheme in which pairs are constructed automatically from in-domain data without annotations, with the support set used as a pivot. This mining scheme combines transfer learning and unsupervised learning, since we use unimodal classifiers trained with transfer learning to obtain representations for the in-domain data without annotations. We show that these direct models consistently perform better than the indirect models, with the VMDrieling as the best performer. These direct models show potential for finding systems that learn from little annotated data while at the same time being able to connect data from different modalities. Masters 2020-11-24T13:12:43Z 2021-01-31T19:44:19Z 2020-11-24T13:12:43Z 2021-01-31T19:44:19Z 2020-12 Thesis http://hdl.handle.net/10019.1/109314 en_ZA Stellenbosch University 138 pages application/pdf Stellenbosch : Stellenbosch University
spellingShingle Few-shot
UCTD
Multimodal user interfaces (Computer systems)
Speech-vision system
Transfer learning
Metric system -- Juvenile literature
Speech processing systems
Nortje, Leanne
Direct and indirect multimodal few-shot learning of speech and images
title Direct and indirect multimodal few-shot learning of speech and images
title_full Direct and indirect multimodal few-shot learning of speech and images
title_fullStr Direct and indirect multimodal few-shot learning of speech and images
title_full_unstemmed Direct and indirect multimodal few-shot learning of speech and images
title_short Direct and indirect multimodal few-shot learning of speech and images
title_sort direct and indirect multimodal few shot learning of speech and images
topic Few-shot
UCTD
Multimodal user interfaces (Computer systems)
Speech-vision system
Transfer learning
Metric system -- Juvenile literature
Speech processing systems
url http://hdl.handle.net/10019.1/109314
work_keys_str_mv AT nortjeleanne directandindirectmultimodalfewshotlearningofspeechandimages