Full Text Available

Low-resource image captioning

Thesis (MSc) -- Stellenbosch University, 2022.

Bibliographic Details
Main Author:	Du Plessis, Mikkel
Other Authors:	Brink, Willie
Format:	Thesis
Language:	en_ZA
Published:	Stellenbosch : Stellenbosch University 2022
Subjects:	Natural language processing (Computer science) Deep learning (Machine learning) Computer vision Imaging systems in architecture Architectural models Encoder-decoder architecture UCTD

_version_	1867613908604289024
access_status_str	Open Access
author	Du Plessis, Mikkel
author2	Brink, Willie
author_browse	Brink, Willie Du Plessis, Mikkel
author_facet	Brink, Willie Du Plessis, Mikkel
author_sort	Du Plessis, Mikkel
collection	Thesis
dc_rights_str_mv	Stellenbosch University
description	Thesis (MSc) -- Stellenbosch University, 2022.
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/126059
institution	Stellenbosch University (South Africa)
language	en_ZA
last_indexed	2026-06-10T12:43:37.288Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2022
publishDateRange	2022
publishDateSort	2022
publisher	Stellenbosch : Stellenbosch University
publisherStr	Stellenbosch : Stellenbosch University
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/126059 Low-resource image captioning Du Plessis, Mikkel Brink, Willie Stellenbosch University. Faculty of Science. Dept. of Applied Mathematics. Natural language processing (Computer science) Deep learning (Machine learning) Computer vision Imaging systems in architecture Architectural models Encoder-decoder architecture UCTD Thesis (MSc) -- Stellenbosch University, 2022. ENGLISH ABSTRACT: Image captioning combines computer vision and natural language processing, and aims to automatically generate a short natural language phrase that describes relationships between objects and context within a given image. As the field of deep learning evolves, several approaches have produced impressive models and generally follow an encoder-decoder architecture. An encoder is utilised for visual cues and a textual decoder to produce a final caption. This can create a challenging gap between visual and textual representations, and makes the training of image captioning models resource intensive. Consequently, recent image captioning models have relied on a steady increase of training set size, computing requirements and training times. This thesis explores the viability of two model architectures for the task of image captioning in a low-resource scenario. We focus specifically on models that can be trained on a single consumer-level GPU in under 5 hours, using only a few thousand images. Our first model is a conventional image captioning model with a pre-trained convolutional neural network as the encoder, followed by an attention mechanism, and an LSTM as the decoder. Our second model utilises a Transformer in the encoder and the decoder. Additionally, we propose three auxiliary techniques that aim to extract more information from images and training captions with only marginal computational overhead. 
Firstly, we address the typical sparseness in object and scene representation by taking advantage of top-down and bottom-up features, in order to present the decoder with richer visual information and context. Secondly, we suppress semantically unlikely caption candidates during the decoder’s beam search procedure through the inclusion of a language model. Thirdly, we enhance the expressiveness of the model by augmenting training captions with a paraphrase generator. We find that the Transformer-based architecture is superior under low-data circumstances. Through a combination of all proposed methods applied, we achieve state-of-the-art performance on the Flickr8k test set and surpass existing recurrent-based methods. To further validate the generalisability of our models, we train on small, randomly sampled subsets of the MS COCO dataset and achieve competitive test scores compared to existing models trained on the full dataset. AFRIKAANS OPSOMMING: Beeldonderskrifte kombineer rekenaarvisie en natuurlike taalverwerking, en is daarop gemik om outomaties ’n kort natuurlike taalfrase te genereer wat die verhoudings tussen voorwerpe en konteks binne ’n gegewe beeld beskryf. Met die groei van diepleer as ’n veld, lewer verskeie benaderings nou indrukwekkende modelle, en volg gewoonlik ’n enkodeerder-dekodeerder-argitektuur. ’n Enkodeerder word gebruik vir visuele kenmerke en ’n tekstuele dekodeerder om ’n finale onderskrif te produseer. Dit kan ’n uitdagende gaping tussen visuele en tekstuele voorstellings skep, wat die afrigting van beeldonderskrifte modelle hulpbron-intensief maak. Gevolglik het onlangse modelle staatgemaak op groot opleidingsstelle, rekenaarvereistes en opleidingstye. Hierdie tesis ondersoek die lewensvatbaarheid van twee modelargitekture vir die taak van beeldonderskrifte in ’n scenario met beperkte bronne. Ons fokus spesifiek op modelle wat in minder as 5 ure op ’n enkele verbruikervlak GPU opgelei kan word, met slegs ’n paar duisend beelde.
Ons eerste model is ’n konvensionele beeldonderskrifmodel met ’n vooraf-afgerigte konvolusionele neurale netwerk as die enkodeerder, gevolg deur ’n aandagmeganisme, en ’n LSTM as die dekodeerder. Ons tweede model gebruik ’n Transformator in die enkodeerder en die dekodeerder. Daarbenewens stel ons drie hulptegnieke voor wat daarop gemik is om bykomende inligting uit beelde en opleidingsonderskrifte te onttrek met slegs marginale berekeningskoste. Eerstens spreek ons die tipiese ylheid in voorwerp- en toneelvoorstelling aan deur voordeel te trek uit bo-na-onder en onder-na-bo-kenmerke, om die dekodeerder met ryker visuele inligting en konteks te voorsien. Tweedens onderdruk ons semanties onwaarskynlike onderskrifkandidate tydens die dekodeerder se straalsoek prosedure deur die insluiting van ’n taalmodel. Derdens verbeter ons die ekspressiwiteit van die model deur opleidingsonderskrifte aan te vul met ’n parafrasegenerator. Ons vind dat die Transformator-gebaseerde argitektuur beter vaar onder lae-data-omstandighede. Deur ’n kombinasie van alle voorgestelde metodes wat toegepas word, bereik ons die beste resultaat op die Flickr8k-toetsstel en oortref ons bestaande rekursie-gebaseerde metodes. Om die veralgemeenbaarheid van ons modelle verder te evalueer, rig ons hulle af op klein, ewekansige subversamelings van die MS COCO-datastel en behaal mededingende toetsresultate in vergelyking met bestaande modelle wat met die volle datastel opgelei is. Masters 2022-11-22T08:27:16Z 2023-01-16T12:47:57Z 2022-11-22T08:27:16Z 2023-01-16T12:47:57Z 2022-12 Thesis http://hdl.handle.net/10019.1/126059 en_ZA Stellenbosch University vi, 82 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle	Natural language processing (Computer science) Deep learning (Machine learning) Computer vision Imaging systems in architecture Architectural models Encoder-decoder architecture UCTD Du Plessis, Mikkel Low-resource image captioning
title	Low-resource image captioning
title_full	Low-resource image captioning
title_fullStr	Low-resource image captioning
title_full_unstemmed	Low-resource image captioning
title_short	Low-resource image captioning
title_sort	low resource image captioning
topic	Natural language processing (Computer science) Deep learning (Machine learning) Computer vision Imaging systems in architecture Architectural models Encoder-decoder architecture UCTD
url	http://hdl.handle.net/10019.1/126059
work_keys_str_mv	AT duplessismikkel lowresourceimagecaptioning

Similar Items