Low-resource image captioning

Thesis (MSc) -- Stellenbosch University, 2022.

Bibliographic Details
Main Author: Du Plessis, Mikkel
Other Authors: Brink, Willie
Format: Thesis
Language: en_ZA
Published: Stellenbosch : Stellenbosch University 2022
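Subjects: Natural language processing (Computer science); Deep learning (Machine learning); Computer vision; Imaging systems in architecture; Architectural models; Encoder-decoder architecture; UCTD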
_version_ 1867613908604289024
access_status_str Open Access
author Du Plessis, Mikkel
author2 Brink, Willie
author_browse Brink, Willie
Du Plessis, Mikkel
author_facet Brink, Willie
Du Plessis, Mikkel
author_sort Du Plessis, Mikkel
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MSc) -- Stellenbosch University, 2022.
format Thesis
id oai:scholar.sun.ac.za:10019.1/126059
institution Stellenbosch University (South Africa)
language en_ZA
last_indexed 2026-06-10T12:43:37.288Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2022
publishDateRange 2022
publishDateSort 2022
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/126059 Low-resource image captioning Du Plessis, Mikkel Brink, Willie Stellenbosch University. Faculty of Science. Dept. of Applied Mathematics. Natural language processing (Computer science) Deep learning (Machine learning) Computer vision Imaging systems in architecture Architectural models Encoder-decoder architecture UCTD Thesis (MSc) -- Stellenbosch University, 2022. ENGLISH ABSTRACT: Image captioning combines computer vision and natural language processing, and aims to automatically generate a short natural language phrase that describes relationships between objects and context within a given image. As the field of deep learning evolves, several approaches have produced impressive models and generally follow an encoder-decoder architecture. An encoder is utilised for visual cues and a textual decoder to produce a final caption. This can create a challenging gap between visual and textual representations, and makes the training of image captioning models resource intensive. Consequently, recent image captioning models have relied on a steady increase of training set size, computing requirements and training times. This thesis explores the viability of two model architectures for the task of image captioning in a low-resource scenario. We focus specifically on models that can be trained on a single consumer-level GPU in under 5 hours, using only a few thousand images. Our first model is a conventional image captioning model with a pre-trained convolutional neural network as the encoder, followed by an attention mechanism, and an LSTM as the decoder. Our second model utilises a Transformer in the encoder and the decoder. Additionally, we propose three auxiliary techniques that aim to extract more information from images and training captions with only marginal computational overhead. 
Firstly, we address the typical sparseness in object and scene representation by taking advantage of top-down and bottom-up features, in order to present the decoder with richer visual information and context. Secondly, we suppress semantically unlikely caption candidates during the decoder’s beam search procedure through the inclusion of a language model. Thirdly, we enhance the expressiveness of the model by augmenting training captions with a paraphrase generator. We find that the Transformer-based architecture is superior under low-data circumstances. Through a combination of all proposed methods applied, we achieve state-of-the-art performance on the Flickr8k test set and surpass existing recurrent-based methods. To further validate the generalisability of our models, we train on small, randomly sampled subsets of the MS COCO dataset and achieve competitive test scores compared to existing models trained on the full dataset. AFRIKAANS OPSOMMING: Beeldonderskrifte kombineer rekenaarvisie en natuurlike taalverwerking, en is daarop gemik om outomaties ’n kort natuurlike taalfrase te genereer wat die verhoudings tussen voorwerpe en konteks binne ’n gegewe beeld beskryf. Met die groei van diepleer as ’n veld, lewer verskeie benaderings nou indrukwekkende modelle, en volg gewoonlik ’n enkodeerder-dekodeerder-argitektuur. ’n Enkodeerder word gebruik vir visuele kenmerke en ’n tekstuele dekodeerder om ’n finale onderskrif te produseer. Dit kan ’n uitdagende gaping tussen visuele en tekstuele voorstellings skep, wat die afrigting van beeldonderskrifte modelle hulpbron-intensief maak. Gevolglik het onlangse modelle staatgemaak op groot opleidingsstelle, rekenaarvereistes en opleidingstye. Hierdie tesis ondersoek die lewensvatbaarheid van twee modelargitekture vir die taak van beeldonderskrifte in ’n scenario met beperkte bronne. Ons fokus spesifiek op modelle wat in minder as 5 ure op ’n enkele verbruikervlak GPU opgelei kan word, met slegs ’n paar duisend beelde.
Ons eerste model is ’n konvensionele beeldonderskrifmodel met ’n vooraf-afgerigte konvolusionele neurale netwerk as die enkodeerder, gevolg deur ’n aandagmeganisme, en ’n LSTM as die dekodeerder. Ons tweede model gebruik ’n Transformator in die enkodeerder en die dekodeerder. Daarbenewens stel ons drie hulptegnieke voor wat daarop gemik is om bykomende inligting uit beelde en opleidingsonderskrifte te onttrek met slegs marginale berekeningskoste. Eerstens spreek ons die tipiese ylheid in voorwerp- en toneelvoorstelling aan deur voordeel te trek uit bo-na-onder en onder-na-bo-kenmerke, om die dekodeerder met ryker visuele inligting en konteks te voorsien. Tweedens onderdruk ons semanties onwaarskynlike onderskrifkandidate tydens die dekodeerder se straalsoek prosedure deur die insluiting van ’n taalmodel. Derdens verbeter ons die ekspressiwiteit van die model deur opleidingsonderskrifte aan te vul met ’n parafrasegenerator. Ons vind dat die Transformator-gebaseerde argitektuur beter vaar onder lae-data-omstandighede. Deur ’n kombinasie van alle voorgestelde metodes wat toegepas word, bereik ons die beste resultaat op die Flickr8k-toetsstel en oortref ons bestaande rekursie-gebaseerde metodes. Om die veralgemeenbaarheid van ons modelle verder te evalueer, rig ons hulle af op klein, ewekansige subversamelings van die MS COCO-datastel en behaal mededingende toetsresultate in vergelyking met bestaande modelle wat met die volle datastel opgelei is. Masters 2022-11-22T08:27:16Z 2023-01-16T12:47:57Z 2022-11-22T08:27:16Z 2023-01-16T12:47:57Z 2022-12 Thesis http://hdl.handle.net/10019.1/126059 en_ZA Stellenbosch University vi, 82 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle Natural language processing (Computer science)
Deep learning (Machine learning)
Computer vision
Imaging systems in architecture
Architectural models
Encoder-decoder architecture
UCTD
Du Plessis, Mikkel
Low-resource image captioning
title Low-resource image captioning
title_full Low-resource image captioning
title_fullStr Low-resource image captioning
title_full_unstemmed Low-resource image captioning
title_short Low-resource image captioning
title_sort low resource image captioning
topic Natural language processing (Computer science)
Deep learning (Machine learning)
Computer vision
Imaging systems in architecture
Architectural models
Encoder-decoder architecture
UCTD
url http://hdl.handle.net/10019.1/126059
work_keys_str_mv AT duplessismikkel lowresourceimagecaptioning