Automatic video captioning using spatiotemporal convolutions on temporally sampled frames

Thesis (MSc)--Stellenbosch University, 2020.

Bibliographic Details
Main Author: Nyatsanga, Simbarashe Linval
Other Authors: Brink, Willie
Format: Thesis
Language: en_ZA
Published: Stellenbosch : Stellenbosch University. 2020
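Subjects: Machine learning; Video captioning; Neural networks (Computer science); Closed captioning; Convolutions (Mathematics); Motion detectors; Embeddings (Mathematics); UCTD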
_version_ 1867613966098759680
access_status_str Open Access
author Nyatsanga, Simbarashe Linval
author2 Brink, Willie
author_browse Brink, Willie
Nyatsanga, Simbarashe Linval
author_facet Brink, Willie
Nyatsanga, Simbarashe Linval
author_sort Nyatsanga, Simbarashe Linval
collection Thesis
dc_rights_str_mv Stellenbosch University.
description Thesis (MSc)--Stellenbosch University, 2020.
format Thesis
id oai:scholar.sun.ac.za:10019.1/107805
institution Stellenbosch University (South Africa)
language en_ZA
last_indexed 2026-06-10T12:44:31.934Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2020
publishDateRange 2020
publishDateSort 2020
publisher Stellenbosch : Stellenbosch University.
publisherStr Stellenbosch : Stellenbosch University.
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/107805 Automatic video captioning using spatiotemporal convolutions on temporally sampled frames Nyatsanga, Simbarashe Linval Brink, Willie Stellenbosch University. Faculty of Science. Department of Mathematical Sciences (Applied Mathematics). Machine learning Video captioning Neural networks (Computer science) Closed captioning Convolutions (Mathematics) Motion detectors Embeddings (Mathematics) UCTD Thesis (MSc)--Stellenbosch University, 2020. ENGLISH ABSTRACT: Being able to concisely describe content in a video has tremendous potential to enable better categorisation, index-based search and fast content-based retrieval from large video databases. Automatic video captioning requires the simultaneous detection of local and global motion dynamics of objects, scenes and events, to summarise them into a single coherent natural language description. Given the size and complexity of video data, it is important to understand how much temporally coherent visual information is required to adequately describe the video. In order to understand the association between video frames and sentence descriptions, we carry out a systematic study to determine how the quality of generated captions changes with respect to densely or sparsely sampling video frames in the temporal dimension. We conduct a detailed literature review to better understand the background work in image and video captioning. We describe our methodology for building a video caption generator, which is based on deep neural networks called encoder-decoders. We then outline the implementation details of our video caption generator and our experimental setup. In our experimental setup, we explore the role of word embeddings for generating sensible captions with pretrained, jointly trained and finetuned embeddings. We train and evaluate our caption generator on the Microsoft Video Description (MSVD) dataset. 
Using the standard caption generation evaluation metrics, namely BLEU, METEOR, CIDEr and ROUGE, our experimental results show that sparsely sampling video frames with either finetuned or jointly trained embeddings results in the best caption quality. Our results are promising in the sense that high-quality videos with a large memory footprint could be categorised through a sensible description obtained through sampling a few frames. Finally, our method can be extended such that the sampling rate adapts according to the quality of the video. AFRIKAANSE OPSOMMING: Die vermoë om ’n video se inhoud bondig te beskryf, het geweldige potensiaal vir beter kategorisering, indeksgebaseerde soektogte, en vinnige inhoudgebaseerde onttrekking uit groot video databasisse. Die outomatiese generering van video-onderskrifte vereis die gelyktydige opsporing van lokale en globale bewegingsdinamika van voorwerpe, tonele en gebeure, om in ’n enkele, samehangende, natuurlike taalbeskrywing opgesom te word. Vanweë die grootte en kompleksiteit van video data is dit belangrik om te verstaan hoeveel tyd-samehangende visuele inligting nodig is om die video voldoende te beskryf. Ten einde die verband tussen video-rame en sinbeskrywings te verstaan, voer ons ’n sistematiese studie uit om te bepaal hoe die gehalte van gegenereerde onderskrifte verander soos video-rame digter of yler in die tyd-dimensie gemonster word. Ons voer ’n gedetailleerde literatuurstudie uit om bestaande werk in die generering van beeld- en video-onderskrifte beter te verstaan. Ons beskryf ons metodologie vir die bou van ’n video-onderskrifgenerator, wat gebaseer is op diep neurale netwerke wat enkodeerder-dekodeerders genoem word. Ons gee dan ’n uiteensetting van die implementeringsbesonderhede van ons video-onderskrifgenerator en ons eksperimentele opstelling. 
In ons eksperimentele opstelling ondersoek ons die rol van woordinbeddings vir die generering van sinvolle onderskrifte met vooraf-afgerigte, gesamentlik-afgerigte, en verfynde inbeddings. Ons onderskrifgenerator word afgerig en geëvalueer op die Microsoft Video Description (MSVD) datastel. Deur gebruik te maak van standaard evalueringsmaatstawwe, naamlik BLEU, METEOR, CIDEr en ROUGE, toon ons eksperimentele resultate dat yl gemonsterde video-rame, met verfynde of gesamentlik-afgerigte inbeddings, die beste onderskrifkwaliteit lewer. Ons resultate is belowend in die sin dat hoë gehalte video’s met groot geheue-vereistes gekategoriseer kan word, deur middel van sinvolle beskrywings vanaf enkele rame. Ons metode kan ook uitgebrei word deur die monstertempo aan te pas volgens die kwaliteit van die video. Masters 2020-02-03T10:54:01Z 2020-04-28T12:04:21Z 2020-02-03T10:54:01Z 2020-04-28T12:04:21Z 2020-03 Thesis http://hdl.handle.net/10019.1/107805 en_ZA Stellenbosch University. 104 pages : Illustrations application/pdf Stellenbosch : Stellenbosch University.
spellingShingle Machine learning
Video captioning
Neural networks (Computer science)
Closed captioning
Convolutions (Mathematics)
Motion detectors
Embeddings (Mathematics)
UCTD
Nyatsanga, Simbarashe Linval
Automatic video captioning using spatiotemporal convolutions on temporally sampled frames
title Automatic video captioning using spatiotemporal convolutions on temporally sampled frames
title_full Automatic video captioning using spatiotemporal convolutions on temporally sampled frames
title_fullStr Automatic video captioning using spatiotemporal convolutions on temporally sampled frames
title_full_unstemmed Automatic video captioning using spatiotemporal convolutions on temporally sampled frames
title_short Automatic video captioning using spatiotemporal convolutions on temporally sampled frames
title_sort automatic video captioning using spatiotemporal convolutions on temporally sampled frames
topic Machine learning
Video captioning
Neural networks (Computer science)
Closed captioning
Convolutions (Mathematics)
Motion detectors
Embeddings (Mathematics)
UCTD
url http://hdl.handle.net/10019.1/107805
work_keys_str_mv AT nyatsangasimbarashelinval automaticvideocaptioningusingspatiotemporalconvolutionsontemporallysampledframes