Automatic video captioning using spatiotemporal convolutions on temporally sampled frames

Thesis (MSc)--Stellenbosch University, 2020.

Bibliographic Details
Main Author:	Nyatsanga, Simbarashe Linval
Other Authors:	Brink, Willie
Format:	Thesis
Language:	en_ZA
Published:	Stellenbosch : Stellenbosch University. 2020
Subjects:	Machine learning; Video captioning; Neural networks (Computer science); Closed captioning; Convolutions (Mathematics); Motion detectors; Embeddings (Mathematics); UCTD

_version_	1867613966098759680
access_status_str	Open Access
author	Nyatsanga, Simbarashe Linval
author2	Brink, Willie
author_browse	Brink, Willie Nyatsanga, Simbarashe Linval
author_facet	Brink, Willie Nyatsanga, Simbarashe Linval
author_sort	Nyatsanga, Simbarashe Linval
collection	Thesis
dc_rights_str_mv	Stellenbosch University.
description	Thesis (MSc)--Stellenbosch University, 2020.
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/107805
institution	Stellenbosch University (South Africa)
language	en_ZA
last_indexed	2026-06-10T12:44:31.934Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2020
publishDateRange	2020
publishDateSort	2020
publisher	Stellenbosch : Stellenbosch University.
publisherStr	Stellenbosch : Stellenbosch University.
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/107805 Automatic video captioning using spatiotemporal convolutions on temporally sampled frames Nyatsanga, Simbarashe Linval Brink, Willie Stellenbosch University. Faculty of Science. Department of Mathematical Sciences (Applied Mathematics). Machine learning Video captioning Neural networks (Computer science) Closed captioning Convolutions (Mathematics) Motion detectors Embeddings (Mathematics) UCTD Thesis (MSc)--Stellenbosch University, 2020. ENGLISH ABSTRACT: Being able to concisely describe content in a video has tremendous potential to enable better categorisation, index-based search and fast content-based retrieval from large video databases. Automatic video captioning requires the simultaneous detection of local and global motion dynamics of objects, scenes and events, to summarise them into a single coherent natural language description. Given the size and complexity of video data, it is important to understand how much temporally coherent visual information is required to adequately describe the video. In order to understand the association between video frames and sentence descriptions, we carry out a systematic study to determine how the quality of generated captions changes with respect to densely or sparsely sampling video frames in the temporal dimension. We conduct a detailed literature review to better understand the background work in image and video captioning. We describe our methodology for building a video caption generator, which is based on deep neural networks called encoder-decoders. We then outline the implementation details of our video caption generator and our experimental setup. In our experimental setup, we explore the role of word embeddings for generating sensible captions with pretrained, jointly trained and finetuned embeddings. We train and evaluate our caption generator on the Microsoft Video Description (MSVD) dataset.
Using the standard caption generation evaluation metrics, namely BLEU, METEOR, CIDEr and ROUGE, our experimental results show that sparsely sampling video frames with either finetuned or jointly trained embeddings results in the best caption quality. Our results are promising in the sense that high quality videos with a large memory footprint could be categorised through a sensible description obtained through sampling a few frames. Finally, our method can be extended such that the sampling rate adapts according to the quality of the video. AFRIKAANSE OPSOMMING: Die vermoë om ’n video se inhoud bondig te beskryf, het geweldige potensiaal vir beter kategorisering, indeksgebaseerde soektogte, en vinnige inhoudgebaseerde ontrekking uit groot video databasisse. Die outomatiese generering van video-onderskrifte vereis die gelyktydige opsporing van lokale en globale bewegingsdinamika van voorwerpe, tonele en gebeure, om in ’n enkele, samehangende, natuurlike taalbeskrywing opgesom te word. Vanweë die grootte en kompleksiteit van video data is dit belangrik om te verstaan hoeveel tyd-samehangende visuele inligting nodig is om die video voldoende te beskryf. Ten einde die verband tussen video-rame en sinbeskrywings te verstaan, voer ons ’n sistematiese studie uit om te bepaal hoe die gehalte van gegenereerde onderskrifte verander soos video-rame digter of yler in die tyd-dimensie gemonster word. Ons voer ’n gedetailleerde literatuurstudie uit om bestaande werk in die generering van beeld- en video-onderskrifte beter te verstaan. Ons beskryf ons metodologie vir die bou van ’n video-onderskrifgenerator, wat gebaseer is op diep neurale netwerke wat enkodeerderdekodeerders genoem word. Ons gee dan ’n uiteensetting van die implementeringsbesonderhede van ons video-onderskrifgenerator en ons eksperimentele opstelling.
In ons eksperimentele opstelling ondersoek ons die rol van woordinbeddings vir die generering van sinvolle onderskrifte met vooraf-afgerigte, gesamentlik-afgerigte, en verfynde inbeddings. Ons onderskrifgenerator word afgerig en evalueer op die Microsoft Video Description (MSVD) datastel. Deur gebruik te maak van standaard evalueringsmaatstawwe, naamlik BLEU, METEOR, CIDEr en ROUGE, toon ons eksperimentele resultate dat yl gemonsterde video-rame, met verfynde of gesamentlik-afgerigte inbeddings, die beste onderskrifkwaliteit lewer. Ons resultate is belowend in die sin dat hoë gehalte video’s met groot geheue-vereistes gekategoriseer kan word, deur middel van sinvolle beskrywings vanaf enkele rame. Ons metode kan ook uitgebrei word deur die monstertempo aan te pas volgens die kwaliteit van die video. Masters 2020-02-03T10:54:01Z 2020-04-28T12:04:21Z 2020-02-03T10:54:01Z 2020-04-28T12:04:21Z 2020-03 Thesis http://hdl.handle.net/10019.1/107805 en_ZA Stellenbosch University. 104 pages : Illustrations application/pdf Stellenbosch : Stellenbosch University.
spellingShingle	Machine learning Video captioning Neural networks (Computer science) Closed captioning Convolutions (Mathematics) Motion detectors Embeddings (Mathematics) UCTD Nyatsanga, Simbarashe Linval Automatic video captioning using spatiotemporal convolutions on temporally sampled frames
title	Automatic video captioning using spatiotemporal convolutions on temporally sampled frames
title_full	Automatic video captioning using spatiotemporal convolutions on temporally sampled frames
title_fullStr	Automatic video captioning using spatiotemporal convolutions on temporally sampled frames
title_full_unstemmed	Automatic video captioning using spatiotemporal convolutions on temporally sampled frames
title_short	Automatic video captioning using spatiotemporal convolutions on temporally sampled frames
title_sort	automatic video captioning using spatiotemporal convolutions on temporally sampled frames
topic	Machine learning Video captioning Neural networks (Computer science) Closed captioning Convolutions (Mathematics) Motion detectors Embeddings (Mathematics) UCTD
url	http://hdl.handle.net/10019.1/107805
work_keys_str_mv	AT nyatsangasimbarashelinval automaticvideocaptioningusingspatiotemporalconvolutionsontemporallysampledframes

Similar Items