Recurrent neural network language models in the context of under-resourced South African languages

Over the past five years neural network models have been successful across a range of computational linguistic tasks. However, these triumphs have been concentrated in languages with significant resources such as large datasets. Thus, many languages, which are commonly referred to as under-resourced...

Bibliographic Details
Main Author:	Scarcella, Alessandro
Other Authors:	Lacerda, Miguel
Format:	Thesis
Language:	English
Published:	Department of Statistical Sciences 2019
Subjects:	Statistics

_version_	1867613291898994688
access_status_str	Open Access
author	Scarcella, Alessandro
author2	Lacerda, Miguel
author_browse	Lacerda, Miguel Scarcella, Alessandro
author_facet	Lacerda, Miguel Scarcella, Alessandro
author_sort	Scarcella, Alessandro
collection	Thesis
description	Over the past five years, neural network models have been successful across a range of computational linguistic tasks. However, these triumphs have been concentrated in languages with significant resources, such as large datasets. Thus, many languages, which are commonly referred to as under-resourced languages, have received little attention and have yet to benefit from recent advances. This investigation aims to evaluate the implications of recent advances in neural network language modelling techniques for under-resourced South African languages. Rudimentary, single-layered recurrent neural networks (RNNs) were used to model four South African text corpora. The accuracy of these models was compared directly to that of legacy approaches. A suite of hybrid models was then tested. Across all four datasets, neural networks led to better-performing language models overall, either directly or as part of a hybrid model. A short examination of punctuation marks in text data revealed that performance metrics for language models are greatly overestimated when punctuation marks have not been excluded. The investigation concludes by appraising the sensitivity of RNN language models (RNNLMs) to dataset size by artificially constraining the datasets and evaluating the accuracy of the resulting models. It is recommended that future research in this domain be directed towards evaluating more sophisticated RNNLMs and measuring their impact on application-focused tasks such as speech recognition and machine translation.
format	Thesis
id	oai:open.uct.ac.za:11427/29431
institution	University of Cape Town (South Africa)
language	eng
last_indexed	2026-06-10T12:33:48.261Z
license_str	Not specified — see source repository
provenance_str_mv	Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate	2019
publishDateRange	2019
publishDateSort	2019
publisher	Department of Statistical Sciences
publisherStr	Department of Statistical Sciences
record_format	dspace
source_str	UCTD — University of Cape Town Open Access Repository
spelling	oai:open.uct.ac.za:11427/29431 Recurrent neural network language models in the context of under-resourced South African languages Scarcella, Alessandro Lacerda, Miguel Statistics 2019-02-08T13:55:47Z 2019-02-08T13:55:47Z 2018 2019-02-07T09:46:06Z Master Thesis Masters MSc http://hdl.handle.net/11427/29431 eng application/pdf Department of Statistical Sciences Faculty of Science University of Cape Town
spellingShingle	Statistics Scarcella, Alessandro Recurrent neural network language models in the context of under-resourced South African languages
thesis_degree_str	Master's
title	Recurrent neural network language models in the context of under-resourced South African languages
title_full	Recurrent neural network language models in the context of under-resourced South African languages
title_fullStr	Recurrent neural network language models in the context of under-resourced South African languages
title_full_unstemmed	Recurrent neural network language models in the context of under-resourced South African languages
title_short	Recurrent neural network language models in the context of under-resourced South African languages
title_sort	recurrent neural network language models in the context of under resourced south african languages
topic	Statistics
url	http://hdl.handle.net/11427/29431
work_keys_str_mv	AT scarcellaalessandro recurrentneuralnetworklanguagemodelsinthecontextofunderresourcedsouthafricanlanguages

Similar Items