Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language

Generating high-quality text in under-resourced and morphologically complex languages like isiZulu is vital for natural language processing advancements, yet such languages remain underexplored. Addressing this challenge could improve text generation performance and enable broader applications. This...

Bibliographic Details
Main Author: Pedlar, Victoria
Other Authors: Britz, Stefan
Format: Thesis
Language: English
Published: Department of Statistical Sciences 2026
Subjects: Statistical Sciences; isiZulu; AWD-LSTM; Transformer with NLL Loss
_version_ 1867613307107540992
access_status_str Open Access
author Pedlar, Victoria
author2 Britz, Stefan
author_browse Britz, Stefan
Pedlar, Victoria
author_facet Britz, Stefan
Pedlar, Victoria
author_sort Pedlar, Victoria
collection Thesis
description Generating high-quality text in under-resourced and morphologically complex languages like isiZulu is vital for natural language processing advancements, yet such languages remain underexplored. Addressing this challenge could improve text generation performance and enable broader applications. This study aims to investigate and evaluate various text generation techniques for isiZulu while addressing the challenges that come with it. Three models (AWD-LSTM, Transformer with NLL Loss, and Transformer with Entmax Loss) were assessed using decoding strategies like greedy decoding, beam search, nucleus sampling, Top-k sampling, temperature sampling, and α-Entmax sampling. The evaluation involved ε-perplexity, BLEU, chrF++, CER, and Distinct-2 metrics. The AWD-LSTM model achieved optimal performance with temperature sampling at t = 0.7, while the Transformer with NLL Loss excelled using nucleus sampling at p = 0.90. The Transformer with Entmax Loss, a novel sparse language model, reached maximum diversity with α-Entmax sampling at α = 1.2. The Entmax-based sparse language model demonstrates potential in effectively handling the challenges posed by languages like isiZulu, offering a potential alternative to softmax for enhancing text generation performance. This study's insights could inform future research on developing more effective and diverse text generation techniques for isiZulu and other morphologically rich, low-resource languages.
format Thesis
id oai:open.uct.ac.za:11427/43141
institution University of Cape Town (South Africa)
language English
eng
last_indexed 2026-06-10T12:34:03.682Z
license_str Not specified — see source repository
provenance_str_mv Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate 2026
publishDateRange 2026
publishDateSort 2026
publisher Department of Statistical Sciences
publisherStr Department of Statistical Sciences
record_format dspace
source_str UCTD — University of Cape Town Open Access Repository
spelling oai:open.uct.ac.za:11427/43141 Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language Pedlar, Victoria Britz, Stefan Buys, Jan Statistical Sciences isiZulu AWD-LSTM Transformer with NLL Loss Generating high-quality text in under-resourced and morphologically complex languages like isiZulu is vital for natural language processing advancements, yet such languages remain underexplored. Addressing this challenge could improve text generation performance and enable broader applications. This study aims to investigate and evaluate various text generation techniques for isiZulu while addressing the challenges that come with it. Three models (AWD-LSTM, Transformer with NLL Loss, and Transformer with Entmax Loss) were assessed using decoding strategies like greedy decoding, beam search, nucleus sampling, Top-k sampling, temperature sampling, and α-Entmax sampling. The evaluation involved ε-perplexity, BLEU, chrF++, CER, and Distinct-2 metrics. The AWD-LSTM model achieved optimal performance with temperature sampling at t = 0.7, while the Transformer with NLL Loss excelled using nucleus sampling at p = 0.90. The Transformer with Entmax Loss, a novel sparse language model, reached maximum diversity with α-Entmax sampling at α = 1.2. The Entmax-based sparse language model demonstrates potential in effectively handling the challenges posed by languages like isiZulu, offering a potential alternative to softmax for enhancing text generation performance. This study's insights could inform future research on developing more effective and diverse text generation techniques for isiZulu and other morphologically rich, low-resource languages. 2026-04-28T11:31:14Z 2026-04-28T11:31:14Z 2023 2026-04-28T11:21:45Z Thesis / Dissertation Masters Masters http://hdl.handle.net/11427/43141 en eng application/pdf Department of Statistical Sciences Faculty of Science University of Cape Town
spellingShingle Statistical Sciences
isiZulu
AWD-LSTM
Transformer with NLL Loss
Pedlar, Victoria
Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language
thesis_degree_str Master's
title Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language
title_full Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language
title_fullStr Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language
title_full_unstemmed Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language
title_short Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language
title_sort open ended text generation in isizulu decoding strategies for a morphologically rich low resource language
topic Statistical Sciences
isiZulu
AWD-LSTM
Transformer with NLL Loss
url http://hdl.handle.net/11427/43141
work_keys_str_mv AT pedlarvictoria openendedtextgenerationinisizuludecodingstrategiesforamorphologicallyrichlowresourcelanguage
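
Illustrative note: the description above names temperature, Top-k, and nucleus sampling as decoding strategies. The following is a minimal sketch of how those three strategies are commonly defined, not the thesis's implementation. The function names, the toy logits, and the use of plain Python lists are assumptions made for illustration; only the settings t = 0.7 and p = 0.90 come from the record.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalise their mass."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


def sample(dist, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token logits over a four-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0]

# Temperature sampling at t = 0.7 (the setting reported for the AWD-LSTM).
temperature_dist = dict(enumerate(softmax(toy_logits, temperature=0.7)))

# Nucleus sampling at p = 0.90 (the setting reported for the NLL Transformer).
nucleus_dist = nucleus_filter(softmax(toy_logits), p=0.90)

next_token = sample(nucleus_dist)
```

In this sketch each strategy only reshapes or truncates the next-token distribution before a single draw; a full decoder would repeat the step token by token, feeding each sampled token back to the model.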