Generating high-quality text in under-resourced and morphologically complex languages like isiZulu is vital for natural language processing advancements, yet such languages remain underexplored. Addressing this challenge could improve text generation performance and enable broader applications. This...
| Main Author: | Pedlar, Victoria |
|---|---|
| Other Authors: | Britz, Stefan |
| Format: | Thesis |
| Language: | English |
| Published: | Department of Statistical Sciences, 2026 |
| Subjects: | |
| _version_ | 1867613307107540992 |
|---|---|
| access_status_str | Open Access |
| author | Pedlar, Victoria |
| author2 | Britz, Stefan |
| author_browse | Britz, Stefan Pedlar, Victoria |
| author_facet | Britz, Stefan Pedlar, Victoria |
| author_sort | Pedlar, Victoria |
| collection | Thesis |
| description | Generating high-quality text in under-resourced and morphologically complex languages like isiZulu is vital for natural language processing advancements, yet such languages remain underexplored. Addressing this challenge could improve text generation performance and enable broader applications. This study aims to investigate and evaluate various text generation techniques for isiZulu while addressing the challenges that come with it. Three models (AWD-LSTM, Transformer with NLL Loss, and Transformer with Entmax Loss) were assessed using decoding strategies like greedy decoding, beam search, nucleus sampling, Top-k sampling, temperature sampling, and α-Entmax sampling. The evaluation involved ε-perplexity, BLEU, chrF++, CER, and Distinct-2 metrics. The AWD-LSTM model achieved optimal performance with temperature sampling at t = 0.7, while the Transformer with NLL Loss excelled using nucleus sampling at p = 0.90. The Transformer with Entmax Loss, a novel sparse language model, reached maximum diversity with α-Entmax sampling at α = 1.2. The Entmax-based sparse language model demonstrates potential in effectively handling the challenges posed by languages like isiZulu, offering a potential alternative to softmax for enhancing text generation performance. This study's insights could inform future research on developing more effective and diverse text generation techniques for isiZulu and other morphologically rich, low-resource languages. |
| format | Thesis |
| id | oai:open.uct.ac.za:11427/43141 |
| institution | University of Cape Town (South Africa) |
| language | English eng |
| last_indexed | 2026-06-10T12:34:03.682Z |
| license_str | Not specified — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository |
| publishDate | 2026 |
| publishDateRange | 2026 |
| publishDateSort | 2026 |
| publisher | Department of Statistical Sciences |
| publisherStr | Department of Statistical Sciences |
| record_format | dspace |
| source_str | UCTD — University of Cape Town Open Access Repository |
| spelling | oai:open.uct.ac.za:11427/43141 Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language Pedlar, Victoria Britz, Stefan Buys, Jan Statistical Sciences isiZulu AWD-LSTM Transformer with NLL Loss Generating high-quality text in under-resourced and morphologically complex languages like isiZulu is vital for natural language processing advancements, yet such languages remain underexplored. Addressing this challenge could improve text generation performance and enable broader applications. This study aims to investigate and evaluate various text generation techniques for isiZulu while addressing the challenges that come with it. Three models (AWD-LSTM, Transformer with NLL Loss, and Transformer with Entmax Loss) were assessed using decoding strategies like greedy decoding, beam search, nucleus sampling, Top-k sampling, temperature sampling, and α-Entmax sampling. The evaluation involved ε-perplexity, BLEU, chrF++, CER, and Distinct-2 metrics. The AWD-LSTM model achieved optimal performance with temperature sampling at t = 0.7, while the Transformer with NLL Loss excelled using nucleus sampling at p = 0.90. The Transformer with Entmax Loss, a novel sparse language model, reached maximum diversity with α-Entmax sampling at α = 1.2. The Entmax-based sparse language model demonstrates potential in effectively handling the challenges posed by languages like isiZulu, offering a potential alternative to softmax for enhancing text generation performance. This study's insights could inform future research on developing more effective and diverse text generation techniques for isiZulu and other morphologically rich, low-resource languages. 2026-04-28T11:31:14Z 2026-04-28T11:31:14Z 2023 2026-04-28T11:21:45Z Thesis / Dissertation Masters Masters http://hdl.handle.net/11427/43141 en eng application/pdf Department of Statistical Sciences Faculty of Science University of Cape Town |
| spellingShingle | Statistical Sciences isiZulu AWD-LSTM Transformer with NLL Loss Pedlar, Victoria Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language |
| thesis_degree_str | Master's |
| title | Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language |
| title_full | Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language |
| title_fullStr | Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language |
| title_full_unstemmed | Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language |
| title_short | Open-ended text generation in isiZulu: decoding strategies for a morphologically rich low-resource language |
| title_sort | open ended text generation in isizulu decoding strategies for a morphologically rich low resource language |
| topic | Statistical Sciences isiZulu AWD-LSTM Transformer with NLL Loss |
| url | http://hdl.handle.net/11427/43141 |
| work_keys_str_mv | AT pedlarvictoria openendedtextgenerationinisizuludecodingstrategiesforamorphologicallyrichlowresourcelanguage |
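
The description field in this record names several sampling-based decoding strategies (temperature, Top-k, and nucleus sampling) and the Distinct-2 diversity metric. As a rough illustration only, and not the thesis's own implementation, the Python sketch below shows how these are commonly defined over a toy logit vector. The function names, the toy logits, and the example token list are all assumptions introduced for this example; the values t = 0.7 and p = 0.90 are taken from the description.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, and renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def distinct_n(tokens, n=2):
    """Distinct-n: unique n-grams divided by total n-grams (0.0 if there are none)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
    rng = random.Random(0)
    print(sample(softmax(toy_logits, temperature=0.7), rng))   # temperature sampling
    print(sample(top_k_filter(softmax(toy_logits), 2), rng))   # Top-k sampling
    print(sample(nucleus_filter(softmax(toy_logits), 0.90), rng))  # nucleus sampling
    print(distinct_n(["a", "b", "a", "b"], n=2))
```

In this sketch each strategy is a filter applied to the same next-token distribution before sampling, which is one common way to compare them side by side. α-Entmax sampling is omitted because it needs a sparse transformation in place of softmax, which does not fit in a short stdlib-only example.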