South African isiZulu and siSwati news corpus creation, annotation and categorisation

Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2022.

Bibliographic Details
Other Authors: Marivate, Vukosi
Format: Thesis
Language: English
Published: University of Pretoria 2023
_version_ 1867613688815419392
access_status_str Open Access
author2 Marivate, Vukosi
collection Thesis
dc_rights_str_mv © 2021 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
format Thesis
id oai:repository.up.ac.za:2263/92767
institution University of Pretoria (South Africa)
language English
last_indexed 2026-06-10T12:40:07.894Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from UPSpace — University of Pretoria Institutional Repository
publishDate 2023
publisher University of Pretoria
record_format dspace
source_str UPSpace — University of Pretoria Institutional Repository
spelling oai:repository.up.ac.za:2263/92767 South African isiZulu and siSwati news corpus creation, annotation and categorisation Marivate, Vukosi u18114564@tuks.co.za Adendorff, M. Madodonga, Andani UCTD South African Local Languages Low Resources Languages Data Augmentation Topic Classification Logistic regression Engineering, built environment and information technology theses SDG-04 Engineering, built environment and information technology theses SDG-09 Engineering, built environment and information technology theses SDG-10 Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2022. South Africa has eleven official languages, nine of which are local low-resourced languages. It is therefore essential to build resources for these languages so that they can benefit from advances in natural language processing. This project created annotated datasets for isiZulu and siSwati for news topic classification and presents the findings from baseline classification models. Because data for these local South African languages is scarce, the datasets were augmented and oversampled to increase their size and to overcome class imbalance. Four classification models were used: Logistic regression, Naive Bayes, XGBoost and LSTM. These models were trained on three text representations: Count vectorizer, TF-IDF vectorizer and word2vec. The results showed that XGBoost, Logistic regression and LSTM trained on word2vec performed better than the other combinations. bs2026 Computer Science MIT (Big Data Science) Unrestricted SDG-04: Quality education SDG-09: Industry, innovation and infrastructure SDG-10: Reduced inequalities 2023-10-09T08:01:33Z 2023-10-09T08:01:33Z 2023-04 2022 Mini Dissertation * A2023 http://hdl.handle.net/2263/92767 en © 2021 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. application/pdf University of Pretoria
title South African isiZulu and siSwati news corpus creation, annotation and categorisation
topic UCTD
South African Local Languages
Low Resources Languages
Data Augmentation
Topic Classification
Logistic regression
Engineering, built environment and information technology theses SDG-04
Engineering, built environment and information technology theses SDG-09
Engineering, built environment and information technology theses SDG-10
url http://hdl.handle.net/2263/92767
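
The abstract in this record mentions oversampling to counter class imbalance and a count-vectorizer representation of the news text. The sketch below illustrates those two ideas using only the Python standard library. It is not the thesis's code: the record does not say which oversampling or augmentation method was used, so plain random duplication of minority-class examples is assumed here, and the function names `oversample` and `count_vectorize`, the placeholder tokens, and the topic labels are all hypothetical.

```python
import random
from collections import Counter


def oversample(texts, labels, seed=0):
    """Randomly duplicate minority-class texts until every class
    has as many examples as the largest class (an assumed method)."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra copies at random from the class's own examples.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def count_vectorize(texts):
    """Build a sorted vocabulary and one term-count vector per text."""
    vocab = sorted({tok for text in texts for tok in text.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for text in texts:
        row = [0] * len(vocab)
        for tok, n in Counter(text.lower().split()).items():
            row[index[tok]] = n
        vectors.append(row)
    return vocab, vectors


# Placeholder tokens and labels, not drawn from the thesis datasets.
texts = ["w1 w2 w3", "w2 w4", "w5 w1 w1", "w6 w7"]
labels = ["politics", "politics", "politics", "sport"]

balanced_texts, balanced_labels = oversample(texts, labels)
vocab, vectors = count_vectorize(balanced_texts)
print(Counter(balanced_labels))
```

In practice, oversampling would typically be applied only to the training split, so that duplicated examples do not leak into the evaluation data. The resulting count vectors could then be passed to a classifier such as logistic regression.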