Full Text Available

BantuBERTa : using language family grouping in multilingual language modeling for Bantu languages

Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2023.

Bibliographic Details
Other Authors:	Marivate, Vukosi
Format:	Thesis
Language:	English
Published:	University of Pretoria 2023
Subjects:	UCTD; Multilingual language modeling; BantuBERTa; Bantu Languages

_version_	1867613441450049536
access_status_str	Open Access
author2	Marivate, Vukosi
author_browse	Marivate, Vukosi
author_facet	Marivate, Vukosi
collection	Thesis
dc_rights_str_mv	© 2021 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
description	Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2023.
format	Thesis
id	oai:repository.up.ac.za:2263/92766
institution	University of Pretoria (South Africa)
language	English
last_indexed	2026-06-10T12:36:12.012Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from UPSpace — University of Pretoria Institutional Repository
publishDate	2023
publishDateRange	2023
publishDateSort	2023
publisher	University of Pretoria
publisherStr	University of Pretoria
record_format	dspace
source_str	UPSpace — University of Pretoria Institutional Repository
spelling	oai:repository.up.ac.za:2263/92766 BantuBERTa : using language family grouping in multilingual language modeling for Bantu languages Marivate, Vukosi Akinyi, Verrah Parvess, Jesse UCTD Multilingual language modeling BantuBERTa Bantu Languages Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2023. This work investigated whether a multilingual Bantu pretraining corpus could be created from freely available data. The dataset was built from Bantu text extracted from datasets that are freely available online (mainly from Hugging Face). The resulting multilingual language model (BantuBERTa), pretrained on this data, proved to be predictive across multiple Bantu languages on a higher-order NLP task (NER) and on a simpler NLP task (classification). This shows that the dataset can be used for Bantu multilingual pretraining and transfer to multiple Bantu languages. Additionally, this work investigated whether using this Bantu dataset could benefit transfer learning in downstream NLP tasks. BantuBERTa underperformed relative to other models (XLM-R, mBERT, and AfriBERTa) benchmarked on MasakhaNER’s Bantu language tests (Swahili, Luganda, and Kinyarwanda). It did, however, produce state-of-the-art results for the Bantu language benchmarks (Zulu and Lingala) in the African News Topic Classification dataset. It was surmised that the pretraining dataset size (which was 30% smaller than AfriBERTa’s) and dataset quality were the main causes of the poor performance in the NER test. We believe this is a case-specific failure due to poor data quality resulting from a pretraining dataset consisting mainly of web-scraped pages: the resulting dataset consisted mainly of mC4 and CC100 Bantu text. However, on lower-order NLP tasks, like classification, pretraining on languages solely within the language family seemed to benefit transfer to other similar languages within the family.
This potentially opens a method for effectively including low-resource languages in low-level NLP tasks. Computer Science MIT (Big Data Science) Unrestricted 2023-10-09T08:00:41Z 2023-10-09T08:00:41Z 2023-04 2023 Mini Dissertation * A2023 http://hdl.handle.net/2263/92766 en © 2021 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. application/pdf University of Pretoria
spellingShingle	UCTD Multilingual language modeling BantuBERTa Bantu Languages BantuBERTa : using language family grouping in multilingual language modeling for Bantu languages
title	BantuBERTa : using language family grouping in multilingual language modeling for Bantu languages
title_full	BantuBERTa : using language family grouping in multilingual language modeling for Bantu languages
title_fullStr	BantuBERTa : using language family grouping in multilingual language modeling for Bantu languages
title_full_unstemmed	BantuBERTa : using language family grouping in multilingual language modeling for Bantu languages
title_short	BantuBERTa : using language family grouping in multilingual language modeling for Bantu languages
title_sort	bantuberta using language family grouping in multilingual language modeling for bantu languages
topic	UCTD Multilingual language modeling BantuBERTa Bantu Languages
url	http://hdl.handle.net/2263/92766

Similar Items