Full Text Available

Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages

Dissertation (MSc (Computer Science))--University of Pretoria, 2024.

Bibliographic Details
Other Authors:	Marivate, Vukosi
Format:	Thesis
Language:	English
Published:	University of Pretoria 2024
Subjects:	Natural language processing (NLP) Low resourced languages Monolingual embeddings Cross-lingual embeddings Language technologies

_version_	1867613614167293952
access_status_str	Open Access
author2	Marivate, Vukosi
author_browse	Marivate, Vukosi
author_facet	Marivate, Vukosi
collection	Thesis
dc_rights_str_mv	© 2023 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
description	Dissertation (MSc (Computer Science))--University of Pretoria, 2024.
format	Thesis
id	oai:repository.up.ac.za:2263/99401
institution	University of Pretoria (South Africa)
language	English
last_indexed	2026-06-10T12:38:56.612Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from UPSpace — University of Pretoria Institutional Repository
publishDate	2024
publishDateRange	2024
publishDateSort	2024
publisher	University of Pretoria
publisherStr	University of Pretoria
record_format	dspace
source_str	UPSpace — University of Pretoria Institutional Repository
spelling	oai:repository.up.ac.za:2263/99401 Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages Marivate, Vukosi sindane.thapelo@tuks.co.za Modupe, Abiodun Sindane, Thapelo Andrew Natural language processing (NLP) Low resourced languages Monolingual embeddings Cross-lingual embeddings Language technologies Dissertation (MSc (Computer Science))--University of Pretoria, 2024. The world continues to witness increasingly complex technological, economic, and societal advancements at an accelerated pace in the space of Natural Language Processing (NLP) and Artificial Intelligence (AI). The availability of massive digital data in various forms such as language data, image data, and numeric data plays a profound role in supporting this upward trend. For example, the availability of tremendous volumes of data in English and other languages with high internet prevalence unlocks the ability to develop high-quality language technologies such as Generative AI systems, Question Answering systems, Translation systems, and other societally impactful technologies we see today. This new era unfolds a simple yet efficacious equation of the form (increased datasets = increased performance), in which performance grows in proportion to data. Despite the remarkable strides, a concerning consequence has emerged: a widening horizontal divide among globally spoken languages, one that highlights disparities in the benefits drawn from available language technologies across the 7000-plus spoken languages. Key impediments to addressing such disparities for the underserved languages include data availability, data benchmarking, scaling, internet prevalence, sustainable pipelines, coverage, and lack of expertise. In this work, we extensively scrutinize some of these concerns by first grounding our work in the context of South African languages.
South Africa has 12 official languages with varying states of resource prevalence, which provided a perfect case to demonstrate our proposed remedial approaches. To address benchmarking, we proposed standard datasets for all spoken languages; scaling is addressed by showcasing the use of bilingual lexicons, a resource with much higher linguistic coverage, to define various techniques that continuously improve our machine learning models; and coverage is demonstrated by accounting for all South African languages in the development of technologies. The main objective of this thesis is to investigate cross-lingual embeddings as cheaper interventions for enabling transfer capabilities in various machine learning models across various downstream tasks, in order to foster the development and accessibility of local technologies for low-resourced languages. Cross-lingual embeddings are intra-semantic and inter-translation equivalent representations between high-resourced and low-resourced languages. In this work, these cross-lingual embeddings demonstrated efficacy in tasks such as News Headlines Classification (NHC), Named Entity Recognition (NER), Part of Speech (POS) Tagging, and Machine Translation (MT), and showed great potential for the development of localized technologies. The investigations showed that training NLP models with cross-lingual embeddings enhances both transfer and learning-from-scratch capabilities compared to monolingual embedding training. This study also highlighted that increasing supervision signals, such as bilingual lexicons, when training cross-lingual embeddings improves their performance. Furthermore, our investigations indicated that no single cross-lingual model works well across all languages.
We were able to address four key performance points, and we hope the interventions proposed in this study will have a positive impact on the socio-economic status of South Africa and can be scaled to other contexts to empower societies and businesses. Mastercard Scholarship Foundation ABSA Chair For Data Science Computer Science MSc (Computer Science) Unrestricted Faculty of Engineering, Built Environment and Information Technology SDG-04: Quality education 2024-11-26T09:29:46Z 2024-11-26T09:29:46Z 2025-04-20 2024-09-10 Dissertation * http://hdl.handle.net/2263/99401 https://doi.org/10.25403/UPresearchdata.27002596 en © 2023 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. application/pdf University of Pretoria
spellingShingle	Natural language processing (NLP) Low resourced languages Monolingual embeddings Cross-lingual embeddings Language technologies Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages
title	Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages
title_full	Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages
title_fullStr	Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages
title_full_unstemmed	Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages
title_short	Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages
title_sort	harnessing cross lingual transfer learning techniques to facilitate interventions for low resourced languages
topic	Natural language processing (NLP) Low resourced languages Monolingual embeddings Cross-lingual embeddings Language technologies
url	http://hdl.handle.net/2263/99401 https://doi.org/10.25403/UPresearchdata.27002596

Similar Items