Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages

Dissertation (MSc (Computer Science))--University of Pretoria, 2024.

Bibliographic Details
Other Authors: Marivate, Vukosi
Format: Thesis
Language: English
Published: University of Pretoria 2024
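Subjects: Natural language processing (NLP); Low resourced languages; Monolingual embeddings; Cross-lingual embeddings; Language technologies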
_version_ 1867613614167293952
access_status_str Open Access
author2 Marivate, Vukosi
author_browse Marivate, Vukosi
author_facet Marivate, Vukosi
collection Thesis
dc_rights_str_mv © 2023 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
description Dissertation (MSc (Computer Science))--University of Pretoria, 2024.
format Thesis
id oai:repository.up.ac.za:2263/99401
institution University of Pretoria (South Africa)
language English
last_indexed 2026-06-10T12:38:56.612Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from UPSpace — University of Pretoria Institutional Repository
publishDate 2024
publishDateRange 2024
publishDateSort 2024
publisher University of Pretoria
publisherStr University of Pretoria
record_format dspace
source_str UPSpace — University of Pretoria Institutional Repository
spelling oai:repository.up.ac.za:2263/99401 Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages Marivate, Vukosi sindane.thapelo@tuks.co.za Modupe, Abiodun Sindane, Thapelo Andrew Natural language processing (NLP) Low resourced languages Monolingual embeddings Cross-lingual embeddings Language technologies Dissertation (MSc (Computer Science))--University of Pretoria, 2024. The world continues to witness increasingly complex technological, economic, and societal advancements at an accelerated pace in the space of Natural Language Processing (NLP) and Artificial Intelligence (AI). The availability of massive digital data in various forms such as language data, image data, and numeric data plays a profound role in supporting this upward trend. For example, the availability of tremendous volumes of data in English and other languages with high internet prevalence unlocks the ability to develop high-quality language technologies such as Generative AI systems, Question Answering systems, Translation systems, and other societally impactful technologies we see today. This new era unfolds a simple yet efficacious equation that takes the form (increased datasets = increased performance) operating with proportionality mechanics. Despite the remarkable strides, a concerning consequence has emerged: a widening horizontal divide among globally spoken languages, one that highlights disparities in the benefits of available language technologies across the 7000-plus spoken languages. Key impediments that emerge in addressing such disparities for the underserved languages include data availability, data benchmarking, scaling, internet prevalence, sustainable pipelines, coverage, and lack of expertise. In this work, we extensively scrutinize some of these concerns by first grounding our work in the context of South African languages. 
South Africa has 12 official languages with varying states of resource-prevalence, which provided a perfect case to demonstrate our proposed remedial approaches. To address benchmarking, we proposed standard datasets for all spoken languages; scaling is addressed by showcasing the use of bilingual lexicons as a resource with much higher linguistic coverage to define various techniques that continuously improve our machine learning models; and coverage is demonstrated by accounting for all South African languages in the development of technologies. The main objective of this thesis is to investigate cross-lingual embeddings as cheaper interventions to administer transfer capabilities of various machine learning models across various downstream tasks, in order to foster the development and accessibility of local technologies for low-resourced languages. Cross-lingual embeddings are intra-semantic and inter-translation equivalent representations between high-resourced and low-resourced languages. For this work, these cross-lingual embeddings have demonstrated efficacy in tasks such as News Headlines Classification (NHC), Named Entity Recognition (NER), Part of Speech (POS) Tagging, and Machine Translation (MT), and have shown great potential for the development of localized technologies. The investigations showed that training NLP models with cross-lingual embeddings enhances both transfer and learning-from-scratch capabilities compared to monolingual embedding training. This study also highlighted that increasing supervision signals such as bilingual lexicons for training cross-lingual embeddings also improves their performance. Furthermore, our investigations indicated that no single cross-lingual model works well across all languages. 
We were able to address four key performance points, and we hope the interventions proposed in this study will have a positive impact on the socio-economic status of South Africa and can be scaled to other contexts to empower societies and businesses. Mastercard Scholarship Foundation ABSA Chair For Data Science Computer Science MSc (Computer Science) Unrestricted Faculty of Engineering, Built Environment and Information Technology SDG-04: Quality education 2024-11-26T09:29:46Z 2024-11-26T09:29:46Z 2025-04-20 2024-09-10 Dissertation * http://hdl.handle.net/2263/99401 https://doi.org/10.25403/UPresearchdata.27002596 en © 2023 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. application/pdf University of Pretoria
spellingShingle Natural language processing (NLP)
Low resourced languages
Monolingual embeddings
Cross-lingual embeddings
Language technologies
Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages
title Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages
title_full Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages
title_fullStr Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages
title_full_unstemmed Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages
title_short Harnessing cross-lingual transfer learning techniques to facilitate interventions for low-resourced languages
title_sort harnessing cross lingual transfer learning techniques to facilitate interventions for low resourced languages
topic Natural language processing (NLP)
Low resourced languages
Monolingual embeddings
Cross-lingual embeddings
Language technologies
url http://hdl.handle.net/2263/99401
https://doi.org/10.25403/UPresearchdata.27002596