Language identification in a highly unbalanced dataset

Thesis (MEng) -- Stellenbosch University, 2022.

Bibliographic Details
Main Author:	Weber, Ian
Other Authors:	Niesler, Thomas
Format:	Thesis
Language:	en_ZA
Published:	Stellenbosch : Stellenbosch University 2022
Subjects:	Corpora (Linguistics); Machine learning; Data set; Language identification; Computational linguistics; UCTD

_version_	1867613816915755008
access_status_str	Open Access
author	Weber, Ian
author2	Niesler, Thomas
author_browse	Niesler, Thomas Weber, Ian
author_facet	Niesler, Thomas Weber, Ian
author_sort	Weber, Ian
collection	Thesis
dc_rights_str_mv	Stellenbosch University
description	Thesis (MEng) -- Stellenbosch University, 2022.
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/126015
institution	Stellenbosch University (South Africa)
language	en_ZA
last_indexed	2026-06-10T12:42:09.814Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2022
publishDateRange	2022
publishDateSort	2022
publisher	Stellenbosch : Stellenbosch University
publisherStr	Stellenbosch : Stellenbosch University
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/126015 Language identification in a highly unbalanced dataset Weber, Ian Niesler, Thomas Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Corpora (Linguistics) Machine learning Data set Language identification Computational linguistics UCTD Thesis (MEng) -- Stellenbosch University, 2022. ENGLISH ABSTRACT: Automated web scraping algorithms play a key role in the creation of text-based corpora. Part of the web scraping pipeline is to identify the language of the text that is acquired. Most research performed focuses on well-resourced languages leaving many languages under-resourced. In this thesis, we design and test two different feature extraction and machine learning algorithm combinations. By focusing on classifying one target language with high precision and a small seed corpus, we can improve the ability to automatically acquire text for that specific language. We compiled a sub-corpus using the Leipzig Corpora Collection. This data consists mostly of scraped news, web and Wikipedia content. From this sub-corpus, we designed a set of experiments to compare two machine learning algorithms and observe their performance. This was achieved by reducing the number of training sentences in the target language. Furthermore, we removed some languages entirely from the training set to observe the effect on model performance. Cross-validation was used to train multiple models and gain a more accurate estimation of model performance. We designed a model using n-gram features and a logistic regression classifier to function as a baseline. We also designed a model using Word2Vec features and a long short term memory (LSTM) classifier. The latter model is a more modern architecture and is frequently used in text classification tasks. It was found that the n-gram logistic regression models performed better when training resources are very scarce. 
This is an expected result and consistent with previously performed work. When given more training resources, however, the Word2Vec LSTM models achieved impressive results with excellent class separation. AFRIKAANS OPSOMMING: Outomatiese webskraapalgoritmes speel 'n sleutelrol in die skepping van teksgebaseerde korpusse. Deel van die webskraappyplyn, is om die taal van die teks wat ingesamel word, te identifiseer. Die meeste navorsing wat uitgevoer word, fokus op tale met goeie hulpbronne. Minder algemene tale met minder beskikbare hulpbronne word dus agterweë gelaat in die proses. In hierdie tesis ontwerp en toets ons twee verskillende kenmerk-onttrekking en masjienleer-algoritme kombinasies. Deur te fokus op die klassifikasie van een doeltaal, met hoë akkuraatheid en 'n klein saadkorpus, kan ons die vermoë om outomaties teks vir daardie spesifieke taal te verkry verbeter. Ons het 'n sub-korpus saamgestel, deur die Leipzig Corpora-versameling te gebruik. Hierdie data bestaan meestal uit geskraapte nuus, web- en Wikipedia-inhoud. Uit hierdie sub-korpus het ons 'n stel eksperimente ontwerp om twee masjienleeralgoritmes te vergelyk, en die resultate waar te neem. Dit is bereik deur die aantal oefensinne in die doeltaal te verminder. Verder het ons sommige tale heeltemal uit die opleidingstel verwyder om die effek op modelprestasie waar te neem. Kruisvalidering is gebruik om veelvuldige modelle op te lei en 'n meer akkurate skatting van modelprestasie te kry. Ons het 'n model ontwerp wat n-gram kenmerke en 'n logistiese regressieklassifiseerder gebruik om as 'n basislyn te funksioneer. Ons het ook 'n model ontwerp wat Word2Vec-kenmerke en 'n "long term short memory" (LSTM) klassifiseerder gebruik. Laasgenoemde model is 'n meer moderne argitektuur, en word gereeld in teksklassifikasietake gebruik. Daar is gevind dat die n-gram logistiese regressiemodelle beter presteer wanneer opleidingshulpbronne baie skaars is. 
Dit is 'n verwagte resultaat en stem ooreen met voorheen uitgevoerde werk. Toe die Word2Vec LSTM-modelle egter meer opleidingshulpbronne gegee is, is indrukwekkende resultate behaal met uitstekende klasskeiding. Masters 2022-11-16T07:09:01Z 2023-01-16T12:45:47Z 2022-11-16T07:09:01Z 2023-01-16T12:45:47Z 2022-12 Thesis http://hdl.handle.net/10019.1/126015 en_ZA Stellenbosch University xiii, 107 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle	Corpora (Linguistics) Machine learning Data set Language identification Computational linguistics UCTD Weber, Ian Language identification in a highly unbalanced dataset
title	Language identification in a highly unbalanced dataset
title_full	Language identification in a highly unbalanced dataset
title_fullStr	Language identification in a highly unbalanced dataset
title_full_unstemmed	Language identification in a highly unbalanced dataset
title_short	Language identification in a highly unbalanced dataset
title_sort	language identification in a highly unbalanced dataset
topic	Corpora (Linguistics) Machine learning Data set Language identification Computational linguistics UCTD
url	http://hdl.handle.net/10019.1/126015
work_keys_str_mv	AT weberian languageidentificationinahighlyunbalanceddataset
