Thesis (MEng) -- Stellenbosch University, 2022.
| Main Author: | Weber, Ian |
|---|---|
| Other Authors: | Niesler, Thomas |
| Format: | Thesis |
| Language: | en_ZA |
| Published: | Stellenbosch : Stellenbosch University, 2022 |
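| Subjects: | Corpora (Linguistics); Machine learning; Data set; Language identification; Computational linguistics; UCTD |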
| _version_ | 1867613816915755008 |
|---|---|
| access_status_str | Open Access |
| author | Weber, Ian |
| author2 | Niesler, Thomas |
| author_browse | Niesler, Thomas Weber, Ian |
| author_facet | Niesler, Thomas Weber, Ian |
| author_sort | Weber, Ian |
| collection | Thesis |
| dc_rights_str_mv | Stellenbosch University |
| description | Thesis (MEng) -- Stellenbosch University, 2022. |
| format | Thesis |
| id | oai:scholar.sun.ac.za:10019.1/126015 |
| institution | Stellenbosch University (South Africa) |
| language | en_ZA |
| last_indexed | 2026-06-10T12:42:09.814Z |
| license_str | Other — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository |
| publishDate | 2022 |
| publishDateRange | 2022 |
| publishDateSort | 2022 |
| publisher | Stellenbosch : Stellenbosch University |
| publisherStr | Stellenbosch : Stellenbosch University |
| record_format | dspace |
| source_str | SUNScholar — Stellenbosch University Repository |
| spelling | oai:scholar.sun.ac.za:10019.1/126015 Language identification in a highly unbalanced dataset Weber, Ian Niesler, Thomas Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Corpora (Linguistics) Machine learning Data set Language identification Computational linguistics UCTD Thesis (MEng) -- Stellenbosch University, 2022. ENGLISH ABSTRACT: Automated web scraping algorithms play a key role in the creation of text-based corpora. Part of the web scraping pipeline is to identify the language of the text that is acquired. Most research performed focuses on well-resourced languages leaving many languages under-resourced. In this thesis, we design and test two different feature extraction and machine learning algorithm combinations. By focusing on classifying one target language with high precision and a small seed corpus, we can improve the ability to automatically acquire text for that specific language. We compiled a sub-corpus using the Leipzig Corpora Collection. This data consists mostly of scraped news, web and Wikipedia content. From this sub-corpus, we designed a set of experiments to compare two machine learning algorithms and observe their performance. This was achieved by reducing the number of training sentences in the target language. Furthermore, we removed some languages entirely from the training set to observe the effect on model performance. Cross-validation was used to train multiple models and gain a more accurate estimation of model performance. We designed a model using n-gram features and a logistic regression classifier to function as a baseline. We also designed a model using Word2Vec features and a long short term memory (LSTM) classifier. The latter model is a more modern architecture and is frequently used in text classification tasks. It was found that the n-gram logistic regression models performed better when training resources are very scarce. 
This is an expected result and consistent with previously performed work. When given more training resources, however, the Word2Vec LSTM models achieved impressive results with excellent class separation. AFRIKAANS OPSOMMING: Outomatiese webskraapalgoritmes speel 'n sleutelrol in die skepping van teksgebaseerde korpusse. Deel van die webskraappyplyn, is om die taal van die teks wat ingesamel word, te identifiseer. Die meeste navorsing wat uitgevoer word, fokus op tale met goeie hulpbronne. Minder algemene tale met minder beskikbare hulpbronne word dus agterweë gelaat in die proses. In hierdie tesis ontwerp en toets ons twee verskillende kenmerk-onttrekking en masjienleer-algoritme kombinasies. Deur te fokus op die klassifikasie van een doeltaal, met hoë akkuraatheid en 'n klein saadkorpus, kan ons die vermoë om outomaties teks vir daardie spesifieke taal te verkry verbeter. Ons het 'n sub-korpus saamgestel, deur die Leipzig Corpora-versameling te gebruik. Hierdie data bestaan meestal uit geskraapte nuus, web- en Wikipedia-inhoud. Uit hierdie sub-korpus het ons 'n stel eksperimente ontwerp om twee masjienleeralgoritmes te vergelyk, en die resultate waar te neem. Dit is bereik deur die aantal oefensinne in die doeltaal te verminder. Verder het ons sommige tale heeltemal uit die opleidingstel verwyder om die effek op modelprestasie waar te neem. Kruisvalidering is gebruik om veelvuldige modelle op te lei en 'n meer akkurate skatting van modelprestasie te kry. Ons het 'n model ontwerp wat n-gram-kenmerke en 'n logistiese regressieklassifiseerder gebruik om as 'n basislyn te funksioneer. Ons het ook 'n model ontwerp wat Word2Vec-kenmerke en 'n "long short-term memory" (LSTM) klassifiseerder gebruik. Laasgenoemde model is 'n meer moderne argitektuur, en word gereeld in teksklassifikasietake gebruik. Daar is gevind dat die n-gram logistiese regressiemodelle beter presteer wanneer opleidingshulpbronne baie skaars is. 
Dit is 'n verwagte resultaat en stem ooreen met voorheen uitgevoerde werk. Toe die Word2Vec LSTM-modelle egter meer opleidingshulpbronne gegee is, is indrukwekkende resultate behaal met uitstekende klasskeiding. Masters 2022-11-16T07:09:01Z 2023-01-16T12:45:47Z 2022-11-16T07:09:01Z 2023-01-16T12:45:47Z 2022-12 Thesis http://hdl.handle.net/10019.1/126015 en_ZA Stellenbosch University xiii, 107 pages : illustrations application/pdf Stellenbosch : Stellenbosch University |
| spellingShingle | Corpora (Linguistics) Machine learning Data set Language identification Computational linguistics UCTD Weber, Ian Language identification in a highly unbalanced dataset |
| title | Language identification in a highly unbalanced dataset |
| title_full | Language identification in a highly unbalanced dataset |
| title_fullStr | Language identification in a highly unbalanced dataset |
| title_full_unstemmed | Language identification in a highly unbalanced dataset |
| title_short | Language identification in a highly unbalanced dataset |
| title_sort | language identification in a highly unbalanced dataset |
| topic | Corpora (Linguistics) Machine learning Data set Language identification Computational linguistics UCTD |
| url | http://hdl.handle.net/10019.1/126015 |
| work_keys_str_mv | AT weberian languageidentificationinahighlyunbalanceddataset |