Thesis (MEng)--Stellenbosch University, 2025.
| Main Author: | Schillack, Erwin Andreas |
|---|---|
| Other Authors: | Niesler, T. R. (Thomas) |
| Format: | Thesis |
| Published: | Stellenbosch : Stellenbosch University, 2026 |
| Subjects: | Natural language processing (Computer science); Corpora (Linguistics); Low-resource languages; Linguistic analysis (Linguistics) |

| _version_ | 1867614013956816896 |
|---|---|
| access_status_str | Open Access |
| author | Schillack, Erwin Andreas |
| author2 | Niesler, T. R. (Thomas) |
| author_browse | Niesler, T. R. (Thomas); Schillack, Erwin Andreas |
| author_facet | Niesler, T. R. (Thomas); Schillack, Erwin Andreas |
| author_sort | Schillack, Erwin Andreas |
| collection | Thesis |
| dc_rights_str_mv | Stellenbosch University |
| description | Thesis (MEng)--Stellenbosch University, 2025. |
| format | Thesis |
| id | oai:scholar.sun.ac.za:10019.1/134807 |
| institution | Stellenbosch University (South Africa) |
| last_indexed | 2026-06-10T12:45:17.761Z |
| license_str | Other — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository |
| publishDate | 2026 |
| publishDateRange | 2026 |
| publishDateSort | 2026 |
| publisher | Stellenbosch : Stellenbosch University |
| publisherStr | Stellenbosch : Stellenbosch University |
| record_format | dspace |
| source_str | SUNScholar — Stellenbosch University Repository |
| spelling | oai:scholar.sun.ac.za:10019.1/134807; Language specific web-crawling; Schillack, Erwin Andreas; Niesler, T. R. (Thomas); Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering; Natural language processing (Computer science); Corpora (Linguistics); Low-resource languages; Linguistic analysis (Linguistics); Thesis (MEng)--Stellenbosch University, 2025. Schillack, E. A. 2025. Language specific web-crawling. Unpublished master's thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/b24398f6-a53d-4d82-9deb-03c094b7caa3 ENGLISH ABSTRACT: Language modelling remains a challenge in natural language processing for low-resource languages, particularly those with a limited digital presence. This thesis investigates sentence-level language identification for very short and noisy text inputs drawn from a diverse set of languages, including languages unseen at training time. We assemble a controlled multilingual corpus spanning 122 languages by combining existing sources (the Leipzig Corpora Collection and OPUS) and broadening coverage through web crawling with Apache Nutch and Common Crawl. We introduce Lang2Vec, a language-level embedding learned inside a single shared Distributed Memory (Doc2Vec DM) model. Each sentence carries a persistent language tag whose vector is updated whenever that language appears, yielding one embedding per language in a shared space. We evaluate Lang2Vec against word-level baselines in two experiments: one measures data efficiency, and the other tests unknown-language rejection using languages removed from training. AFRIKAANS ABSTRACT: Low-resource language modelling remains a central challenge in natural language processing (NLP), especially for scarce languages with a limited digital presence. This thesis investigates sentence-level language identification for very short and noisy text inputs from a diverse set of languages, including languages not seen during training. We assemble a controlled multilingual corpus of 122 languages by combining different sources (Leipzig Corpora Collection and OPUS), and we extend it further through web crawling with Apache Nutch and Common Crawl. We introduce Lang2Vec, a language-level embedding learned within a single, shared Distributed Memory (Doc2Vec DM) model. Each sentence carries a permanent language tag whose vector is updated whenever that language occurs, which yields a single embedding per language in a shared space. We evaluate Lang2Vec against word-level baselines in two experiments, in which we measure data efficiency through incremental data addition and test unknown-language rejection by removing entire languages during training. Masters; 2026-01-08T12:40:46Z; 2026-01-08T12:40:46Z; 2025-12; Thesis; https://scholar.sun.ac.za/handle/10019.1/134807; Stellenbosch University; xiii, 135 pages : illustrations; application/pdf; Stellenbosch : Stellenbosch University |
| spellingShingle | Natural language processing (Computer science) Corpora (Linguistics) Low-resource languages Linguistic analysis (Linguistics) Schillack, Erwin Andreas Language specific web-crawling |
| title | Language specific web-crawling |
| title_full | Language specific web-crawling |
| title_fullStr | Language specific web-crawling |
| title_full_unstemmed | Language specific web-crawling |
| title_short | Language specific web-crawling |
| title_sort | language specific web crawling |
| topic | Natural language processing (Computer science); Corpora (Linguistics); Low-resource languages; Linguistic analysis (Linguistics) |
| url | https://scholar.sun.ac.za/handle/10019.1/134807 |
| work_keys_str_mv | AT schillackerwinandreas languagespecificwebcrawling |
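
The abstract describes Lang2Vec as a single shared Distributed Memory (Doc2Vec DM) model in which every sentence carries a persistent language tag, so that one vector per language is updated whenever that language appears. The following is a minimal sketch of that idea only, not the thesis's implementation: it is a toy pure-Python PV-DM trainer with negative sampling, and the names `train_language_vectors`, `TOY_CORPUS`, and `cosine`, the whitespace tokenisation, the example sentences, and all hyperparameters are assumptions made for illustration.

```python
import math
import random

# Hypothetical toy data: (tokens, language tag). Not taken from the thesis corpus.
TOY_CORPUS = [
    ("the cat sits on the mat".split(), "eng"),
    ("the dog walks in the garden".split(), "eng"),
    ("die kat sit op die mat".split(), "afr"),
    ("die hond loop in die tuin".split(), "afr"),
]


def sigmoid(x):
    # Clamp the score so math.exp cannot overflow.
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def train_language_vectors(tagged, dim=16, window=2, negatives=3,
                           epochs=20, lr=0.05, seed=0):
    """Toy PV-DM: the language tag acts as the 'document' vector shared by
    every sentence of that language. Hyperparameters are illustrative."""
    rng = random.Random(seed)
    vocab = sorted({w for tokens, _ in tagged for w in tokens})
    langs = sorted({lang for _, lang in tagged})

    def small_random():
        return [(rng.random() - 0.5) / dim for _ in range(dim)]

    word_in = {w: small_random() for w in vocab}    # context word vectors
    word_out = {w: [0.0] * dim for w in vocab}      # output (prediction) vectors
    lang_vec = {lang: small_random() for lang in langs}  # one vector per language

    for _ in range(epochs):
        for tokens, lang in tagged:
            for i, target in enumerate(tokens):
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                # DM input: the language tag vector plus the context word vectors.
                inputs = [lang_vec[lang]] + [word_in[w] for w in context]
                hidden = [sum(v[k] for v in inputs) / len(inputs) for k in range(dim)]
                # One positive sample (the true word) and a few random negatives.
                samples = [(target, 1.0)]
                samples += [(rng.choice(vocab), 0.0) for _ in range(negatives)]
                grad = [0.0] * dim
                for word, label in samples:
                    if label == 0.0 and word == target:
                        continue  # skip a negative that collides with the target
                    out = word_out[word]
                    score = sum(h * o for h, o in zip(hidden, out))
                    step = lr * (label - sigmoid(score))
                    for k in range(dim):
                        grad[k] += step * out[k]
                        out[k] += step * hidden[k]
                # The same error signal updates the language vector and the context words.
                for v in inputs:
                    for k in range(dim):
                        v[k] += grad[k]
    return lang_vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    vectors = train_language_vectors(TOY_CORPUS)
    print(sorted(vectors))
    print(round(cosine(vectors["eng"], vectors["afr"]), 3))
```

Because the tag vector is keyed by language rather than by sentence, the model ends up with exactly one embedding per language in the same space as the word vectors. How the thesis uses those embeddings for identification and unknown-language rejection is not stated in this record, so the `cosine` helper is shown only as one plausible way to compare them.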