Language specific web-crawling

Thesis (MEng)--Stellenbosch University, 2025.

Bibliographic Details
Main Author: Schillack, Erwin Andreas
Other Authors: Niesler, T. R. (Thomas)
Format: Thesis
Published: Stellenbosch : Stellenbosch University 2026
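Subjects: Natural language processing (Computer science); Corpora (Linguistics); Low-resource languages; Linguistic analysis (Linguistics)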
_version_ 1867614013956816896
access_status_str Open Access
author Schillack, Erwin Andreas
author2 Niesler, T. R. (Thomas)
author_browse Niesler, T. R. (Thomas)
Schillack, Erwin Andreas
author_facet Niesler, T. R. (Thomas)
Schillack, Erwin Andreas
author_sort Schillack, Erwin Andreas
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MEng)--Stellenbosch University, 2025.
format Thesis
id oai:scholar.sun.ac.za:10019.1/134807
institution Stellenbosch University (South Africa)
last_indexed 2026-06-10T12:45:17.761Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2026
publishDateRange 2026
publishDateSort 2026
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/134807
Language specific web-crawling
Schillack, Erwin Andreas
Niesler, T. R. (Thomas)
Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering.
Natural language processing (Computer science)
Corpora (Linguistics)
Low-resource languages
Linguistic analysis (Linguistics)
Thesis (MEng)--Stellenbosch University, 2025.
Schillack, E. A. 2025. Language specific web-crawling. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/b24398f6-a53d-4d82-9deb-03c094b7caa3
ENGLISH ABSTRACT: Low-resource language modelling remains a challenge in natural language processing, particularly for languages with a limited digital presence. This thesis investigates sentence-level language identification for very short and noisy text inputs drawn from a diverse set of languages, including languages unseen at training time. We assemble a controlled multilingual corpus spanning 122 languages by combining existing sources (Leipzig Corpora Collection and OPUS) and broadening coverage through web crawling with Apache Nutch and Common Crawl. We introduce Lang2Vec, a language-level embedding learned inside a single shared Distributed Memory (Doc2Vec DM) model. Each sentence carries a persistent language tag whose vector is updated whenever that language appears, yielding one embedding per language in a shared space. We evaluate Lang2Vec against word-level baselines in two experiments: one measures data efficiency, and the other measures unknown-language rejection when languages are removed from training.
AFRIKAANS SUMMARY: Low-resource language modelling remains a central challenge in natural language processing (NLP), especially for scarce languages with a limited digital presence. This thesis investigates sentence-level language identification for very short and noisy text inputs from a diverse set of languages, including languages not seen during training. We assemble a controlled multilingual corpus of 122 languages by combining different sources (Leipzig Corpora Collection and OPUS), and extend it further through web crawling with Apache Nutch and Common Crawl. We introduce Lang2Vec, a language-level embedding learned within a single, shared Distributed Memory (Doc2Vec DM) model. Each sentence carries a persistent language tag whose vector is updated whenever that language occurs, yielding a single embedding per language in a shared space. We evaluate Lang2Vec against word-level baselines in two experiments: one measures data efficiency through incremental data addition, and the other tests unknown-language rejection by removing entire languages during training.
Masters
2026-01-08T12:40:46Z
2026-01-08T12:40:46Z
2025-12
Thesis
https://scholar.sun.ac.za/handle/10019.1/134807
Stellenbosch University
xiii, 135 pages : illustrations
application/pdf
Stellenbosch : Stellenbosch University
spellingShingle Natural language processing (Computer science)
Corpora (Linguistics)
Low-resource languages
Linguistic analysis (Linguistics)
Schillack, Erwin Andreas
Language specific web-crawling
title Language specific web-crawling
title_full Language specific web-crawling
title_fullStr Language specific web-crawling
title_full_unstemmed Language specific web-crawling
title_short Language specific web-crawling
title_sort language specific web crawling
topic Natural language processing (Computer science)
Corpora (Linguistics)
Low-resource languages
Linguistic analysis (Linguistics)
url https://scholar.sun.ac.za/handle/10019.1/134807
work_keys_str_mv AT schillackerwinandreas languagespecificwebcrawling