Exploring the class imbalance problem in text classification

Thesis (MCom)--Stellenbosch University, 2023.

Bibliographic Details
Main Author:	Bezuidenhout, Jean-Pierre
Other Authors:	Lamont, Morne
Format:	Thesis
Language:	en_ZA
Published:	Stellenbosch : Stellenbosch University, 2023
Subjects:	Natural language processing (Computer science); Machine learning; Human-computer interaction; Text classification; UCTD

_version_	1867613742507753472
access_status_str	Open Access
author	Bezuidenhout, Jean-Pierre
author2	Lamont, Morne
author_browse	Bezuidenhout, Jean-Pierre Lamont, Morne
author_facet	Lamont, Morne Bezuidenhout, Jean-Pierre
author_sort	Bezuidenhout, Jean-Pierre
collection	Thesis
dc_rights_str_mv	Stellenbosch University
description	Thesis (MCom)--Stellenbosch University, 2023.
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/126960
institution	Stellenbosch University (South Africa)
language	en_ZA
last_indexed	2026-06-10T12:40:58.715Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2023
publishDateRange	2023
publishDateSort	2023
publisher	Stellenbosch : Stellenbosch University
publisherStr	Stellenbosch : Stellenbosch University
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/126960 Exploring the class imbalance problem in text classification Bezuidenhout, Jean-Pierre Lamont, Morne Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. Natural language processing (Computer science) Machine learning Human-computer interaction Text classification UCTD Thesis (MCom)--Stellenbosch University, 2023. ENGLISH SUMMARY: Natural Language Processing (NLP) is a subfield of computer science focused on leveraging computers to learn from human language. Over the years, the field has been used to perform a wide variety of tasks, which has resulted in many interesting real-world applications. One of these tasks is text classification, where the focus is on developing models that can successfully predict the class label of a textual input from a set of pre-defined category labels. Text classification has previously been applied in the development of automatic spam detection systems and in the analysis of consumer sentiment. Unfortunately, many real-world text data sets have an imbalanced class label distribution. This is often the case for spam data sets, where the majority of observations are labelled as non-spam. In the development of an automatic spam detection system, we want the system to correctly identify spam instances. However, traditional Machine Learning (ML) models are usually overwhelmed by instances in the majority class, which hinders their ability to correctly identify instances in the minority class. The field of imbalanced learning is focused on the manipulation of data and algorithms to address this problem. However, these methods have not been thoroughly explored in the literature. Thus, our objective in this thesis is to contribute new knowledge to the problem of imbalanced class label distributions in the context of text classification. 
The problem is approached by reviewing the literature to identify ML models which were previously applied to text classification tasks. Furthermore, methods that manipulate data and algorithms, and that are well suited to the task of imbalanced learning, are identified from the literature. The performance of these techniques is investigated by means of an empirical study focused on real-world movie review data. Simulated scenarios with varying degrees of class imbalance are investigated in order to study the robustness of classifiers on imbalanced data problems, and to analyse the performance of imbalanced learning techniques. For the data set that was analysed, our findings suggest that some classifiers are more robust to class imbalance than others, and that performance gains are possible when imbalanced learning techniques are included in the learning process. AFRIKAANSE OPSOMMING: Natuurlike taalverwerking is ’n subveld in rekenaarwetenskap wat daarop gefokus is om rekenaars te gebruik om uit menslike taal te leer. Oor die jare is die veld gebruik om ’n wye verskeidenheid take uit te voer wat gelei het tot baie interessante regte-wêreld toepassings. Een van hierdie take is teksklassifikasie, waar die fokus is op die ontwikkeling van modelle wat in staat is om die klastoekenning suksesvol te voorspel vir teksinsette vanaf ’n stel vooraf gedefinieerde klasse. Teksklassifikasie is voorheen toegepas in die ontwikkeling van outomatiese gemorsposopsporingstelsels en in die ontleding van verbruikersentiment. Ongelukkig het baie regte-wêreld teksdata ’n ongebalanseerde klasverspreiding. Dit is dikwels die geval vir gemorsposdatastelle, waar die meeste waarnemings as nie-gemorspos bestempel word. In die ontwikkeling van ’n outomatiese gemorsposopsporingstelsel wil ons hê dat die stelsel gemorsposgevalle korrek identifiseer. 
Tradisionele Masjienleer (ML) modelle word egter gewoonlik oorweldig deur gevalle in die meerderheidsklas, wat die vermoë van hierdie modelle belemmer om gevalle in die minderheidsklas korrek te identifiseer. Die veld van ongebalanseerde leer is gefokus op die manipulering van data en algoritmes om die probleem wat pas beskryf is, aan te spreek. Hierdie metodes is egter nie deeglik in die literatuur ondersoek nie. Ons doelwit in hierdie tesis is dus om nuwe kennis by te dra tot die probleem van ongebalanseerde klasverspreidings in die konteks van teksklassifikasie. Die probleem word benader deur die literatuur te hersien om ML-modelle te identifiseer wat voorheen op teksklassifikasietake toegepas is. Verder word metodes uit die literatuur geïdentifiseer wat data en algoritmes manipuleer wat goed geskik is vir die taak van ongebalanseerde leer. Die prestasie van hierdie tegnieke word ondersoek deur middel van ’n empiriese studie wat gefokus het op werklike filmresensiedata. Gesimuleerde scenario’s met verskillende grade van klaswanbalans word ondersoek om die robuustheid van klassifiseerders op ongebalanseerde dataprobleme te bestudeer, en om die prestasie van ongebalanseerde leertegnieke te ontleed. Vir die datastel wat ontleed is, dui die resultate van ons bevindinge daarop dat sommige klassifiseerders meer robuust is vir klaswanbalans as ander, en dat prestasietoenames moontlik is wanneer ongebalanseerde leertegnieke by die leerproses ingesluit word. Masters 2023-02-03T09:30:33Z 2023-05-18T06:57:41Z 2023-02-03T09:30:33Z 2023-05-18T06:57:41Z 2023-03 Thesis http://hdl.handle.net/10019.1/126960 en_ZA Stellenbosch University xii, 83 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle	Natural language processing (Computer science) Machine learning Human-computer interaction Text classification UCTD Bezuidenhout, Jean-Pierre Exploring the class imbalance problem in text classification
title	Exploring the class imbalance problem in text classification
title_full	Exploring the class imbalance problem in text classification
title_fullStr	Exploring the class imbalance problem in text classification
title_full_unstemmed	Exploring the class imbalance problem in text classification
title_short	Exploring the class imbalance problem in text classification
title_sort	exploring the class imbalance problem in text classification
topic	Natural language processing (Computer science) Machine learning Human-computer interaction Text classification UCTD
url	http://hdl.handle.net/10019.1/126960
work_keys_str_mv	AT bezuidenhoutjeanpierre exploringtheclassimbalanceproblemintextclassification

Similar Items