Exploring the class imbalance problem in text classification

Thesis (MCom)--Stellenbosch University, 2023.

Bibliographic Details
Main Author: Bezuidenhout, Jean-Pierre
Other Authors: Lamont, Morne
Format: Thesis
Language: en_ZA
Published: Stellenbosch : Stellenbosch University 2023
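Subjects: Natural language processing (Computer science); Machine learning; Human-computer interaction; Text classification; UCTD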
_version_ 1867613742507753472
access_status_str Open Access
author Bezuidenhout, Jean-Pierre
author2 Lamont, Morne
author_browse Bezuidenhout, Jean-Pierre
Lamont, Morne
author_facet Lamont, Morne
Bezuidenhout, Jean-Pierre
author_sort Bezuidenhout, Jean-Pierre
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MCom)--Stellenbosch University, 2023.
format Thesis
id oai:scholar.sun.ac.za:10019.1/126960
institution Stellenbosch University (South Africa)
language en_ZA
last_indexed 2026-06-10T12:40:58.715Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2023
publishDateRange 2023
publishDateSort 2023
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/126960 Exploring the class imbalance problem in text classification Bezuidenhout, Jean-Pierre Lamont, Morne Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. Natural language processing (Computer science) Machine learning Human-computer interaction Text classification UCTD Thesis (MCom)--Stellenbosch University, 2023. ENGLISH SUMMARY: Natural Language Processing (NLP) is a subfield of computer science that focuses on using computers to learn from human language. Over the years, the field has been used to perform a wide variety of tasks, which has resulted in many interesting real-world applications. One of these tasks is text classification, where the focus is on developing models that can successfully predict the class label of a textual input from a set of pre-defined category labels. Text classification has previously been applied in the development of automatic spam detection systems and in the analysis of consumer sentiment. Unfortunately, many real-world text data sets have an imbalanced class label distribution. This is often the case for spam data sets, where the majority of observations are labelled as non-spam. In the development of an automatic spam detection system, we want the system to correctly identify spam instances. However, traditional Machine Learning (ML) models are usually overwhelmed by instances in the majority class, which hinders the ability of these models to correctly identify instances in the minority class. The field of imbalanced learning focuses on the manipulation of data and algorithms to address this problem. However, these methods have not been thoroughly explored in the literature. Thus, our objective in this thesis is to contribute new knowledge to the problem of imbalanced class label distributions in the context of text classification. 
The problem is approached by reviewing the literature to identify ML models that were previously applied to text classification tasks. Furthermore, methods that manipulate data and algorithms and that are well suited to the task of imbalanced learning are identified from the literature. The performance of these techniques is investigated by means of an empirical study that focuses on real-world movie review data. Simulated scenarios with varying degrees of class imbalance are investigated in order to study the robustness of classifiers on imbalanced data problems, and to analyse the performance of imbalanced learning techniques. For the data set that was analysed, our results suggest that some classifiers are more robust to class imbalance than others, and that performance gains are possible when imbalanced learning techniques are included in the learning process. AFRIKAANSE OPSOMMING: Natuurlike taalverwerking is ’n subveld in rekenaarwetenskap wat daarop gefokus is om rekenaars te gebruik om uit menslike taal te leer. Oor die jare is die veld gebruik om ’n wye verskeidenheid take uit te voer wat gelei het tot baie interessante regte-wêreld toepassings. Een van hierdie take is teksklassifikasie, waar die fokus is op die ontwikkeling van modelle wat in staat is om die klastoekenning suksesvol te voorspel vir teksinsette vanaf ’n stel vooraf gedefinieerde klasse. Teksklassifikasie is voorheen toegepas in die ontwikkeling van outomatiese gemorsposopsporingstelsels en in die ontleding van verbruikersentiment. Ongelukkig het baie regte-wêreld teksdata ’n ongebalanseerde klasverspreiding. Dit is dikwels die geval vir gemorsposdatastelle, waar die meeste waarnemings as nie-gemorspos bestempel word. In die ontwikkeling van ’n outomatiese gemorsposopsporingstelsel wil ons hê dat die stelsel gemorsposgevalle korrek identifiseer. 
Tradisionele Masjienleer (ML) modelle word egter gewoonlik oorweldig deur gevalle in die meerderheidsklas, wat die vermoë van hierdie modelle belemmer om gevalle in die minderheidsklas korrek te identifiseer. Die veld van ongebalanseerde leer is gefokus op die manipulering van data en algoritmes om die probleem wat pas beskryf is, aan te spreek. Hierdie metodes is egter nie deeglik in die literatuur ondersoek nie. Ons doelwit in hierdie tesis is dus om nuwe kennis by te dra tot die probleem van ongebalanseerde klasverspreidings in die konteks van teksklassifikasie. Die probleem word benader deur die literatuur te hersien om ML-modelle te identifiseer wat voorheen op teksklassifikasietake toegepas is. Verder word metodes uit die literatuur geïdentifiseer wat data en algoritmes manipuleer wat goed geskik is vir die taak van ongebalanseerde leer. Die prestasie van hierdie tegnieke word ondersoek deur middel van ’n empiriese studie wat gefokus het op werklike filmresensiedata. Gesimuleerde scenario’s met verskillende grade van klaswanbalans word ondersoek om die robuustheid van klassifiseerders op ongebalanseerde dataprobleme te bestudeer, en om die prestasie van ongebalanseerde leertegnieke te ontleed. Vir die datastel wat ontleed is, dui die resultate van ons bevindinge daarop dat sommige klassifiseerders meer robuust is vir klaswanbalans as ander, en dat prestasietoenames moontlik is wanneer ongebalanseerde leertegnieke by die leerproses ingesluit word. Masters 2023-02-03T09:30:33Z 2023-05-18T06:57:41Z 2023-02-03T09:30:33Z 2023-05-18T06:57:41Z 2023-03 Thesis http://hdl.handle.net/10019.1/126960 en_ZA Stellenbosch University xii, 83 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle Natural language processing (Computer science)
Machine learning
Human-computer interaction
Text classification
UCTD
Bezuidenhout, Jean-Pierre
Exploring the class imbalance problem in text classification
title Exploring the class imbalance problem in text classification
title_full Exploring the class imbalance problem in text classification
title_fullStr Exploring the class imbalance problem in text classification
title_full_unstemmed Exploring the class imbalance problem in text classification
title_short Exploring the class imbalance problem in text classification
title_sort exploring the class imbalance problem in text classification
topic Natural language processing (Computer science)
Machine learning
Human-computer interaction
Text classification
UCTD
url http://hdl.handle.net/10019.1/126960
work_keys_str_mv AT bezuidenhoutjeanpierre exploringtheclassimbalanceproblemintextclassification