Characterizing the complexity of tabular machine learning classification problems

Erwin, K. H. 2025. Characterizing the Complexity of Tabular Machine Learning Classification Problems. Unpublished doctoral dissertation. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/712703ce-49c4-415f-b1b5-be4ab1710177

Bibliographic Details
Main Author:	Erwin, Kyle Harper
Other Authors:	Engelbrecht, Andries
Format:	Thesis
Published:	Stellenbosch : Stellenbosch University 2025
Subjects:	Machine learning Algorithms Data mining Probabilistic automata UCTD

_version_	1867613917861117952
access_status_str	Open Access
author	Erwin, Kyle Harper
author2	Engelbrecht, Andries
author_browse	Engelbrecht, Andries Erwin, Kyle Harper
author_facet	Engelbrecht, Andries Erwin, Kyle Harper
author_sort	Erwin, Kyle Harper
collection	Thesis
dc_rights_str_mv	Stellenbosch University
description	Erwin, K. H. 2025. Characterizing the Complexity of Tabular Machine Learning Classification Problems. Unpublished doctoral dissertation. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/712703ce-49c4-415f-b1b5-be4ab1710177
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/132205
institution	Stellenbosch University (South Africa)
last_indexed	2026-06-10T12:43:46.104Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2025
publishDateRange	2025
publishDateSort	2025
publisher	Stellenbosch : Stellenbosch University
publisherStr	Stellenbosch : Stellenbosch University
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/132205 Characterizing the complexity of tabular machine learning classification problems Erwin, Kyle Harper Engelbrecht, Andries Stellenbosch University. Faculty of Science. Dept. of Computer Science. Machine learning Algorithms Data mining Probabilistic automata UCTD Erwin, K. H. 2025. Characterizing the Complexity of Tabular Machine Learning Classification Problems. Unpublished doctoral dissertation. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/712703ce-49c4-415f-b1b5-be4ab1710177 Thesis (PhD)--Stellenbosch University, 2025. ENGLISH ABSTRACT: Machine learning algorithms are commonly applied to classification problems on tabular datasets. These datasets can be described by their meta-features, such as the number of instances, feature types, and class distribution. A subcategory of meta-features, referred to as complexity measures, estimates the difficulty of a tabular classification problem based on factors such as class ambiguity, data sparsity, and the intricacy of class boundaries. This dissertation conducts a thorough review of meta-features, including complexity measures and the other meta-feature categories. Experimental analysis of existing feature-based complexity measures shows that they are inadequate for accurately quantifying the complexity of synthetic multiclass classification datasets. A new measure, called the F5 measure, is proposed, which evaluates the discriminative power of features for each class and better represents feature complexity for the same synthetic multiclass classification datasets. Additionally, a new category of probabilistic complexity measures is proposed, along with two new probabilistic measures. Experimental analysis shows that these measures complement existing measures and provide more accurate complexity estimates for noisy classification problems.
The complexity values of categorically encoded datasets, specifically datasets with nominal data, are computed, and statistical tests are used to determine which categorical encoder produces the least complex datasets. The subset of meta-features necessary to predict the performance of a classifier is identified using statistical and experimental analysis. This analysis required optimization of the hyperparameters of several algorithms for 222 tabular classification problems. Furthermore, over 300 meta-features are computed for each classification problem. Regression experiments on the computed meta-features paired with algorithm test performance show that using either the identified subset of complexity meta-features or landmarking meta-features is sufficient to predict algorithm performance.
AFRIKAANS ABSTRACT: Machine learning algorithms are often applied to classification problems on tabular datasets. These datasets can be described by their meta-features, such as the number of instances, the feature types, and the class distribution. There is also a subcategory of meta-features, referred to as complexity measures, which estimates the difficulty of a tabular classification problem based on class ambiguity, data sparsity, and the intricacy of the classification boundaries. This thesis conducts a thorough review of these meta-features, including complexity measures and other categories of meta-features. Experimental analysis of existing feature-based complexity measures shows that they are inadequate for accurately quantifying complexity on synthetic multinomial classification datasets. A new measure, the F5 measure, is proposed, which evaluates the discriminative power of features for each class and better represents feature complexity for the same synthetic multinomial datasets.
Furthermore, a new category of probabilistic complexity measures is proposed, with two new probabilistic measures. Experimental analysis shows that these measures complement existing measures and give more accurate complexity estimates for data that contains noise. The complexity values of categorical data are computed, and statistical tests are performed to determine which categorical encoder generates the least complex datasets. The subset of meta-features needed to determine the performance of a classification system is identified by means of statistical and experimental analysis. The analysis required the optimization of the hyperparameters of [...] computed for each classification problem. Regression experiments on the computed meta-features together with algorithm performance show that the identified subset or landmarking complexity meta-features are sufficient to predict algorithm performance. Doctoral 2025-05-29T12:43:12Z 2025-05-29T12:43:12Z 2025-03 Thesis https://scholar.sun.ac.za/handle/10019.1/132205 Stellenbosch University xvii, 141 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle	Machine learning Algorithms Data mining Probabilistic automata UCTD Erwin, Kyle Harper Characterizing the complexity of tabular machine learning classification problems
title	Characterizing the complexity of tabular machine learning classification problems
title_full	Characterizing the complexity of tabular machine learning classification problems
title_fullStr	Characterizing the complexity of tabular machine learning classification problems
title_full_unstemmed	Characterizing the complexity of tabular machine learning classification problems
title_short	Characterizing the complexity of tabular machine learning classification problems
title_sort	characterizing the complexity of tabular machine learning classification problems
topic	Machine learning Algorithms Data mining Probabilistic automata UCTD
url	https://scholar.sun.ac.za/handle/10019.1/132205
work_keys_str_mv	AT erwinkyleharper characterizingthecomplexityoftabularmachinelearningclassificationproblems

Similar Items