Characterizing the complexity of tabular machine learning classification problems

Erwin, K. H. 2025. Characterizing the Complexity of Tabular Machine Learning Classification Problems. Unpublished doctoral dissertation. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/712703ce-49c4-415f-b1b5-be4ab1710177

Bibliographic Details
Main Author: Erwin, Kyle Harper
Other Authors: Engelbrecht, Andries
Format: Thesis
Published: Stellenbosch : Stellenbosch University 2025
Subjects:
_version_ 1867613917861117952
access_status_str Open Access
author Erwin, Kyle Harper
author2 Engelbrecht, Andries
author_browse Engelbrecht, Andries
Erwin, Kyle Harper
author_facet Engelbrecht, Andries
Erwin, Kyle Harper
author_sort Erwin, Kyle Harper
collection Thesis
dc_rights_str_mv Stellenbosch University
description Erwin, K. H. 2025. Characterizing the Complexity of Tabular Machine Learning Classification Problems. Unpublished doctoral dissertation. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/712703ce-49c4-415f-b1b5-be4ab1710177
format Thesis
id oai:scholar.sun.ac.za:10019.1/132205
institution Stellenbosch University (South Africa)
last_indexed 2026-06-10T12:43:46.104Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2025
publishDateRange 2025
publishDateSort 2025
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/132205 Characterizing the complexity of tabular machine learning classification problems Erwin, Kyle Harper Engelbrecht, Andries Stellenbosch University. Faculty of Science. Dept. of Computer Science. Machine learning Algorithms Data mining Probabilistic automata UCTD Erwin, K. H. 2025. Characterizing the Complexity of Tabular Machine Learning Classification Problems. Unpublished doctoral dissertation. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/712703ce-49c4-415f-b1b5-be4ab1710177 Thesis (PhD)--Stellenbosch University, 2025. ENGLISH ABSTRACT: Machine learning algorithms are commonly applied to classification problems on tabular datasets. These datasets can be described by their meta-features, such as the number of instances, feature types, and class distribution. Additionally, there exists a subcategory of meta-features, referred to as complexity measures, which estimate the difficulty of a tabular classification problem based on factors such as class ambiguity, data sparsity, and the intricacy of class boundaries. This dissertation conducts a thorough review of these meta-features, including complexity measures and other categories of meta-features. Experimental analysis of existing feature-based complexity measures shows that they are inadequate for accurate quantification of the complexity of synthetic multi-class classification datasets. A new measure, called the F5 measure, is proposed, which evaluates the discriminative power of features for each class and better represents feature complexity for the same synthetic multi-class classification datasets. Additionally, a new category of probabilistic complexity measures is proposed, along with two new probabilistic measures. Experimental analysis shows that these measures complement existing measures and provide more accurate complexity estimates for noisy classification problems.
The complexity values of categorically encoded datasets, specifically datasets with nominal data, are computed and statistical tests are used to determine which categorical encoder produces the least complex datasets. The subset of meta-features necessary to predict the performance of a classifier is identified using statistical and experimental analysis. This analysis required optimization of the hyperparameters of several algorithms for 222 tabular classification problems. Furthermore, over 300 meta-features are computed for each classification problem. Regression experiments on the computed meta-features paired with algorithm test performance show that using either the identified subset of complexity meta-features or landmarking meta-features is sufficient to predict algorithm performance. AFRIKAANSE OPSOMMING: Masjienleer algoritmes word dikwels op klassifikasie probleme met tabel datastelle toegepas. Hierdie datastelle kan deur hul meta-kenmerke, soos die aantal gevalle, die eienskap tipes, en klas-verdeling beskryf word. Daar is ook ’n subkategorie van meta-kenmerke wat na verwys word as kompleksiteitsmaatstawwe, wat die moeilikheidsgraad van ’n tabelvorm klassifikasieprobleem skat gebaseer op klas-onduidelikheid, data-ylheid, en die verwikkeldheid van die klassifiseringsgrense. Hierdie tesis voer ’n deeglike oorsig uit van hierdie metakenmerke, insluitend kompleksiteitsmaatstawwe en ander meta-kenmerke kategorieë. Eksperimentele analise op bestaande eienskap-gebaseerde kompleksiteitsmaatstawwe wys dat hierdie onvoldoende is vir akkurate kwantifisering van komplisiteit op sintetiese multi-nomiale klassifikasie datastelle. ’n Nuwe maatstaf, die F5 maatstaf, word voorgestel, wat die onderskeidende mag van eienskappe vir elke klas evalueer en wat eienskap-kompleksiteit beter verteenwoordig vir dieselfde sintetiese multi-nomiale datastelle.
Verder word ’n nuwe kategorie van waarskynlikheidskompleksiteitsmaatstawwe voorgestel met twee nuwe waarskynlikheidsmaatstawwe. Eksperimentele analise wys dat hierdie maatstawwe bestaande maatstawwe komplimenteer en meer akkurate kompleksiteitsskattings vir data wat ruiswaardes bevat, gee. Die kompleksiteitswaardes van kategoriese data word bereken en statistiese toetse word gedoen om te bepaal watter kategorie enkodeerder die mins-komplekse datastelle genereer. Die sub-stel meta-kenmerke wat nodig is om die prestasie van ’n klassifikasie sisteem te bepaal, word met behulp van statistiese en eksperimentele analise geïdentifiseer. Die analise het die optimering van hiperparameters van is vir elke klassifikasie probleem bereken. Regressie eksperimente op die berekende meta-kenmerke saam met algoritme prestasie wys dat die geïdentifiseerde sub-stel of landmerk kompleksiteit meta-kenmerke voldoende is om algoritme prestasie te voorspel. Doctoral 2025-05-29T12:43:12Z 2025-05-29T12:43:12Z 2025-03 Thesis https://scholar.sun.ac.za/handle/10019.1/132205 Stellenbosch University xvii, 141 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle Machine learning
Algorithms
Data mining
Probabilistic automata
UCTD
Erwin, Kyle Harper
Characterizing the complexity of tabular machine learning classification problems
title Characterizing the complexity of tabular machine learning classification problems
title_full Characterizing the complexity of tabular machine learning classification problems
title_fullStr Characterizing the complexity of tabular machine learning classification problems
title_full_unstemmed Characterizing the complexity of tabular machine learning classification problems
title_short Characterizing the complexity of tabular machine learning classification problems
title_sort characterizing the complexity of tabular machine learning classification problems
topic Machine learning
Algorithms
Data mining
Probabilistic automata
UCTD
url https://scholar.sun.ac.za/handle/10019.1/132205
work_keys_str_mv AT erwinkyleharper characterizingthecomplexityoftabularmachinelearningclassificationproblems