Erwin, K. H. 2025. Characterizing the Complexity of Tabular Machine Learning Classification Problems. Unpublished doctoral dissertation. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/712703ce-49c4-415f-b1b5-be4ab1710177
| Main Author: | Erwin, Kyle Harper |
|---|---|
| Other Authors: | Engelbrecht, Andries |
| Format: | Thesis |
| Published: | Stellenbosch : Stellenbosch University, 2025 |
| Subjects: | Machine learning; Algorithms; Data mining; Probabilistic automata; UCTD |
| _version_ | 1867613917861117952 |
|---|---|
| access_status_str | Open Access |
| author | Erwin, Kyle Harper |
| author2 | Engelbrecht, Andries |
| author_browse | Engelbrecht, Andries Erwin, Kyle Harper |
| author_facet | Engelbrecht, Andries Erwin, Kyle Harper |
| author_sort | Erwin, Kyle Harper |
| collection | Thesis |
| dc_rights_str_mv | Stellenbosch University |
| description | Erwin, K. H. 2025. Characterizing the Complexity of Tabular Machine Learning Classification Problems. Unpublished doctoral dissertation. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/712703ce-49c4-415f-b1b5-be4ab1710177 |
| format | Thesis |
| id | oai:scholar.sun.ac.za:10019.1/132205 |
| institution | Stellenbosch University (South Africa) |
| last_indexed | 2026-06-10T12:43:46.104Z |
| license_str | Other — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository |
| publishDate | 2025 |
| publishDateRange | 2025 |
| publishDateSort | 2025 |
| publisher | Stellenbosch : Stellenbosch University |
| publisherStr | Stellenbosch : Stellenbosch University |
| record_format | dspace |
| source_str | SUNScholar — Stellenbosch University Repository |
| spelling | oai:scholar.sun.ac.za:10019.1/132205 Characterizing the complexity of tabular machine learning classification problems Erwin, Kyle Harper Engelbrecht, Andries Stellenbosch University. Faculty of Science. Dept. of Computer Science. Machine learning Algorithms Data mining Probabilistic automata UCTD Erwin, K. H. 2025. Characterizing the Complexity of Tabular Machine Learning Classification Problems. Unpublished doctoral dissertation. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/712703ce-49c4-415f-b1b5-be4ab1710177 Thesis (PhD)--Stellenbosch University, 2025. ENGLISH ABSTRACT: Machine learning algorithms are commonly applied to classification problems on tabular datasets. These datasets can be described by their meta-features, such as the number of instances, feature types, and class distribution. Additionally, there exists a subcategory of meta-features, referred to as complexity measures, which estimate the difficulty of a tabular classification problem based on factors such as class ambiguity, data sparsity, and the intricacy of class boundaries. This dissertation conducts a thorough review of the meta-features, including complexity measures and other categories of meta-features. Experimental analysis on existing feature-based complexity measures shows that they are inadequate for accurate quantification of the complexity of synthetic multi-class classification datasets. A new measure, called the F5 measure, is proposed, which evaluates the discriminative power of features for each class and better represents feature complexity for the same synthetic multi-class classification datasets. Additionally, a new category of probabilistic complexity measures is also proposed, along with two new probabilistic measures. Experimental analysis shows that these measures complement existing measures and provide more accurate complexity estimates for noisy classification problems.
The complexity values of categorically encoded datasets, specifically datasets with nominal data, are computed and statistical tests are used to determine which categorical encoder produces the least complex datasets. The subset of meta-features necessary to predict the performance of a classifier is identified using statistical and experimental analysis. This analysis required optimization of the hyperparameters of several algorithms for 222 tabular classification problems. Furthermore, over 300 meta-features are computed for each classification problem. Regression experiments on the computed meta-features paired with algorithm test performance show that using either the identified subset of complexity meta-features or landmarking meta-features is sufficient to predict algorithm performance. AFRIKAANSE OPSOMMING: Masjienleer algoritmes word dikwels op klassifikasie probleme met tabel datastelle toegepas. Hierdie datastelle kan deur hul meta-kenmerke, soos die aantal gevalle, die eienskap tipes, en klas-verdeling beskryf word. Daar is ook ’n subkategorie van meta-kenmerke wat na verwys word as kompleksiteitsmaatstawwe, wat die moeilikheidsgraad van ’n tabelvorm klassifikasieprobleem skat gebaseer op klas-onduidelikheid, data-ylheid, en die verwikkeldheid van die klassifiseringsgrense. Hierdie tesis voer ’n deeglike oorsig uit van hierdie metakenmerke, insluitend kompleksiteitsmaatstawwe en ander meta-kenmerke kategorieë. Eksperimentele analise op bestaande eienskap-gebaseerde kompleksiteitsmaatstawwe wys dat hierdie onvoldoende is vir akkurate kwantifisering van komplisiteit op sintetiese multi-nomiale klassifikasie datastelle. ’n Nuwe maatstaf, die F5 maatstaf, word voorgestel, wat die onderskeidende mag van eienskappe vir elke klas evalueer en wat eienskap-kompleksiteit beter verteenwoordig vir dieselfde sintetiese multi-nomiale datastelle.
Verder word ’n nuwe kategorie van waarskynlikheidskompleksiteitsmaatstawwe voorgestel met twee nuwe waarskynlikheidsmaatstawwe. Eksperimentele analise wys dat hierdie maatstawwe bestaande maatstawwe komplimenteer en meer akkurate kompleksiteitsskattings vir data wat ruiswaardes bevat, gee. Die kompleksiteitswaardes van kategoriese data word bereken en statistiese toetse word gedoen om te bepaal watter kategorie enkodeerder die mins-komplekse datastelle genereer. Die sub-stel meta-kenmerke wat nodig is om die prestasie van ’n klassifikasie sisteem te bepaal, word met behulp van statistiese en eksperimentele analise geïdentifiseer. Die analise het die optimering van hiperparameters van is vir elke klassifikasie probleem bereken. Regressie eksperimente op die berekende meta-kenmerke saam met algoritme prestasie wys dat die geïdentifiseerde sub-stel of landmerk kompleksiteit meta-kenmerke voldoende is om algoritme prestasie te voorspel. Doctoral 2025-05-29T12:43:12Z 2025-05-29T12:43:12Z 2025-03 Thesis https://scholar.sun.ac.za/handle/10019.1/132205 Stellenbosch University xvii, 141 pages : illustrations application/pdf Stellenbosch : Stellenbosch University |
| spellingShingle | Machine learning Algorithms Data mining Probabilistic automata UCTD Erwin, Kyle Harper Characterizing the complexity of tabular machine learning classification problems |
| title | Characterizing the complexity of tabular machine learning classification problems |
| title_full | Characterizing the complexity of tabular machine learning classification problems |
| title_fullStr | Characterizing the complexity of tabular machine learning classification problems |
| title_full_unstemmed | Characterizing the complexity of tabular machine learning classification problems |
| title_short | Characterizing the complexity of tabular machine learning classification problems |
| title_sort | characterizing the complexity of tabular machine learning classification problems |
| topic | Machine learning Algorithms Data mining Probabilistic automata UCTD |
| url | https://scholar.sun.ac.za/handle/10019.1/132205 |
| work_keys_str_mv | AT erwinkyleharper characterizingthecomplexityoftabularmachinelearningclassificationproblems |