Label-dependent splitting for multi-label data

Thesis (PhD)--Stellenbosch University, 2023.

Bibliographic Details
Main Author: Muller, Annegret
Other Authors: Steel, S. J.
Format: Thesis
Language: en_ZA
Published: Stellenbosch : Stellenbosch University 2023
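Subjects: Supervised learning (Machine learning); Predictive control; Artificial Intelligence; Computational intelligence; Regression analysis -- Data processing; Mathematical statistics; UCTD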
_version_ 1867613994583326720
access_status_str Open Access
author Muller, Annegret
author2 Steel, S. J.
author_browse Muller, Annegret
Steel, S. J.
author_facet Steel, S. J.
Muller, Annegret
author_sort Muller, Annegret
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (PhD)--Stellenbosch University, 2023.
format Thesis
id oai:scholar.sun.ac.za:10019.1/128900
institution Stellenbosch University (South Africa)
language en_ZA
last_indexed 2026-06-10T12:44:59.428Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2023
publishDateRange 2023
publishDateSort 2023
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/128900 Label-dependent splitting for multi-label data Muller, Annegret Steel, S. J. Sandrock, T. Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. Supervised learning (Machine learning) Predictive control Artificial Intelligence Computational intelligence Regression analysis -- Data processing Mathematical statistics UCTD Thesis (PhD)--Stellenbosch University, 2023. ENGLISH SUMMARY: Multi-label classification problems arise in scenarios where every data case can be associated with multiple labels simultaneously. Compared to single-label data, multi-label data possess unique characteristics which result in additional challenges when analysing the data. The aim of this dissertation is to address two of these challenging aspects of multi-label data. The first is the exploitation of label correlations to achieve accurate classification of unseen data cases. Secondly, strategies for input variable ranking within multi-label data are considered to allow for more interpretable results. Effective exploitation of correlation amongst labels can be a vital attribute of an accurate multi-label classification method. However, label correlations are not necessarily shared globally by all data cases. Despite this, existing methods mostly focus on global exploitation of label correlations. Therefore, a new tree-based ensemble method for multi-label classification is proposed in this dissertation, Label-Dependent splitting (LDsplit). LDsplit aims to implicitly exploit local higher-order label correlations within multi-label data by dividing the data into subgroups. The algorithm fits an ensemble of trees based on differently ordered label subsets. For each tree, different labels are used at different levels of the tree, as determined by the label order applicable to that tree. The tree-levels are made up of nodes that are split using any binary classifier. 
Since a tree-level depends on its label as well as previous splits made when parent nodes were formed using other labels, higher-order label correlations are implicitly incorporated into the model in a simple manner. Depending on whether random or predetermined label orders are used to fit the ensemble, either Random LDsplit or Conditional LDsplit is fit. An extensive empirical study is performed on a range of multi-label benchmark datasets. The empirical evidence shows that despite the simple framework, both Random LDsplit and Conditional LDsplit offer very competitive classification performance in comparison with existing multi-label classification methods. For multi-label data, an input variable is globally important if it is deemed important for several or all labels. However, an input variable can also be deemed locally important for a specific label. Few proposals for input variable ranking within multi-label data consider both global and local importance of variables. Moreover, existing methods mostly neglect to exploit label dependencies within the data. Therefore, different ways are outlined how an LDsplit ensemble can produce global and local input variable rankings and effectively allow for better interpretation of the data. Results obtained from synthetically generated multi-label datasets demonstrate that both the novel global and local importance measures give favourable performance. AFRIKAANSE OPSOMMING: Multi-etiket klassifikasie probleme ontstaan in scenario’s waar elke datageval gelyktydig met verskeie etikette geassosieer kan word. In vergelyking met enkel-etiket data, beskik multi-etiket data unieke eienskappe wat tot addisionele uitdagings lei wanneer die data ontleed word. Die doelwit van hierdie skripsie is om twee van hierdie uitdagende aspekte van multi-etiket data aan te spreek. Die eerste is die benutting van etiket-korrelasies, sodat akkurate klassifikasie van ongesiene datagevalle bereik kan word. 
Tweedens word strategieë vir insetveranderlike rangskikking binne multi-etiket data beskou vir meer interpreteerbare resultate. Effektiewe benutting van korrelasie tussen etikette kan ’n noodsaaklike eienskap van ’n akkurate multi-etiket klassifikasie-metode wees. Tog word etiket-korrelasies nie noodwendig globaal deur alle datagevalle gedeel nie. Ten spyte hiervan fokus bestaande metodes meestal op globale benutting van etiket-korrelasies. Daarom word ’n nuwe boom-gebaseerde ensemble metode vir multi-etiket klassifikasie in hierdie skripsie voorgestel, naamlik Etiket-Afhanklike splitting (LDsplit). LDsplit beoog om implisiet plaaslike hoër-orde etiket-korrelasies binne multi-etiket data te benut deur die data in subgroepe te verdeel. Die algoritme pas ’n ensemble van bome gebaseer op verskillende gerangskikte etiket-subversamelings. Vir elke boom word verskillende etikette by verskillende vlakke van die boom gebruik, soos gebaseer op die toepaslike etiket-rangskikking van die boom. Die boom-vlakke bestaan uit nodusse wat verdeel word deur gebruik te maak van enige binêre-klassifiseerder. Aangesien ’n boom-vlak afhanklik is van sy etiket, sowel as vorige verdelings wat gemaak was toe voorouer-nodusse gevorm het deur van ander etikette gebruik te maak, word hoër-orde etiket-korrelasies implisiet op ’n eenvoudige manier in die model geïnkorporeer. Afhangende of ewekansige of voorafbepaalde etiket-rangskikkings gebruik word om die ensemble te pas, word of Ewekansige LDsplit of Voorwaardelike LDsplit gepas. ’n Omvangryke empiriese studie word uitgevoer op ’n reeks standaard multi-etiket datastelle. Die empiriese resultate dui daarop dat, ten spyte van die eenvoudige raamwerk, beide Ewekansige LDsplit en Voorwaardelike LDsplit baie kompeterende klassifikasie-prestasie bied in vergelyking met bestaande multi-etiket klassifikasie-metodes. Vir multi-etiket data is ’n inset-veranderlike globaal belangrik as dit belangrik geag word vir verskeie of alle etikette. 
Dog kan ’n inset-veranderlike ook lokaal belangrik geag word vir ’n spesifieke etiket. Min strategieë vir inset-veranderlike rangskikking binne multi-etiket data beskou beide globale en lokale belangrikheid van inset-veranderlikes. Bowendien, vir die bestaande metodes word die benutting van etiket-afhanklikheid binne die data meestal uitgelaat. Derhalwe word verskillende maniere uiteengesit hoe ’n LDsplit ensemble globale en lokale inset-veranderlike rangskikkings kan genereer en sodoende beter interpretasie van die data toelaat. Resultate verkry op grond van sintetiese gegenereerde multi-etiket datastelle wys dat beide die nuwe globale en lokale belangrikheids-maatstawwe goed presteer. Doctorate 2023-10-17T07:54:42Z 2024-01-08T15:07:31Z 2023-10-17T07:54:42Z 2024-01-08T15:07:31Z 2023-12 Thesis https://scholar.sun.ac.za/handle/10019.1/128900 en_ZA Stellenbosch University xviii, 354 pages : illustrations, includes annexures application/pdf Stellenbosch : Stellenbosch University
spellingShingle Supervised learning (Machine learning)
Predictive control
Artificial Intelligence
Computational intelligence
Regression analysis -- Data processing
Mathematical statistics
UCTD
Muller, Annegret
Label-dependent splitting for multi-label data
title Label-dependent splitting for multi-label data
title_full Label-dependent splitting for multi-label data
title_fullStr Label-dependent splitting for multi-label data
title_full_unstemmed Label-dependent splitting for multi-label data
title_short Label-dependent splitting for multi-label data
title_sort label dependent splitting for multi label data
topic Supervised learning (Machine learning)
Predictive control
Artificial Intelligence
Computational intelligence
Regression analysis -- Data processing
Mathematical statistics
UCTD
url https://scholar.sun.ac.za/handle/10019.1/128900
work_keys_str_mv AT mullerannegret labeldependentsplittingformultilabeldata