Visualising data through biplots using Categorical PCA and clustering

Thesis (MCom)--Stellenbosch University, 2022.

Bibliographic Details
Main Author: Van Dyk, Wilmari
Other Authors: Van der Merwe, Carel
Format: Thesis
Language: en_ZA
Published: Stellenbosch : Stellenbosch University, 2022
Subjects: Cluster analysis; Principal Component Analysis; Multivariate analysis; UCTD
_version_ 1867614112247185408
access_status_str Open Access
author Van Dyk, Wilmari
author2 Van der Merwe, Carel
author_browse Van Dyk, Wilmari
Van der Merwe, Carel
author_facet Van der Merwe, Carel
Van Dyk, Wilmari
author_sort Van Dyk, Wilmari
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MCom)--Stellenbosch University, 2022.
format Thesis
id oai:scholar.sun.ac.za:10019.1/126109
institution Stellenbosch University (South Africa)
language en_ZA
last_indexed 2026-06-10T12:46:51.765Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2022
publishDateRange 2022
publishDateSort 2022
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/126109 Visualising data through biplots using Categorical PCA and clustering Van Dyk, Wilmari Van der Merwe, Carel Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. Cluster analysis Principal Component Analysis Multivariate analysis UCTD Thesis (MCom)--Stellenbosch University, 2022. ENGLISH SUMMARY: Handling large data sets has become an everyday occurrence, and the need to process and interpret data efficiently has increased tremendously over the last few years. The easiest way to interpret data quickly is to have a visual representation of the data. Since data is often multidimensional, the use of biplots has become more frequent. A biplot is a tool that allows multidimensional data to be displayed on a two- or three-dimensional graph. The first step in constructing such a plot is to apply a dimension reduction technique that transforms a data set from a high-dimensional space to a lower-dimensional space. Depending on the type of data to be transformed, the most often used dimension reduction techniques are principal component analysis (PCA) for continuous data and multiple correspondence analysis (MCA) for categorical data. When conducting unsupervised learning, inferences need to be made about the relationships among the different variables in a data set. Clustering is very useful for this purpose. Various clustering techniques can be used, depending on the type of data to be analysed. More specifically, reduced k-means or factorial k-means can be used for continuous data, and MCA k-means, cluster correspondence analysis, and iterative factorial clustering are often used for categorical data.
The purpose of this assignment is to develop an R function that applies dimension reduction and clustering techniques to categorical data, transforming the data so that it can be represented on a biplot and inferences can be made regarding certain relationships within the data. Categorical PCA will be used as the dimension reduction technique to transform the data from a higher dimension to a lower dimension. Since Categorical PCA assigns scores to the category levels by focusing on individual categories, the categories become numerical, which means they can be displayed on straight-line axes. While the dimension reduction takes place, the function will also attempt to cluster the data using either reduced k-means or factorial k-means. After the data is transformed, it can be displayed on a biplot with many additional features to enhance the biplot. AFRIKAANS SUMMARY (translated from Afrikaans): Handling large data sets has become an everyday occurrence, and the need for efficient processing and interpretation of data has increased tremendously over the past few years. The easiest way to interpret data quickly is to have a visual representation of the data. Since data is often multidimensional, the use of biplots has become more common. Biplots are a technique that makes it possible to represent multidimensional data on a two- or three-dimensional graph. The first step is to apply some dimension reduction technique to transform the data from a high-dimensional space to a lower-dimensional space. Depending on the type of data to be transformed, the most common dimension reduction techniques are principal component analysis for numerical data or multiple correspondence analysis for categorical data. When working with data that does not have an independent variable, inferences must be made about the relationships among the different variables.
Clustering is very useful for this purpose. There are various clustering techniques that can be used to cluster data, depending on the type of data. More specifically, for continuous data, reduced k-means or factorial k-means can be used. For categorical data, multiple correspondence analysis k-means, cluster correspondence analysis, and iterative factorial clustering can be used. The purpose of this assignment is to develop an R function that can apply a dimension reduction and clustering technique to categorical data in order to transform the data so that it can be represented on a biplot. Certain inferences can then be made about possible relationships within the data. Categorical principal component analysis will be used as a dimension reduction technique to transform the data from a higher dimension to a lower dimension. Since categorical principal component analysis assigns scores to the category levels by focusing on individual categories, the categories become numerical, which means the data can be represented on straight-line axes. While the dimension reduction takes place, the function will also attempt to cluster the data using either reduced k-means or factorial k-means. After the data has been transformed, it can be represented on a biplot with many additional features to improve the presentation of the biplot. Masters 2022-11-21T15:12:53Z 2023-01-16T12:50:31Z 2022-11-21T15:12:53Z 2023-01-16T12:50:31Z 2022-12 Thesis http://hdl.handle.net/10019.1/126109 en_ZA Stellenbosch University xii, 88 pages : illustrations, includes annexures application/pdf Stellenbosch : Stellenbosch University
spellingShingle Cluster analysis
Principal Component Analysis
Multivariate analysis
UCTD
Van Dyk, Wilmari
Visualising data through biplots using Categorical PCA and clustering
title Visualising data through biplots using Categorical PCA and clustering
topic Cluster analysis
Principal Component Analysis
Multivariate analysis
UCTD
url http://hdl.handle.net/10019.1/126109