Full Text Available

Statistical classification in high-dimensional scenarios with a focus on microarray data sets

Thesis (MCom)--Stellenbosch University, 2017.

Bibliographic Details
Main Author:	Rodseth, Tessa Louise
Other Authors:	Steel, Sarel Johannes
Format:	Thesis
Language:	en_ZA
Published:	2017
Subjects:	Statistical classification; Nearest neighbor analysis (Statistics); Discriminant analysis; Support vector machines > Classification; Variable selection; Variables (Mathematics) > Statistical methods; UCTD

_version_	1867614139691565056
access_status_str	Open Access
author	Rodseth, Tessa Louise
author2	Steel, Sarel Johannes
author_browse	Rodseth, Tessa Louise Steel, Sarel Johannes
author_facet	Steel, Sarel Johannes Rodseth, Tessa Louise
author_sort	Rodseth, Tessa Louise
collection	Thesis
dc_rights_str_mv	Stellenbosch University
description	Thesis (MCom)--Stellenbosch University, 2017.
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/102771
institution	Stellenbosch University (South Africa)
language	en_ZA
last_indexed	2026-06-10T12:47:17.937Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2017
publishDateRange	2017
publishDateSort	2017
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/102771
Statistical classification in high-dimensional scenarios with a focus on microarray data sets
Rodseth, Tessa Louise; Steel, Sarel Johannes
Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science.
Statistical classification; Nearest neighbor analysis (Statistics); Discriminant analysis; Support vector machines -- Classification; Variable selection; Variables (Mathematics) -- Statistical methods; UCTD
Thesis (MCom)--Stellenbosch University, 2017.

ENGLISH SUMMARY: High-dimensional data analysis characterises many contemporary problems in statistics and arises in many application areas. This thesis focuses on very high-dimensional problems in which the input predictor variables are gene expression measurements in microarray studies. Accurate analysis of microarray data sets can provide new insight into cancer diagnosis using gene expression profiles and can result in breakthroughs in medical research. K-nearest neighbours (KNN), fastKNN, linear discriminant analysis (and variants thereof), nearest shrunken centroids (NSC) and support vector machines (SVMs) are investigated in this thesis as binary (and multi-class) classification procedures on microarray data sets. The thesis also addresses the important problem of eliminating redundant input variables before implementing classification procedures on high-dimensional data sets. Several variable selection and dimension reduction procedures suitable for microarray data sets are discussed, with the focus on implementing sure independence techniques, NSC and fastKNN feature engineering in the empirical study. Principal component analysis and supervised principal component analysis are implemented as the two main dimension reduction techniques in this thesis. The performance of the classification procedures is evaluated on three real and three synthetic high-dimensional microarray data sets. The comparison of the different classification methods in the empirical study led to the conclusion that SVMs are the most accurate procedure on the binary data sets considered, whilst NSC is the most accurate procedure on the multi-class data set.

AFRIKAANS SUMMARY (translated into English): High-dimensional data analysis is currently characteristic of many practical statistical problems. This thesis focuses on high-dimensional data in which the independent variables represent genetic measurements, as is typical of microarray studies. Accurate analysis of microarray data can lead to new insight into, for example, cancer diagnosis that makes use of genetic profile data. This can naturally lead to breakthroughs in medical research. The KNN technique, the so-called “fastKNN” technique, linear discriminant analysis (and variants thereof), nearest shrunken centroids (NSC) and support vector machines (SVMs) are investigated in this thesis as classification procedures for binary and multi-class microarray problems. The important problem of eliminating redundant and irrelevant variables from a high-dimensional data set before a classification procedure is applied to it is addressed in this thesis. Several variable selection and dimension reduction procedures suitable for application to microarray data sets are discussed, with the focus placed on “sure independence screening”, NSC and “fastKNN”. These receive particular attention in the empirical part of the study. Principal component analysis and supervised principal component analysis are furthermore implemented as two of the main dimension reduction techniques in this thesis. The quality of the classification procedures is evaluated on three real and three synthetic high-dimensional data sets. Comparison of the procedures in the empirical study leads to the conclusion that SVMs are the most accurate procedure for binary data sets, whilst NSC was the most accurate procedure for the multi-class data set.

Masters
2017-11-08T09:06:15Z 2017-12-11T10:52:18Z 2017-11-08T09:06:15Z 2017-12-11T10:52:18Z 2017-12
Thesis
http://hdl.handle.net/10019.1/102771
en_ZA
Stellenbosch University
xv, 266 pages ; illustrations, includes annexures
application/pdf
title	Statistical classification in high-dimensional scenarios with a focus on microarray data sets
topic	Statistical classification; Nearest neighbor analysis (Statistics); Discriminant analysis; Support vector machines -- Classification; Variable selection; Variables (Mathematics) -- Statistical methods; UCTD
url	http://hdl.handle.net/10019.1/102771

Similar Items