Statistical classification in high-dimensional scenarios with a focus on microarray data sets

Thesis (MCom)--Stellenbosch University, 2017.

Bibliographic Details
Main Author: Rodseth, Tessa Louise
Other Authors: Steel, Sarel Johannes
Format: Thesis
Language: en_ZA
Published: 2017
_version_ 1867614139691565056
access_status_str Open Access
author Rodseth, Tessa Louise
author2 Steel, Sarel Johannes
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MCom)--Stellenbosch University, 2017.
format Thesis
id oai:scholar.sun.ac.za:10019.1/102771
institution Stellenbosch University (South Africa)
language en_ZA
last_indexed 2026-06-10T12:47:17.937Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2017
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/102771 Statistical classification in high-dimensional scenarios with a focus on microarray data sets Rodseth, Tessa Louise Steel, Sarel Johannes Stellenbosch University. Faculty of Economic and Management Sciences. Dept. of Statistics and Actuarial Science. Statistical classification Nearest neighbor analysis (Statistics) Discriminant analysis Support vector machines -- Classification Variable selection Variables (Mathematics) -- Statistical methods UCTD Thesis (MCom)--Stellenbosch University, 2017.

ENGLISH SUMMARY : High-dimensional data analysis characterises many contemporary problems in statistics and arises in many application areas. This thesis focuses on very high-dimensional problems in which the input predictor variables are gene expression measurements from microarray studies. Accurate analysis of microarray data sets can provide new insight into cancer diagnosis using gene expression profiles and can result in breakthroughs in medical research. K-nearest neighbours (KNN), fastKNN, linear discriminant analysis (and variants thereof), nearest shrunken centroids (NSC) and support vector machines (SVMs) are investigated as binary and multi-class classification procedures on microarray data sets. The thesis also addresses the important problem of eliminating redundant input variables before classification procedures are applied to high-dimensional data sets. Several variable selection and dimension reduction procedures suitable for microarray data sets are discussed, with the empirical study focusing on sure independence screening, NSC and fastKNN feature engineering. Principal component analysis and supervised principal component analysis are implemented as the two main dimension reduction techniques. The performance of the classification procedures is evaluated on three real and three synthetic high-dimensional microarray data sets. The comparison of the classification methods in the empirical study led to the conclusion that SVMs are the most accurate procedure on the binary data sets considered, whilst NSC is the most accurate procedure on the multi-class data set.

AFRIKAANS SUMMARY : High-dimensional data analysis is currently characteristic of many practical statistical problems. In this thesis the focus is on high-dimensional data in which the independent variables represent genetic measurements, as is typical of microarray studies. Accurate analysis of microarray data can lead to new insight into, for example, the diagnosis of cancer where gene expression profile data are used. This can in turn lead to breakthroughs in medical research. The KNN technique, the so-called "fastKNN" technique, linear discriminant analysis (and variants thereof), nearest shrunken centroids (NSC) and support vector machines (SVMs) are investigated in this thesis as classification procedures for binary and multi-class microarray problems. The important problem of eliminating redundant and irrelevant variables from a high-dimensional data set before a classification procedure is applied to it is addressed in this thesis. Several variable selection and dimension reduction procedures suitable for microarray data sets are discussed, with the focus placed on "sure independence screening", NSC and "fastKNN". These receive particular attention in the empirical part of the study. Principal component analysis and supervised principal component analysis are further implemented as two of the main dimension reduction techniques in this thesis. The quality of the classification procedures is evaluated on three real and three synthetic high-dimensional data sets. Comparison of the procedures in the empirical study leads to the conclusion that SVMs are the most accurate procedure for binary data sets, while NSC was the most accurate procedure for the multi-class data set.

Masters 2017-11-08T09:06:15Z 2017-12-11T10:52:18Z 2017-11-08T09:06:15Z 2017-12-11T10:52:18Z 2017-12 Thesis http://hdl.handle.net/10019.1/102771 en_ZA Stellenbosch University xv, 266 pages ; illustrations, includes annexures application/pdf
title Statistical classification in high-dimensional scenarios with a focus on microarray data sets
topic Statistical classification
Nearest neighbor analysis (Statistics)
Discriminant analysis
Support vector machines -- Classification
Variable selection
Variables (Mathematics) -- Statistical methods
UCTD
url http://hdl.handle.net/10019.1/102771