Feature selection in gene expression data for disease progression prediction

Thesis (MEng)--Stellenbosch University, 2022.

Bibliographic Details
Main Author: Kritzinger, Daniel Strauss
Other Authors: Nieuwoudt, M. J.
Format: Thesis
Language: en_ZA
Published: Stellenbosch : Stellenbosch University 2022
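Subjects: Gene expression; Prognostic gene signatures; Gene expression data; Mycobacterium tuberculosis; UCTD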
_version_ 1867614138285424640
access_status_str Open Access
author Kritzinger, Daniel Strauss
author2 Nieuwoudt, M. J.
author_browse Kritzinger, Daniel Strauss
Nieuwoudt, M. J.
author_facet Nieuwoudt, M. J.
Kritzinger, Daniel Strauss
author_sort Kritzinger, Daniel Strauss
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MEng)--Stellenbosch University, 2022.
format Thesis
id oai:scholar.sun.ac.za:10019.1/124605
institution Stellenbosch University (South Africa)
language en_ZA
last_indexed 2026-06-10T12:47:16.314Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2022
publishDateRange 2022
publishDateSort 2022
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/124605 Feature selection in gene expression data for disease progression prediction Kritzinger, Daniel Strauss Nieuwoudt, M. J. Tromp, G. C. Stellenbosch University. Faculty of Engineering. Dept. of Mechanical and Mechatronic Engineering. Gene expression Prognostic gene signatures Gene expression data Mycobacterium tuberculosis UCTD Thesis (MEng)--Stellenbosch University, 2022.
ENGLISH SUMMARY: Gene-expression datasets, whether microarray or RNA-seq, play a major role in disease identification and, more recently, in disease progression identification. The high dimensionality of these datasets, however, makes classification challenging because many of the features are irrelevant or redundant. This problem therefore necessitates the use of feature selection methods. This project looks specifically at the identification of prognostic gene signatures for latent to active tuberculosis progression from nested case-control study datasets, namely the Grand Challenge 6-74 (GC6-74) longitudinal HIV-negative African cohort of exposed household contacts and the Cape Town region-specific adolescent cohort from the adolescent cohort study (ACS). The project proposes a hybrid 2-stage feature selection framework and evaluates various components for each phase, as well as the preprocessing steps, to identify the ideal solution. The outcome of this process is Boruta-RFE, which, as the name states, consists of the Boruta algorithm in the first phase, for the removal of ‘irrelevant’ features, and the random forest (RF) based recursive feature elimination procedure in the second phase, to identify the ‘optimal’, highly relevant and non-redundant feature set from the features identified in the first phase.
The 5-gene signatures identified by Boruta-RFE on the GC6-74 and ACS datasets both deliver promising results, with the GC6-74 signature producing a top sensitivity of 60.9% and specificity of 82.1%, and the ACS signature producing a top sensitivity of 62.9% and specificity of 84.6%. These results are generated from the full test set of each dataset; once the test sets are broken down into temporal groupings of 6 months, it becomes evident that case identification improves further closer to diagnosis. The GC6-74 signature produced a top sensitivity of 72.7% for case identification within 6 months, and the ACS signature produced a top sensitivity of 100% for case identification within 6 months.
AFRIKAANS SUMMARY (in English translation): Gene-expression datasets, whether ‘Microarray’ or ‘RNA-seq’, play a major role in disease identification and, more recently, in disease progression identification. The high dimension of these datasets, however, makes classification challenging because of the presence of many irrelevant and redundant features. This problem therefore requires the use of feature selection methods. This project looks specifically at the identification of prognostic gene signatures for latent to active tuberculosis progression from nested case-control study datasets, i.e., the Grand Challenge 6-74 (GC6-74) longitudinal HIV-negative African cohort of exposed household contacts, as well as the Cape Town region-specific adolescent cohort, from the adolescent cohort study (ACS). The project proposes the use of a hybrid 2-phase feature selection framework, and evaluates various components for each phase, as well as the preprocessing steps, to identify the ideal solution. The outcome of this process is Boruta-RFE, which, as described in the name, consists of the Boruta algorithm in the first phase, which is specifically important for the removal of ‘irrelevant’ features, and the ‘Random Forest’ based recursive feature elimination procedure in the second phase, to identify the ‘optimal’ highly relevant and non-redundant feature set from the features identified in the first phase. The 5-gene signatures identified by Boruta-RFE on the GC6-74 and ACS datasets both deliver promising results, with the GC6-74 signature delivering a top sensitivity of 60.9% and specificity of 82.1%, and the ACS signature a top sensitivity of 62.9% and specificity of 84.6%. These results, however, are generated from the full test sets of each dataset, but once they are broken up into temporal groups of 6 months, it becomes clear that the ability to identify cases improves even further closer to diagnosis. The GC6-74 signature delivers a top sensitivity of 72.7% for case identification within 6 months, and the ACS signature delivers a top sensitivity of 100% for case identification within 6 months.
Masters 2022-03-01T13:49:05Z 2022-04-29T09:21:57Z 2022-03-01T13:49:05Z 2022-04-29T09:21:57Z 2022-04 Thesis http://hdl.handle.net/10019.1/124605 en_ZA Stellenbosch University xv, 91 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle Gene expression
Prognostic gene signatures
Gene expression data
Mycobacterium tuberculosis
UCTD
Kritzinger, Daniel Strauss
Feature selection in gene expression data for disease progression prediction
title Feature selection in gene expression data for disease progression prediction
title_full Feature selection in gene expression data for disease progression prediction
title_fullStr Feature selection in gene expression data for disease progression prediction
title_full_unstemmed Feature selection in gene expression data for disease progression prediction
title_short Feature selection in gene expression data for disease progression prediction
title_sort feature selection in gene expression data for disease progression prediction
topic Gene expression
Prognostic gene signatures
Gene expression data
Mycobacterium tuberculosis
UCTD
url http://hdl.handle.net/10019.1/124605
work_keys_str_mv AT kritzingerdanielstrauss featureselectioningeneexpressiondatafordiseaseprogressionprediction
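
The summary in this record reports sensitivity and specificity, both over the full test sets and for cases within 6 months of diagnosis. As a minimal illustration of how such figures are computed, the Python sketch below may help. It is not the thesis's code: the function names, the convention that window 0 covers samples taken less than 6 months before diagnosis, and every count and record are hypothetical and not taken from the thesis.

```python
from collections import defaultdict


def sensitivity(tp, fn):
    """Fraction of true cases (progressors) that the classifier flags."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true controls (non-progressors) that the classifier clears."""
    return tn / (tn + fp)


def sensitivity_by_window(cases, window_months=6):
    """Sensitivity per time-to-diagnosis window.

    `cases` holds (months_to_diagnosis, predicted_positive) pairs for true
    cases only. Window 0 covers [0, 6) months, window 1 covers [6, 12), etc.
    This windowing convention is an assumption made for illustration.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for months, predicted_positive in cases:
        window = int(months // window_months)
        totals[window] += 1
        hits[window] += int(predicted_positive)
    return {w: hits[w] / totals[w] for w in sorted(totals)}


# Hypothetical confusion-matrix counts, NOT results from the thesis.
print(sensitivity(tp=8, fn=2))
print(specificity(tn=15, fp=5))

# Hypothetical case records: (months before diagnosis, flagged by the model).
hypothetical_cases = [(2, True), (5, True), (4, False), (9, True), (11, False)]
print(sensitivity_by_window(hypothetical_cases))
```

Grouping by window only needs the true cases, because sensitivity does not depend on the controls; specificity would be computed separately from the control samples.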