Exploring topological data analysis in gene expression data: topology-driven biomarker discovery and clinical outcome prediction in oncology

Bibliographic Details
Main Author: Nyase, Ndivhuwo
Other Authors: Mashatola, Lebohang
Format: Thesis
Language: English
Published: Department of Statistical Sciences 2026
Subjects: Oncology; Topology-driven biomarker
_version_ 1867613166336212992
access_status_str Open Access
collection Thesis
description This thesis is grounded in the fundamental observation that biological data has shape and this shape matters. Beneath the high-dimensional, often noisy landscape of gene expression profiles lie hidden topological structures (connected components, loops and voids) that capture the complex relationships driving cancer development and progression. By embracing this perspective, we position Topological Data Analysis (TDA) and persistent homology at the core of a novel analytical framework designed to tackle two key challenges in cancer research: clinical outcome prediction and biomarker discovery. In this study, we employ Weighted Gene Topological Data Analysis (WGTDA) to extract topological features from gene expression data, which serve as prognostic biomarkers for cancer classification, staging, and treatment response. Moreover, by integrating these topological features with machine learning models, we aim to enhance predictive accuracy for clinical outcomes. For clinical outcome prediction, we transformed gene expression profiles into topological fingerprints using multiple co-expression measures, namely Pearson Correlation, Distance Correlation, and Weighted Topological Overlap (wTO) computed with both Pearson and Distance-based adjacencies. These topological features were analyzed using Random Forests. In parallel, we compared the predictive performance of traditional machine learning models (SVM, Gradient Boosting Decision Trees, Random Forest, and Neural Networks) trained on raw gene expression data against models incorporating the topological fingerprints. This comparative analysis was conducted across three classification tasks: cancer type (using TCGA-SARC, TCGA-PCPG, and TCGA-ESCA datasets), cancer staging (using TCGA-HNSC for stages I–IV), and treatment response (responders vs. non-responders).
For biomarker identification, the same three tasks were applied using the best-performing co-expression measure to generate a global topological representation of the patient population. This provided a disease-level view, highlighting shared homological patterns to facilitate biomarker discovery. Additionally, a dedicated visualization tool has been developed to aid in interpreting these topological signatures and identifying critical biomarkers. The tool is available at https://nnyase.github.io/MSc-Thesis/. WGTDA significantly enhanced phenotype prediction tasks by overcoming common pitfalls of traditional ML models in RNA-Seq data, such as overfitting and poor handling of class imbalance. TDA-derived features improved generalizability of ML models in tasks such as cancer staging and treatment response prediction. Our findings strongly support the integration of TDA into clinical outcome prediction, demonstrating its value in capturing nuanced patterns that allow ML methods to learn more effectively. Moreover, WGTDA remarkably identified key gene signatures for cancer type, staging, and treatment response without relying on pre-existing biological assumptions, yielding biomarkers that are strongly supported by the existing literature. These results underscore the method's reliability and potential clinical utility in precision oncology.
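The abstract describes turning gene expression profiles into topological fingerprints from a co-expression measure. The sketch below illustrates that general idea only; it is not the thesis's WGTDA implementation. It assumes a Pearson-based distance of 1 - |r| between genes and computes only 0-dimensional persistent homology (the scales at which connected components merge) with a union-find pass over sorted edges. The function names `pearson_distance` and `h0_death_times` are hypothetical and do not come from the record.

```python
import numpy as np


def pearson_distance(expr):
    """Gene-gene distance 1 - |r| from a samples x genes expression array.

    The 1 - |r| transform is an assumption made for this sketch.
    """
    corr = np.corrcoef(expr, rowvar=False)  # genes x genes Pearson r
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    return dist


def h0_death_times(dist):
    """Death times of connected components in a Vietoris-Rips filtration.

    Every gene is born as its own component at scale 0. Scanning edges in
    order of increasing distance, each edge that joins two different
    components kills one of them at that distance (Kruskal's algorithm).
    """
    n = dist.shape[0]
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    rows, cols = np.triu_indices(n, k=1)
    order = np.argsort(dist[rows, cols], kind="stable")
    deaths = []
    for idx in order:
        a, b = find(rows[idx]), find(cols[idx])
        if a != b:
            parent[a] = b
            deaths.append(float(dist[rows[idx], cols[idx]]))
            if len(deaths) == n - 1:
                break
    return np.array(deaths)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    expr = rng.normal(size=(30, 6))  # synthetic data: 30 samples, 6 genes
    expr[:, 1] = expr[:, 0] + 0.05 * rng.normal(size=30)  # co-expressed pair
    print(h0_death_times(pearson_distance(expr)))
```

A vector like this (or summary statistics of it) could be passed to a classifier such as a Random Forest. Loops and voids, which the abstract also mentions, require higher-dimensional homology and a dedicated persistent homology library.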
format Thesis
id oai:open.uct.ac.za:11427/42574
institution University of Cape Town (South Africa)
language English
eng
last_indexed 2026-06-10T12:31:48.735Z
license_str Not specified — see source repository
provenance_str_mv Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate 2026
publisher Department of Statistical Sciences
record_format dspace
source_str UCTD — University of Cape Town Open Access Repository
spelling oai:open.uct.ac.za:11427/42574 Nyase, Ndivhuwo Mashatola, Lebohang Muller, Julia Sinkala, Musalula Oncology Topology-driven biomarker 2026-01-14T11:31:04Z 2026-01-14T11:31:04Z 2025 2026-01-14T11:06:01Z Thesis / Dissertation Masters MSc http://hdl.handle.net/11427/42574 en eng application/pdf Department of Statistical Sciences Faculty of Science University of Cape Town
thesis_degree_str Master's
title Exploring topological data analysis in gene expression data: topology-driven biomarker discovery and clinical outcome prediction in oncology
topic Oncology
Topology-driven biomarker
url http://hdl.handle.net/11427/42574