Assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics

Thesis (MSc)--Stellenbosch University, 2025.

Bibliographic Details
Main Author: Naidoo, Lorensha
Other Authors: Patterton, Hugh-George
Format: Thesis
Published: Stellenbosch : Stellenbosch University 2026
Subjects:
_version_ 1867614136065589248
access_status_str Open Access
author Naidoo, Lorensha
author2 Patterton, Hugh-George
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MSc)--Stellenbosch University, 2025.
format Thesis
id oai:scholar.sun.ac.za:10019.1/134730
institution Stellenbosch University (South Africa)
last_indexed 2026-06-10T12:47:14.419Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2026
publisher Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/134730
Assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics
Naidoo, Lorensha
Patterton, Hugh-George
Vaudel, Marc
Stellenbosch University. Faculty of Science. Centre for Bioinformatics & Computational Biology.
Diabetes in children -- Genetic aspects
Proteomics -- Data processing
Proteins -- Analysis
Peptides -- Analysis
Spectrum analysis -- Data processing
Machine learning -- Computer simulation
Bayesian statistical decision theory -- Computer simulation
UCTD
Thesis (MSc)--Stellenbosch University, 2025.
Naidoo, L. 2025. Assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics. Unpublished master's thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/fa2ce23c-42d1-4183-b3ba-b969d378e482
ENGLISH ABSTRACT: Monogenic diabetes is a rare form of paediatric diabetes caused by a pathogenic variant in a single gene associated with insulin production by pancreatic β-cells, resulting in hyperglycaemic complications. Maturity-onset diabetes of the young (MODY) is a subtype of monogenic diabetes, accounting for 2–5% of diabetes cases. Patient classification has revealed undiagnosed symptomatic cohorts that are speculated to result from unknown genetic variants. Identifying these variants is essential for advancing precision medicine and providing specialised medical care. Proteogenomics allows for the identification of alternative forms of proteins resulting from genomic variation. Protein samples are processed using mass spectrometry to obtain peptide sequences, which are then annotated against a sequence database using search engines, e.g. SEQUEST and X!Tandem. Peptide-spectrum matches (PSMs) with varying confidence scores are produced; however, there is no clear distinction between correct and incorrect PSMs.
Additionally, distinguishing variant PSMs from canonical PSMs remains challenging due to their low frequency and sequence similarity. The target-decoy approach (TDA) is a common method for classifying correct and incorrect PSMs and is used in existing PSM processing tools, such as Percolator. Target sequences are peptide sequences from proteomic databases, while decoy sequences are artificially generated to serve as a null model for error rate estimation. To the best of the author's knowledge, the TDA has not been applied to improving the discrimination of variant PSMs. To this end, this study conducts an exploratory analysis to improve the classification of canonical and variant PSMs using the TDA. To achieve this, a machine learning (ML) classification pipeline named Nagilums Tree, written in Python, is designed to classify a PSM dataset consisting of target- and decoy-labelled spectra characterised by multiple scoring functions issued during annotation. ML base models are built using Scikit-learn and produce a prediction probability 𝑝(-1) test statistic that is used to rank classified spectra in order of significance. Spectra ranking is necessary for statistical inference and for estimating error rate metrics in the three implementation tasks investigated in this work. First, the discrimination performance of five decision tree ensemble architectures (Random Forest, Gradient Boosting, Histogram Gradient Boosting, Extra Trees and XGBoost) is evaluated. Second, the novel concept of decoy variant PSMs is investigated by processing canonical and variant PSMs of proteomic data from pluripotent stem cells induced into pancreatic β-cells with an HNF1A (MODY3) variant, and its performance is compared to the classical TDA. Lastly, the two decoy strategies are compared in their ability to produce statistically significant MODY-associated PSMs by exploring a novel Bayesian-inferred posterior error probability (PEP) method.
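As an illustration of how ranked target and decoy PSMs can yield an error rate estimate, the following is a minimal sketch of a classical target-decoy q-value calculation. It is not taken from the thesis or from Nagilums Tree: the function name `target_decoy_qvalues`, the example scores, and the simple decoys/targets ratio are assumptions chosen for illustration, and the scores merely stand in for a classifier's prediction probability.

```python
from typing import List, Tuple

# A PSM is represented here as (score, is_decoy); higher scores are better.
# This representation is hypothetical and chosen only for the sketch.
PSM = Tuple[float, bool]


def target_decoy_qvalues(psms: List[PSM]) -> List[Tuple[float, bool, float]]:
    """Rank PSMs by score and attach a q-value to each one.

    The FDR at a given rank is estimated as
    (#decoys at or above this rank) / (#targets at or above this rank),
    capped at 1.0. The q-value is the smallest FDR among this rank and all
    lower-scoring ranks, which makes it non-decreasing down the list.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    fdrs: List[float] = []
    targets = 0
    decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # max(targets, 1) avoids division by zero when decoys rank first.
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # Walk from the worst score to the best, keeping the running minimum.
    qvalues: List[float] = []
    running_min = 1.0
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append(running_min)
    qvalues.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, qvalues)]


# Made-up scores standing in for classifier probabilities.
example_psms: List[PSM] = [
    (0.99, False),
    (0.95, False),
    (0.90, True),
    (0.85, False),
    (0.40, True),
    (0.35, False),
]
results = target_decoy_qvalues(example_psms)

# Target PSMs accepted at a 1% q-value threshold.
accepted = [r for r in results if not r[1] and r[2] <= 0.01]
```

In this toy list only the two top-scoring target PSMs pass the 1% threshold, because the first decoy appears at the third rank. Real pipelines typically add corrections (for example for unequal target and decoy database sizes) that this sketch omits.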
Ultimately, Extra Trees produced the most reassuring performance and was used for the second task. The decoy variant PSM strategy improved the probability fitness of the variant PSMs; however, the PEP estimates did not identify the HNF1A variant with either decoy method. Although MODY gene expression was inconclusive, the decoy variant concept proved to be a promising starting point for future research. The influence of proteogenomic strategies, ML and statistical inference is discussed for future implementation.
AFRIKAANS SUMMARY: No summary available.
Masters
2026-01-05T13:50:19Z
2026-01-05T13:50:19Z
2025-12
Thesis
https://scholar.sun.ac.za/handle/10019.1/134730
Stellenbosch University
xiv, 143 pages : illustrations
application/pdf
Stellenbosch : Stellenbosch University
title Assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics
topic Diabetes in children -- Genetic aspects
Proteomics -- Data processing
Proteins -- Analysis
Peptides -- Analysis
Spectrum analysis -- Data processing
Machine learning -- Computer simulation
Bayesian statistical decision theory -- Computer simulation
UCTD
url https://scholar.sun.ac.za/handle/10019.1/134730