Thesis (MSc)--Stellenbosch University, 2025.
| Main Author: | Naidoo, Lorensha |
|---|---|
| Other Authors: | Patterton, Hugh-George |
| Format: | Thesis |
| Published: | Stellenbosch : Stellenbosch University, 2026 |
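| Subjects: | Diabetes in children -- Genetic aspects; Proteomics -- Data processing; Proteins -- Analysis; Peptides -- Analysis; Spectrum analysis -- Data processing; Machine learning -- Computer simulation; Bayesian statistical decision theory -- Computer simulation; UCTD |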
| _version_ | 1867614136065589248 |
|---|---|
| access_status_str | Open Access |
| author | Naidoo, Lorensha |
| author2 | Patterton, Hugh-George |
| author_browse | Naidoo, Lorensha Patterton, Hugh-George |
| author_facet | Patterton, Hugh-George Naidoo, Lorensha |
| author_sort | Naidoo, Lorensha |
| collection | Thesis |
| dc_rights_str_mv | Stellenbosch University |
| description | Thesis (MSc)--Stellenbosch University, 2025. |
| format | Thesis |
| id | oai:scholar.sun.ac.za:10019.1/134730 |
| institution | Stellenbosch University (South Africa) |
| last_indexed | 2026-06-10T12:47:14.419Z |
| license_str | Other — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository |
| publishDate | 2026 |
| publishDateRange | 2026 |
| publishDateSort | 2026 |
| publisher | Stellenbosch : Stellenbosch University |
| publisherStr | Stellenbosch : Stellenbosch University |
| record_format | dspace |
| source_str | SUNScholar — Stellenbosch University Repository |
| spelling | oai:scholar.sun.ac.za:10019.1/134730 Assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics Naidoo, Lorensha Patterton, Hugh-George Vaudel, Marc Stellenbosch University. Faculty of Science. Centre for Bioinformatics & Computational Biology. Diabetes in children -- Genetic aspects Proteomics -- Data processing Proteins -- Analysis Peptides -- Analysis Spectrum analysis -- Data processing Machine learning -- Computer simulation Bayesian statistical decision theory -- Computer simulation UCTD Thesis (MSc)--Stellenbosch University, 2025. Naidoo, L. 2025. Assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/fa2ce23c-42d1-4183-b3ba-b969d378e482 ENGLISH ABSTRACT: Monogenic diabetes is a rare form of paediatric diabetes caused by a pathogenic variant occurring in a single gene associated with insulin production from pancreatic β-cells, resulting in hyperglycaemic complications. Maturity-onset diabetes of the young (MODY) is a subtype of monogenic diabetes, accounting for 2 – 5% of diabetic cases. Patient classification has revealed undiagnosed symptomatic cohorts speculated to result from unknown genetic variants. Identifying these variants is essential for advancing precision medicine and providing specialised medical care. Proteogenomics allows for the identification of alternative forms of proteins resulting from genomic variation. Protein samples are processed using mass spectrometry to obtain peptide sequences that are then annotated to a sequence database using search engines, e.g. SEQUEST and X!Tandem. Peptide-spectrum matches (PSMs) with varying confidence scores are produced; however, there is no clear distinction between correct and incorrect PSMs. 
Additionally, distinguishing variant PSMs from canonical PSMs remains challenging due to the low frequency of variant PSMs and their sequence similarity to canonical PSMs. The target-decoy approach (TDA) is a common method for classifying correct and incorrect PSMs and is used in existing PSM processing tools, such as Percolator. Target sequences are peptide sequences from proteomic databases, while decoy sequences are artificially generated to serve as a null model for error rate estimation. To the author's knowledge, the TDA has not previously been applied to improving the discrimination of variant PSMs. This study therefore conducts an exploratory analysis to improve the classification of canonical and variant PSMs using the TDA. A machine learning (ML) classification pipeline named Nagilums Tree, written in Python, is designed to classify a PSM dataset consisting of target- and decoy-labelled spectra characterised by multiple scoring functions issued during annotation. ML base models are built using scikit-learn and produce a prediction probability 𝑝(-1) test statistic that is used to rank classified spectra in order of significance. Spectra ranking is necessary for statistical inference and for estimating error rate metrics in the three implementation tasks investigated in this work. First, the discrimination performance of five decision tree ensemble architectures (Random Forest, Gradient Boosting, Histogram Gradient Boosting, Extra Trees and XGBoost) is evaluated. Second, the novel concept of decoy variant PSMs is investigated by processing canonical and variant PSMs of proteomic data from pluripotent stem cells induced into pancreatic β-cells with an HNF1A (MODY3) variant, and its performance is compared to the classical TDA. Lastly, the two decoy strategies are compared in their ability to produce statistically significant MODY-associated PSMs by exploring a novel Bayesian-inferred posterior error probability (PEP) method. 
Ultimately, Extra Trees produced the most reassuring performance and was used for the second task. The decoy variant PSM strategy improved the probability fitness of the variant PSMs; however, the PEP estimates did not identify the HNF1A variant in either decoy method. Although MODY gene expression was inconclusive, the decoy variant concept proved to be a promising starting point for future research. The influence of proteogenomic strategies, ML and statistical inference is discussed for future implementation. AFRIKAANS SUMMARY: No summary available. Masters 2026-01-05T13:50:19Z 2026-01-05T13:50:19Z 2025-12 Thesis https://scholar.sun.ac.za/handle/10019.1/134730 Stellenbosch University xiv, 143 pages : illustrations application/pdf Stellenbosch : Stellenbosch University |
| spellingShingle | Diabetes in children -- Genetic aspects Proteomics -- Data processing Proteins -- Analysis Peptides -- Analysis Spectrum analysis -- Data processing Machine learning -- Computer simulation Bayesian statistical decision theory -- Computer simulation UCTD Naidoo, Lorensha Assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics |
| title | Assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics |
| title_full | Assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics |
| title_fullStr | Assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics |
| title_full_unstemmed | Assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics |
| title_short | Assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics |
| title_sort | assessing the discoverability of variant proteins causing rare forms of paediatric diabetes using proteogenomics |
| topic | Diabetes in children -- Genetic aspects Proteomics -- Data processing Proteins -- Analysis Peptides -- Analysis Spectrum analysis -- Data processing Machine learning -- Computer simulation Bayesian statistical decision theory -- Computer simulation UCTD |
| url | https://scholar.sun.ac.za/handle/10019.1/134730 |
| work_keys_str_mv | AT naidoolorensha assessingthediscoverabilityofvariantproteinscausingrareformsofpaediatricdiabetesusingproteogenomics |
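
The target-decoy approach summarised in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis or from the Nagilums Tree pipeline: the function name `target_decoy_q_values`, the toy scores, and the choice of the simple decoys/targets ratio as the error-rate estimator are all assumptions made for illustration. The sketch ranks PSMs by a classifier score (higher taken to mean more confident), estimates the false discovery rate at each rank from the number of decoys seen so far, and converts those estimates into monotone q-values.

```python
from typing import List, Tuple

# Hypothetical helper, not part of any published tool.
def target_decoy_q_values(
    psms: List[Tuple[float, bool]],
) -> List[Tuple[float, bool, float]]:
    """Rank PSMs and attach a q-value to each.

    psms: (score, is_decoy) pairs; a higher score is assumed to mean a
    more confident match.
    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate: decoys / targets at or above each score.
    # This is one common convention; other estimators exist.
    fdrs = []
    targets = decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # q-value: the smallest FDR reachable at this rank or any lower one,
    # which makes the values non-decreasing down the ranked list.
    q_values = []
    running_min = 1.0
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, q_values)]


if __name__ == "__main__":
    # Toy data: (classifier score, is_decoy). Values are invented.
    toy_psms = [
        (0.99, False), (0.95, False), (0.90, True),
        (0.85, False), (0.60, False), (0.40, True),
    ]
    for score, is_decoy, q in target_decoy_q_values(toy_psms):
        label = "decoy" if is_decoy else "target"
        print(f"{score:.2f}  {label:6s}  q={q:.2f}")
```

In this toy ranking, only the two top-scoring target PSMs fall below a q-value threshold of 0.01, because the first decoy appears at rank three. A separate decoy set for variant PSMs, as the abstract proposes, would amount to running the same calculation on the variant subset with its own decoys.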