This thesis explores machine learning applications for enhancing viral recombination detection. Using SANTA-SIM-generated viral evolution data, multiple computational approaches were developed and evaluated against existing methods in the Recombination Detection Program (RDP5). The study trained and...
| Main Author: | Cullinan, Joshua |
|---|---|
| Other Authors: | Martin, Darrin |
| Format: | Thesis |
| Language: | English |
| Published: | Computational Biology Division, 2025 |
| Subjects: | Medicine |
| _version_ | 1867613312987955200 |
|---|---|
| access_status_str | Open Access |
| author | Cullinan, Joshua |
| author2 | Martin, Darrin |
| author_browse | Cullinan, Joshua Martin, Darrin |
| author_facet | Martin, Darrin Cullinan, Joshua |
| author_sort | Cullinan, Joshua |
| collection | Thesis |
| description | This thesis explores machine learning applications for enhancing viral recombination detection. Using SANTA-SIM-generated viral evolution data, multiple computational approaches were developed and evaluated against existing methods in the Recombination Detection Program (RDP5). The study trained and tested several models, including logistic regression, gradient boosting, random forests and neural networks, on a dataset of 491 124 sequences. A novel neural network architecture employing position selection achieved the highest performance with a weighted Area Under Curve (AUC) of 0.784, surpassing RDP5's baseline AUC of 0.739. The gradient boosting classifier demonstrated strong results with an AUC of 0.765, whilst the binary neural network achieved 0.764. Performance evaluation focused on precision, recall and F1-scores to address the inherent class imbalance between recombinant and parental sequences. The models demonstrated modest performance in detecting recombinants (precision 0.627-0.687, recall 0.652-0.686). These improvements, though incremental, represent progress in automated recombination detection. The successful preliminary integration of the logistic regression model into RDP5 demonstrates the practical applicability of these approaches. This work provides a foundation for enhancing viral recombination detection through machine learning, whilst highlighting areas requiring further development to achieve more substantial improvements in detection accuracy. |
| format | Thesis |
| id | oai:open.uct.ac.za:11427/42113 |
| institution | University of Cape Town (South Africa) |
| language | English eng |
| last_indexed | 2026-06-10T12:34:08.683Z |
| license_str | Not specified — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository |
| publishDate | 2025 |
| publishDateRange | 2025 |
| publishDateSort | 2025 |
| publisher | Computational Biology Division |
| publisherStr | Computational Biology Division |
| record_format | dspace |
| source_str | UCTD — University of Cape Town Open Access Repository |
| spelling | oai:open.uct.ac.za:11427/42113 Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification Cullinan, Joshua Martin, Darrin Medicine This thesis explores machine learning applications for enhancing viral recombination detection. Using SANTA-SIM-generated viral evolution data, multiple computational approaches were developed and evaluated against existing methods in the Recombination Detection Program (RDP5). The study trained and tested several models, including logistic regression, gradient boosting, random forests and neural networks, on a dataset of 491 124 sequences. A novel neural network architecture employing position selection achieved the highest performance with a weighted Area Under Curve (AUC) of 0.784, surpassing RDP5's baseline AUC of 0.739. The gradient boosting classifier demonstrated strong results with an AUC of 0.765, whilst the binary neural network achieved 0.764. Performance evaluation focused on precision, recall and F1-scores to address the inherent class imbalance between recombinant and parental sequences. The models demonstrated modest performance in detecting recombinants (precision 0.627-0.687, recall 0.652-0.686). These improvements, though incremental, represent progress in automated recombination detection. The successful preliminary integration of the logistic regression model into RDP5 demonstrates the practical applicability of these approaches. This work provides a foundation for enhancing viral recombination detection through machine learning, whilst highlighting areas requiring further development to achieve more substantial improvements in detection accuracy. 2025-11-06T07:22:35Z 2025-11-06T07:22:35Z 2025 2025-11-06T07:12:26Z Thesis / Dissertation Masters MSc http://hdl.handle.net/11427/42113 en eng application/pdf Computational Biology Division Faculty of Health Sciences |
| spellingShingle | Medicine Cullinan, Joshua Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification |
| thesis_degree_str | Master's |
| title | Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification |
| title_full | Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification |
| title_fullStr | Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification |
| title_full_unstemmed | Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification |
| title_short | Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification |
| title_sort | utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification |
| topic | Medicine |
| url | http://hdl.handle.net/11427/42113 |
| work_keys_str_mv | AT cullinanjoshua utilisingmachinelearningtechniquesonsimulatedviralevolutiondatasetstoimproveviralrecombinantidentification |
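
The description above says evaluation focused on precision, recall and F1-scores because recombinant and parental sequences are imbalanced. A minimal sketch of how those three metrics are derived from confusion-matrix counts, and why plain accuracy can look better than it should on imbalanced data, might read as follows. The function name and all counts are hypothetical illustrations, not figures or code from the thesis.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class.

    tp: true positives  (e.g. recombinants correctly flagged)
    fp: false positives (parental sequences wrongly flagged)
    fn: false negatives (recombinants missed)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical, imbalanced example: 100 recombinants among 1000 sequences.
tp, fp, fn, tn = 65, 35, 35, 865

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f}")
```

With these made-up counts, accuracy is dominated by the many correctly classified parental sequences, while precision, recall and F1 reflect only how well the minority recombinant class is detected.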