Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification

This thesis explores machine learning applications for enhancing viral recombination detection. Using SANTA-SIM-generated viral evolution data, multiple computational approaches were developed and evaluated against existing methods in the Recombination Detection Program (RDP5). The study trained and...

Bibliographic Details
Main Author: Cullinan, Joshua
Other Authors: Martin, Darrin
Format: Thesis
Language: English
Published: Computational Biology Division 2025
Subjects: Medicine
_version_ 1867613312987955200
access_status_str Open Access
author Cullinan, Joshua
author2 Martin, Darrin
author_browse Cullinan, Joshua
Martin, Darrin
author_facet Martin, Darrin
Cullinan, Joshua
author_sort Cullinan, Joshua
collection Thesis
description This thesis explores machine learning applications for enhancing viral recombination detection. Using SANTA-SIM-generated viral evolution data, multiple computational approaches were developed and evaluated against existing methods in the Recombination Detection Program (RDP5). The study trained and tested several models, including logistic regression, gradient boosting, random forests and neural networks, on a dataset of 491 124 sequences. A novel neural network architecture employing position selection achieved the highest performance with a weighted Area Under Curve (AUC) of 0.784, surpassing RDP5's baseline AUC of 0.739. The gradient boosting classifier demonstrated strong results with an AUC of 0.765, whilst the binary neural network achieved 0.764. Performance evaluation focused on precision, recall and F1-scores to address the inherent class imbalance between recombinant and parental sequences. The models demonstrated modest performance in detecting recombinants (precision 0.627-0.687, recall 0.652-0.686). These improvements, though incremental, represent progress in automated recombination detection. The successful preliminary integration of the logistic regression model into RDP5 demonstrates the practical applicability of these approaches. This work provides a foundation for enhancing viral recombination detection through machine learning, whilst highlighting areas requiring further development to achieve more substantial improvements in detection accuracy.
format Thesis
id oai:open.uct.ac.za:11427/42113
institution University of Cape Town (South Africa)
language English
eng
last_indexed 2026-06-10T12:34:08.683Z
license_str Not specified — see source repository
provenance_str_mv Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate 2025
publishDateRange 2025
publishDateSort 2025
publisher Computational Biology Division
publisherStr Computational Biology Division
record_format dspace
source_str UCTD — University of Cape Town Open Access Repository
spelling oai:open.uct.ac.za:11427/42113 Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification Cullinan, Joshua Martin, Darrin Medicine This thesis explores machine learning applications for enhancing viral recombination detection. Using SANTA-SIM-generated viral evolution data, multiple computational approaches were developed and evaluated against existing methods in the Recombination Detection Program (RDP5). The study trained and tested several models, including logistic regression, gradient boosting, random forests and neural networks, on a dataset of 491 124 sequences. A novel neural network architecture employing position selection achieved the highest performance with a weighted Area Under Curve (AUC) of 0.784, surpassing RDP5's baseline AUC of 0.739. The gradient boosting classifier demonstrated strong results with an AUC of 0.765, whilst the binary neural network achieved 0.764. Performance evaluation focused on precision, recall and F1-scores to address the inherent class imbalance between recombinant and parental sequences. The models demonstrated modest performance in detecting recombinants (precision 0.627-0.687, recall 0.652-0.686). These improvements, though incremental, represent progress in automated recombination detection. The successful preliminary integration of the logistic regression model into RDP5 demonstrates the practical applicability of these approaches. This work provides a foundation for enhancing viral recombination detection through machine learning, whilst highlighting areas requiring further development to achieve more substantial improvements in detection accuracy. 2025-11-06T07:22:35Z 2025-11-06T07:22:35Z 2025 2025-11-06T07:12:26Z Thesis / Dissertation Masters MSc http://hdl.handle.net/11427/42113 en eng application/pdf Computational Biology Division Faculty of Health Sciences
spellingShingle Medicine
Cullinan, Joshua
Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification
thesis_degree_str Master's
title Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification
title_full Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification
title_fullStr Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification
title_full_unstemmed Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification
title_short Utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification
title_sort utilising machine learning techniques on simulated viral evolution datasets to improve viral recombinant identification
topic Medicine
url http://hdl.handle.net/11427/42113
work_keys_str_mv AT cullinanjoshua utilisingmachinelearningtechniquesonsimulatedviralevolutiondatasetstoimproveviralrecombinantidentification