Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification

Motivation. Recombination is a central evolutionary process that substantially changes the structure of genomes and shapes their evolutionary trajectory. Recombination detection is thus an important computational step in understanding the evolutionary history of nucleotide sequences, and the accurat...

Bibliographic Details
Main Author: Swanepoel, Phillip
Other Authors: Martin, Darrin
Format: Thesis
Language: English
Published: Computational Biology Division 2024
_version_ 1867613257022308352
access_status_str Open Access
author Swanepoel, Phillip
author2 Martin, Darrin
author_browse Martin, Darrin
Swanepoel, Phillip
author_facet Martin, Darrin
Swanepoel, Phillip
author_sort Swanepoel, Phillip
collection Thesis
description Motivation. Recombination is a central evolutionary process that substantially changes the structure of genomes and shapes their evolutionary trajectory. Recombination detection is thus an important computational step in understanding the evolutionary history of nucleotide sequences, and the accurate identification of recombinant sequences is particularly important in the context of downstream phylogenetics-based sequence analyses. Evaluating recombination detection methods requires the simulation of sequence data, and the training of statistical learning models requires large, realistic datasets. The goal of this study was thus to (1) simulate large, realistic sequence datasets that have evolved in the presence of frequent recombination, and (2) use these datasets to improve one of the computational steps used in the analysis of recombination by the computer program Recombination Detection Program 5 (RDP5), specifically the identification of the recombinant in a recombinant/parent/parent triplet. Results. To improve the accuracy with which RDP5 identifies recombinant sequences, we simulated the evolution of recombining sequences to produce large datasets that could then be used to train a number of machine learning models to accurately differentiate recombinants from their parental sequences. The artificial intelligence systems created using these models showed a substantial improvement in recombinant identification accuracy over the method currently implemented in RDP5, with an increase in accuracy of up to 26 percentage points. Availability and implementation. Our simulation software is a forked version of SANTA-SIM, developed in Java. All source code is released and is available at: https://github.com/phillipswanepoel/santa-sim/tree/Recomb_and_align.
format Thesis
id oai:open.uct.ac.za:11427/39870
institution University of Cape Town (South Africa)
language eng
last_indexed 2026-06-10T12:33:15.376Z
license_str Not specified — see source repository
provenance_str_mv Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate 2024
publishDateRange 2024
publishDateSort 2024
publisher Computational Biology Division
publisherStr Computational Biology Division
record_format dspace
source_str UCTD — University of Cape Town Open Access Repository
spelling oai:open.uct.ac.za:11427/39870 Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification Swanepoel, Phillip Martin, Darrin Medicine Motivation. Recombination is a central evolutionary process that substantially changes the structure of genomes and shapes their evolutionary trajectory. Recombination detection is thus an important computational step in understanding the evolutionary history of nucleotide sequences, and the accurate identification of recombinant sequences is particularly important in the context of downstream phylogenetics-based sequence analyses. Evaluating recombination detection methods requires the simulation of sequence data, and the training of statistical learning models requires large, realistic datasets. The goal of this study was thus to (1) simulate large, realistic sequence datasets that have evolved in the presence of frequent recombination, and (2) use these datasets to improve one of the computational steps used in the analysis of recombination by the computer program Recombination Detection Program 5 (RDP5), specifically the identification of the recombinant in a recombinant/parent/parent triplet. Results. To improve the accuracy with which RDP5 identifies recombinant sequences, we simulated the evolution of recombining sequences to produce large datasets that could then be used to train a number of machine learning models to accurately differentiate recombinants from their parental sequences. The artificial intelligence systems created using these models showed a substantial improvement in recombinant identification accuracy over the method currently implemented in RDP5, with an increase in accuracy of up to 26 percentage points. Availability and implementation. Our simulation software is a forked version of SANTA-SIM, developed in Java.
All source code is released and is available at: https://github.com/phillipswanepoel/santa-sim/tree/Recomb_and_align. 2024-06-05T13:17:14Z 2024-06-05T13:17:14Z 2023 2024-06-05T12:51:27Z Thesis / Dissertation Masters MSc http://hdl.handle.net/11427/39870 eng application/pdf Computational Biology Division Faculty of Health Sciences
spellingShingle Medicine
Swanepoel, Phillip
Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification
thesis_degree_str Master's
title Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification
title_full Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification
title_fullStr Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification
title_full_unstemmed Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification
title_short Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification
title_sort simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification
topic Medicine
url http://hdl.handle.net/11427/39870
work_keys_str_mv AT swanepoelphillip simulatingrecombinantsequencedatetoevaluateandimprovecomputationalmethodsofmultiplesequencealignmentandrecombinantidentification