| Main Author: | Swanepoel, Phillip |
|---|---|
| Other Authors: | Martin, Darrin |
| Format: | Thesis |
| Language: | English |
| Published: | Computational Biology Division, 2024 |
| Subjects: | Medicine |

| _version_ | 1867613257022308352 |
|---|---|
| access_status_str | Open Access |
| author | Swanepoel, Phillip |
| author2 | Martin, Darrin |
| author_browse | Martin, Darrin Swanepoel, Phillip |
| author_facet | Martin, Darrin Swanepoel, Phillip |
| author_sort | Swanepoel, Phillip |
| collection | Thesis |
| description | Motivation. Recombination is a central evolutionary process that substantially changes the structure of genomes and shapes their evolutionary trajectory. Recombination detection is thus an important computational step in understanding the evolutionary history of nucleotide sequences, and the accurate identification of recombinant sequences is particularly important in the context of downstream phylogenetics-based sequence analyses. Evaluating recombination detection methods requires the simulation of sequence data, and the training of statistical learning models requires large, realistic datasets. The goal of this study was thus to (1) simulate large, realistic sequence datasets that have evolved in the presence of frequent recombination, and (2) to use these datasets to improve one of the computational steps used in the analysis of recombination by the computer program, recombination detection program 5 (RDP5), specifically: the identification of the recombinant from a recombinant/parent/parent triplet. Results. To improve the accuracy with which RDP5 identifies recombinant sequences, we simulated the evolution of recombining sequences to produce large datasets that could then be used to train a number of machine learning models to accurately differentiate recombinants from their parental sequences. The artificial intelligence systems created using these models showed a substantial improvement in recombinant identification accuracy over the method currently implemented in RDP5 - with an increase in accuracy of up to 26 percentage points. Availability and implementation. Our simulation software is a forked version of SANTA-SIM developed in Java. All source code is released and is available at: https://github.com/phillipswanepoel/santa-sim/tree/Recomb_and_align. |
| format | Thesis |
| id | oai:open.uct.ac.za:11427/39870 |
| institution | University of Cape Town (South Africa) |
| language | eng |
| last_indexed | 2026-06-10T12:33:15.376Z |
| license_str | Not specified — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository |
| publishDate | 2024 |
| publishDateRange | 2024 |
| publishDateSort | 2024 |
| publisher | Computational Biology Division |
| publisherStr | Computational Biology Division |
| record_format | dspace |
| source_str | UCTD — University of Cape Town Open Access Repository |
| spelling | oai:open.uct.ac.za:11427/39870 Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification Swanepoel, Phillip Martin, Darrin Medicine Motivation. Recombination is a central evolutionary process that substantially changes the structure of genomes and shapes their evolutionary trajectory. Recombination detection is thus an important computational step in understanding the evolutionary history of nucleotide sequences, and the accurate identification of recombinant sequences is particularly important in the context of downstream phylogenetics-based sequence analyses. Evaluating recombination detection methods requires the simulation of sequence data, and the training of statistical learning models requires large, realistic datasets. The goal of this study was thus to (1) simulate large, realistic sequence datasets that have evolved in the presence of frequent recombination, and (2) to use these datasets to improve one of the computational steps used in the analysis of recombination by the computer program, recombination detection program 5 (RDP5), specifically: the identification of the recombinant from a recombinant/parent/parent triplet. Results. To improve the accuracy with which RDP5 identifies recombinant sequences, we simulated the evolution of recombining sequences to produce large datasets that could then be used to train a number of machine learning models to accurately differentiate recombinants from their parental sequences. The artificial intelligence systems created using these models showed a substantial improvement in recombinant identification accuracy over the method currently implemented in RDP5 - with an increase in accuracy of up to 26 percentage points. Availability and implementation. Our simulation software is a forked version of SANTA-SIM developed in Java. All source code is released and is available at: https://github.com/phillipswanepoel/santa-sim/tree/Recomb_and_align. 2024-06-05T13:17:14Z 2024-06-05T13:17:14Z 2023 2024-06-05T12:51:27Z Thesis / Dissertation Masters MSc http://hdl.handle.net/11427/39870 eng application/pdf Computational Biology Division Faculty of Health Sciences |
| spellingShingle | Medicine Swanepoel, Phillip Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification |
| thesis_degree_str | Master's |
| title | Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification |
| title_full | Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification |
| title_fullStr | Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification |
| title_full_unstemmed | Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification |
| title_short | Simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification |
| title_sort | simulating recombinant sequence data to evaluate and improve computational methods of multiple sequence alignment and recombinant identification |
| topic | Medicine |
| url | http://hdl.handle.net/11427/39870 |
| work_keys_str_mv | AT swanepoelphillip simulatingrecombinantsequencedatetoevaluateandimprovecomputationalmethodsofmultiplesequencealignmentandrecombinantidentification |