Robberts, Sinead. 2021. The assessment and validation of DNA variants detected by massively parallel sequencing using gene panels. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/658dd3c5-ac91-427b-8c71-33ecef268cef
| Main Author: | Robberts, Sinead |
|---|---|
| Other Authors: | Bardien, Soraya |
| Format: | Thesis |
| Language: | English |
| Published: | Stellenbosch : Stellenbosch University, 2021 |
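| Subjects: | DNA variants detection; human genetics; Parkinson’s disease (PD); miscalled variants; High-throughput nucleotide sequencing; Human genetics -- Variation; DNA -- Analysis; UCTD |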
| _version_ | 1867613904651157504 |
|---|---|
| access_status_str | Open Access |
| author | Robberts, Sinead |
| author2 | Bardien, Soraya |
| author_browse | Bardien, Soraya Robberts, Sinead |
| author_facet | Bardien, Soraya Robberts, Sinead |
| author_sort | Robberts, Sinead |
| collection | Thesis |
| dc_rights_str_mv | Stellenbosch University |
| description | Robberts, Sinead. 2021. The assessment and validation of DNA variants detected by massively parallel sequencing using gene panels. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/658dd3c5-ac91-427b-8c71-33ecef268cef |
| format | Thesis |
| id | oai:scholar.sun.ac.za:10019.1/110436 |
| institution | Stellenbosch University (South Africa) |
| language | English |
| last_indexed | 2026-06-10T12:43:33.723Z |
| license_str | Other — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository |
| publishDate | 2021 |
| publishDateRange | 2021 |
| publishDateSort | 2021 |
| publisher | Stellenbosch : Stellenbosch University |
| publisherStr | Stellenbosch : Stellenbosch University |
| record_format | dspace |
| source_str | SUNScholar — Stellenbosch University Repository |
| spelling | oai:scholar.sun.ac.za:10019.1/110436 The assessment and validation of DNA variants detected by massively parallel sequencing using gene panels Robberts, Sinead Bardien, Soraya Vorster, Alvera Stellenbosch University. Faculty of Medicine and Health Sciences. Dept. of Biomedical Sciences. Molecular Biology and Human Genetics. DNA variants detection; human genetics; Parkinson’s disease (PD); miscalled variants High-throughput nucleotide sequencing Human genetics -- Variation DNA -- Analysis UCTD Robberts, Sinead. 2021. The assessment and validation of DNA variants detected by massively parallel sequencing using gene panels. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/658dd3c5-ac91-427b-8c71-33ecef268cef Thesis (MSc)--Stellenbosch University, 2021. ENGLISH ABSTRACT: The exponential growth of massively parallel sequencing (MPS) applications for DNA variant detection has transformed the field of human genetics. However, inaccurate variant calling arising from technical and biological MPS artefacts has caused researchers to rely on various validation methods. Notably, the consequence of variant miscalling is that a disease-causing mutation might be missed for a particular patient, leading to a delayed or inaccurate clinical diagnosis. MPS technology had been used in previous studies that screened for mutations in African individuals with Parkinson’s disease (PD). In these studies, six independent datasets were produced for one PD-affected individual: five had been generated using the Ion AmpliSeq™ Neurological research panel, and one had been generated using the Agilent SureSelect™ custom-designed PD gene panel. Interestingly, discordance in variant calling for this individual was observed, underlining the ambiguity of variant calling using MPS. 
Therefore, the aim of the present study was to assess the variants called within this unique collection of six MPS datasets for one individual, and to select predicted true and false positive variants for validation, whilst identifying technical artefacts influencing variant calling. The vcftools suite was used to calculate the concordance of the variants called across the five AmpliSeq™ Variant Call Format files (VCFf). This illustrated the degree of miscalled variants across multiple datasets representing the same individual. A 66.1% (n=3502/5297) concordance was calculated across the five VCFfs. However, when a hotspot file was included during variant calling, the concordance increased to 84.8%. A hotspot file ‘instructs’ the Torrent Variant Caller software to strictly call variants within specific genomic positions. To determine the factors influencing the variance within the AmpliSeq™ merged VCFf, a principal component analysis (PCA) was performed using the R Studio® package to construct multidimensional principal components (PC). These factors within the PCA represented the VCFf informative quality metrics (IQM). From this, the influential IQMs could be identified and their effect on variant calling could be assessed. The PCA findings indicated that 95% of the observed variance in the dataset could be accounted for by three PCs, namely the depth of coverage (DP), allele frequency (AF) and genotyping quality (GQ). Using the annovar annotation software suite, variants were characterized with functional features such as gene names and variant regions. With these findings, 37 DNA variants were selected for Sanger sequencing validation based on the most influential IQMs (DP, GQ, and AF) identified from the PCA, or based on their location in PD-associated genes. Of the 37 variants, 36 were successfully validated using Sanger sequencing and 91.7% (n = 33/36) were classified as true positive variants. 
Three variants, which were selected based on low DP scores, were false positive variants. However, eight true variants displayed low alternate alleles that were not called using the Sanger sequencing analysis software with default settings, indicating possible allelic imbalance. Unexpectedly, 17 ‘additional’ variants were found during validation analysis and 41.2% (n = 7/17) were classified as false negative variants. These variants were not called in the five MPS datasets even though their genomic regions were targeted. Moreover, to determine discordant variant calls between the two gene panels, the bedtools suite was used to identify variants called in genomic regions that were covered by both gene panels. Nine variants were found to be discordant between the two panels and were selected for validation using Sanger sequencing. Of these, three variants were uniquely called by the AmpliSeq™ panel and six variants were uniquely called by the SureSelect™ panel. These were validated as true variants, although allelic imbalance and homopolymer stretches were observed. Evidently, the concordance calculated across multiple MPS datasets and between the two gene panels for the same individual highlighted several variant calling inaccuracies. This study’s criteria for selecting variants for validation using IQM scores gave insight into the factors influencing variant calling, of which DP is a major contributing factor. Therefore, these findings are important for improving the confidence in MPS data analysis and diminishing the reliance on validation methods for variants called in MPS. Although using Sanger sequencing for validation was useful, miscalled MPS variants caused by artefacts such as allelic imbalance should be analyzed by manual inspection or validated with an alternate method. 
In conclusion, this study revealed that performing assessments on MPS datasets is required to understand the paradigms of accurate MPS variant calling, and to identify variants necessary for validation. The field of precision medicine has provided significant breakthroughs into the pathobiology of various disorders, but it is fundamentally dependent on technologies such as MPS. It is therefore critical to identify and manage artefacts currently limiting the use of MPS data to facilitate its broader application to the study of human disease. AFRIKAANS SUMMARY: The field of human genetics has been transformed by the exponential growth of massively parallel sequencing (MPS) applications for DNA variant detection. Nevertheless, inaccurate variant calling, resulting from MPS technical and biological artefacts, has forced researchers to rely on various validation methods. The consequence of variant miscalling is that a disease-causing mutation may be overlooked for a particular patient, which can lead to a delayed or inaccurate clinical diagnosis. MPS technology was previously used in studies that screened for mutations in African individuals with Parkinson’s disease (PD). In these studies, six independent datasets were generated for one PD-affected individual: five were generated with the Ion AmpliSeq™ Neurological research panel, and one was generated with the Agilent SureSelect™ custom-designed PD gene panel. Interestingly, differences in variant calling were observed for this individual, which underlines how variable variant calling by MPS is. The aim of the present study was therefore to examine the variants called in this unique collection of six MPS datasets for one individual, and to select predicted true and false positive variants for validation, while identifying technical artefacts that influence variant calling. 
The vcftools suite was used to calculate the concordance between the number of variants called in the five AmpliSeq™ Variant Call Format files (VCFf). This illustrated the extent of miscalled variants across multiple datasets representing the same individual. A concordance of 66.1% (n=3502/5297) was calculated across the five VCFfs. Nevertheless, the concordance increased to 84.8% when a hotspot file was included during variant calling. A hotspot file gives ‘instructions’ to the Torrent Variant Caller software to call all variants at specific genomic positions. To determine the factors influencing the variance within the merged AmpliSeq™ VCFf, a principal component analysis (PCA) was performed with the R Studio® package to construct multidimensional principal components. These factors in the PCA represent the VCFf informative quality metrics (IQM). Consequently, the influential IQMs could be identified and their effect on variant calling could be examined. The PCA findings indicated that 95% of the observed variance in the dataset could be attributed to three PCs, namely the depth of coverage (DP), allele frequency (AF) and genotyping quality (GQ). Variants were characterized with functional features such as gene names and variant regions using the annovar annotation software suite. Consequently, 37 DNA variants were selected for validation by Sanger sequencing based on the most influential IQMs (DP, GQ and AF) identified from the PCA, or based on their location in PD-associated genes. Of the 37 variants, 36 were successfully validated using Sanger sequencing and 91.7% (n=33/36) were classified as true positive variants. Three variants, which were selected on the basis of low DP scores, were false variants. 
However, eight true variants showed low alternate alleles that were not called by the Sanger sequencing analysis software with default settings, which indicates possible allelic imbalance. Unexpectedly, 17 ‘additional’ variants were found during validation analysis and 41.2% (n=7/17) were classified as false negative variants. These variants were not called in the five MPS datasets, even though their genomic regions were targeted. Furthermore, the bedtools suite was used to identify variants called in genomic regions covered by both gene panels, in order to determine discordant variant calls between the two gene panels. Nine variants were found to be discordant between the two panels and were therefore selected for validation by Sanger sequencing. Three of these variants were uniquely called by the AmpliSeq™ panel and six variants were uniquely called by the SureSelect™ panel. These were validated as true variants, although allelic imbalance and homopolymer stretches were observed. Evidently, the concordance calculated between the number of variants called in multiple MPS datasets and between the two gene panels for the same individual highlighted several variant calling inaccuracies. This study’s criteria for selecting variants for validation using IQM scores gave insight into the factors that influence variant calling, of which DP is a major contributing factor. These findings are therefore important for improving confidence in MPS data analysis and for reducing the reliance on validation methods for variants called in MPS. Although the use of Sanger sequencing for validation was useful, miscalled MPS variants caused by artefacts, such as allelic imbalance, should be analyzed by manual inspection or validated by an alternative method. 
In conclusion, this study revealed that performing assessments on MPS datasets is required to understand the paradigms of accurate MPS variant calling and to identify variants that need validation. The field of precision medicine has provided significant breakthroughs in the pathobiology of various disorders, but it is fundamentally dependent on technologies such as MPS. It is therefore critical to identify and manage the artefacts that currently limit the use of MPS data, in order to facilitate its broader application to the study of human diseases. Masters 2021-04-30T10:27:25Z 2021-04-30T10:27:25Z 2021-03 Thesis http://hdl.handle.net/10019.1/110436 en Stellenbosch University xvi, 117 pages : illustrations application/pdf Stellenbosch : Stellenbosch University |
| spellingShingle | DNA variants detection; human genetics; Parkinson’s disease (PD); miscalled variants High-throughput nucleotide sequencing Human genetics -- Variation DNA -- Analysis UCTD Robberts, Sinead The assessment and validation of DNA variants detected by massively parallel sequencing using gene panels |
| title | The assessment and validation of DNA variants detected by massively parallel sequencing using gene panels |
| title_full | The assessment and validation of DNA variants detected by massively parallel sequencing using gene panels |
| title_fullStr | The assessment and validation of DNA variants detected by massively parallel sequencing using gene panels |
| title_full_unstemmed | The assessment and validation of DNA variants detected by massively parallel sequencing using gene panels |
| title_short | The assessment and validation of DNA variants detected by massively parallel sequencing using gene panels |
| title_sort | assessment and validation of dna variants detected by massively parallel sequencing using gene panels |
| topic | DNA variants detection; human genetics; Parkinson’s disease (PD); miscalled variants High-throughput nucleotide sequencing Human genetics -- Variation DNA -- Analysis UCTD |
| url | http://hdl.handle.net/10019.1/110436 |
| work_keys_str_mv | AT robbertssinead theassessmentandvalidationofdnavariantsdetectedbymassivelyparallelsequencingusinggenepanels AT robbertssinead assessmentandvalidationofdnavariantsdetectedbymassivelyparallelsequencingusinggenepanels |
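
The abstract reports concordance across the five AmpliSeq™ VCF files as a percentage (66.1%, n=3502/5297). The thesis computed this with vcftools; the record does not state the exact definition used, so the Python sketch below assumes one plausible reading: variants called in every dataset divided by variants called in at least one dataset. The `Variant` key, the `concordance` and `as_percent` helpers, and the toy call sets are hypothetical illustrations, not the author's pipeline.

```python
from typing import Iterable, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def as_percent(numerator: int, denominator: int) -> float:
    """Express numerator/denominator as a percentage rounded to one decimal."""
    return round(100.0 * numerator / denominator, 1)


def concordance(call_sets: Iterable[Set[Variant]]) -> Tuple[int, int, float]:
    """Return (shared, total, percent) for several call sets of one individual.

    Assumed definition: shared = variants present in every set,
    total = variants present in at least one set.
    """
    sets = list(call_sets)
    if not sets:
        raise ValueError("at least one call set is required")
    shared = set.intersection(*sets)  # called in every dataset
    total = set.union(*sets)          # called in at least one dataset
    pct = as_percent(len(shared), len(total)) if total else 0.0
    return len(shared), len(total), pct


# Toy data standing in for three sequencing runs of the same individual.
run_a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
run_b = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
run_c = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr3", 75, "T", "C")}

print(concordance([run_a, run_b, run_c]))
```

The same `as_percent` helper reproduces the rounded figures quoted in the abstract from their stated counts: 3502/5297, 33/36 and 7/17.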