Simulating read length, sequencing depth and base-call quality for RNAsequencing experimental design

Thesis (MSc)--Stellenbosch University, 2021.

Bibliographic Details
Main Author: Zimire, Darryn
Other Authors: Tromp, Gerard
Format: Thesis
Language: en_ZA
Published: Stellenbosch : Stellenbosch University 2021
Subjects: RNA-sequencing; Biotechnology; Molecular biology; UCTD
_version_ 1867613777571086336
access_status_str Open Access
author Zimire, Darryn
author2 Tromp, Gerard
author_browse Tromp, Gerard
Zimire, Darryn
author_facet Tromp, Gerard
Zimire, Darryn
author_sort Zimire, Darryn
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MSc)--Stellenbosch University, 2021.
format Thesis
id oai:scholar.sun.ac.za:10019.1/123822
institution Stellenbosch University (South Africa)
language en_ZA
last_indexed 2026-06-10T12:41:32.562Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2021
publishDateRange 2021
publishDateSort 2021
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/123822 Simulating read length, sequencing depth and base-call quality for RNAsequencing experimental design Zimire, Darryn Tromp, Gerard Stellenbosch University. Faculty of Medicine and Health Sciences. Dept. of Biomedical Sciences: Molecular Biology and Human Genetics. RNA-sequencing Biotechnology Molecular biology UCTD Thesis (MSc)--Stellenbosch University, 2021. ENGLISH ABSTRACT: RNA-sequencing (RNA-seq) is a quantitative high-throughput sequencing biotechnology developed to analyse and provide insights into the molecular biology of the transcriptome. An appropriate experimental design and analysis strategy for RNA-seq experiments is essential and requires statistical methods suited to modelling the characteristics of sequencing data, which take the form of a matrix with the number of reads per genomic feature as a digital estimate of relative expression. Sequencing depth, read length and data quality are of particular importance for planning and analysing RNA-seq experiments, as these factors can be decided before conducting the experiment. The number of reads generated for a particular experiment affects the statistical power to make biological conclusions. Read length, coupled with its associated quality, influences the mappability of the sequencing data and in turn has an impact on information loss. Shorter reads tend to map to multiple locations when aligned to the reference genome or transcriptome. The quality of the data also affects the downstream analysis and can result in the discarding of data, diminishing the ability to establish biological insights with confidence from the experimental data. To assist in the design of RNA-seq experiments, I present an RNA-seq data simulator (RSDS), which is a proof-of-concept computer simulator written in the Python programming language for raw RNA-seq data simulations.
RSDS allows for simulation of both single-end and paired-end RNA-seq data with sequencing depth, read length, and base-call quality as tuneable settings. A two-group differential expression experiment can be simulated using RSDS. I describe, validate and implement the RSDS simulator and demonstrate its use for generation of raw synthetic RNA-seq data by varying the parameter values of sequencing depth, read length, and base-call quality. I demonstrate the ability of RSDS to reproduce a transcript expression profile from an input matrix of read counts derived from a real RNA-seq experiment and produce a two-group differential experiment with varying fold-changes and expression levels. AFRIKAANS SUMMARY: The development of quantitative sequencing technologies such as RNA-sequencing (RNA-seq) has provided major insight into molecular biology. Proper design and analysis of these experiments require statistical models and techniques that take into account the nature of sequencing data, which usually consist of a matrix of read counts per feature. An issue of particular importance for developing these methods and designing the experiments is the role of sequencing depth, read length and data quality. The depth of an experiment affects the ability to draw biological conclusions, which means that an experimental design must take into account the trade-off between cost, statistical power and the number of samples examined. Read length, together with its associated quality, is an important consideration for every sequencing experiment because it influences the fate of a sequenced read after mapping to a reference genome. Shorter reads tend to map to more than one location when aligned to the reference genome and are often discarded, which leads to a loss of biological information.
In this thesis I investigate the effects of sequencing depth, read length and data quality on the design and analysis of RNA-seq experiments. To assist in the design of RNA-seq experiments, I present the RNA-seq Data Simulator (RSDS), a proof-of-concept computer simulator written in the Python programming language for simulating raw RNA-seq data. RSDS supports simulation of both single-end and paired-end RNA-seq data with sequencing depth, read length and base-call quality as tuneable settings. It also offers the ability to simulate a two-group differential gene expression experiment. I describe, validate and implement the RSDS simulator and demonstrate its use for producing raw RNA-seq data by varying the parameter values of sequencing depth, read length and base-call quality. I also demonstrate the ability of RSDS to reproduce a transcript expression profile from an input matrix of read counts derived from a real RNA-seq experiment. Masters 2021-11-24T06:46:57Z 2021-12-22T14:23:25Z 2021-11-24T06:46:57Z 2021-12-22T14:23:25Z 2021-12 Thesis http://hdl.handle.net/10019.1/123822 en_ZA Stellenbosch University 146 pages application/pdf Stellenbosch : Stellenbosch University
spellingShingle RNA-sequencing
Biotechnology
Molecular biology
UCTD
Zimire, Darryn
Simulating read length, sequencing depth and base-call quality for RNAsequencing experimental design
title Simulating read length, sequencing depth and base-call quality for RNAsequencing experimental design
title_full Simulating read length, sequencing depth and base-call quality for RNAsequencing experimental design
title_fullStr Simulating read length, sequencing depth and base-call quality for RNAsequencing experimental design
title_full_unstemmed Simulating read length, sequencing depth and base-call quality for RNAsequencing experimental design
title_short Simulating read length, sequencing depth and base-call quality for RNAsequencing experimental design
title_sort simulating read length sequencing depth and base call quality for rnasequencing experimental design
topic RNA-sequencing
Biotechnology
Molecular biology
UCTD
url http://hdl.handle.net/10019.1/123822
work_keys_str_mv AT zimiredarryn simulatingreadlengthsequencingdepthandbasecallqualityforrnasequencingexperimentaldesign