Influenza-A's ability to mutate constantly has resulted in recurring seasonal epidemics and pandemics. Recently, the virus's spread has been enhanced by its ability to infect multiple hosts simultaneously. Fast identification of the subtype and hosts of the Influenza-A virus is thus crucial to quickly...
| Main Author: | |
|---|---|
| Format: | Thesis |
| Published: | AUC Knowledge Fountain, 2016 |
| Subjects: | |
| _version_ | 1867613416471920640 |
|---|---|
| access_status_str | Open Access |
| author | Shaltout, Nermin Ashraf |
| author_browse | Shaltout, Nermin Ashraf |
| author_facet | Shaltout, Nermin Ashraf |
| author_sort | Shaltout, Nermin Ashraf |
| collection | Thesis |
| dc_rights_str_mv | The author retains all rights with regard to copyright. The author certifies that written permission from the owner(s) of third-party copyrighted matter included in the thesis, dissertation, paper, or record of study has been obtained. The author further certifies that IRB approval has been obtained for this thesis, or that IRB approval is not necessary for this thesis. Insofar as this thesis, dissertation, paper, or record of study is an educational record as defined in the Family Educational Rights and Privacy Act (FERPA) (20 USC 1232g), the author has granted consent to disclosure of it to anyone who requests a copy. |
| description | Influenza-A's ability to mutate constantly has resulted in recurring seasonal epidemics and pandemics. Recently, the virus's spread has been enhanced by its ability to infect multiple hosts simultaneously. Fast identification of the subtype and hosts of the Influenza-A virus is thus crucial for quickly measuring its drug resistance and virulence. Research into data mining techniques for Influenza-A host and subtype classification is already underway. The main goal of older studies was to improve the accuracy, speed and safety of virus analyses. With newer infectious strains of Influenza-A appearing yearly, these techniques are still open to improvement. The current research aims to improve existing machine learning techniques for classifying Influenza-A using the following methodologies: (a) Exploring the effectiveness of using RNA/cDNA data over protein data for virus classification. (b) Measuring the impact of preprocessing the virus sequences, by selecting the most informative positions in the sequence, on classifier performance and speed; both neural networks (NNs) and decision trees (DTs) were analyzed. (c) Testing the previous method on more than one classification problem; host identification experiments were conducted on both subtypes H1 and H5, while antiviral resistance identification was conducted on the H1N1 strain. Accuracy, sensitivity, specificity, precision and time were used as performance measures. The final results showed that: (a) DNA data is more sensitive than protein data when using both subtypes. (b) Using the 100 and 10 most informative positions with DTs yielded an overall speed improvement of 92-100% when identifying hosts for segments of subtype H1. The performance decrease was insignificant. Using 100 and 60 informative positions with NNs yielded a speed improvement of 88% when identifying hosts of both subtypes H1 and H5. There was no significant drop in overall performance. Of the two classifiers, NNs had better performance, while DTs had better efficiency. (c) Testing the method on antiviral resistance identification of Influenza-A showed promising results: using the 100 most informative positions with DTs yielded an overall performance of not less than 95%, in not more than 3 seconds for all 8 segments. The method has the potential to improve the efficiency of other Influenza-A classification problems, as well as other viral classification problems in the bioinformatics field. The thesis provided the following contributions: (a) A way to extract informative positions from DNA data directly, without converting the DNA data to protein data. This can aid in detecting silent mutations in the Influenza-A virus. (b) Identification of adamantane antiviral resistance using all eight segments of the virus. Previously there was one known viral segment mainly responsible for antiviral resistance. (c) Measuring the efficiency of using informative positions, as a preprocessing step, in terms of speed. (d) A clear comparison between two classifier performances when using the information gain algorithm. |
| format | Thesis |
| id | oai:fount.aucegypt.edu:etds-2208 |
| institution | American University in Cairo (Egypt) |
| last_indexed | 2026-06-10T12:35:47.730Z |
| license_str | Other — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from AUC Knowledge Fountain — bepress |
| publishDate | 2016 |
| publishDateRange | 2016 |
| publishDateSort | 2016 |
| publisher | AUC Knowledge Fountain |
| publisherStr | AUC Knowledge Fountain |
| record_format | dspace |
| source_str | AUC Knowledge Fountain — bepress |
| spelling | oai:fount.aucegypt.edu:etds-2208 Improving machine learning techniques for influenza-A classification Shaltout, Nermin Ashraf 2016-06-01T07:00:00Z thesis application/pdf Theses and Dissertations AUC Knowledge Fountain D Bioinformatics |
| spellingShingle | D Bioinformatics Shaltout, Nermin Ashraf Improving machine learning techniques for influenza-A classification |
| title | Improving machine learning techniques for influenza-A classification |
| title_full | Improving machine learning techniques for influenza-A classification |
| title_fullStr | Improving machine learning techniques for influenza-A classification |
| title_full_unstemmed | Improving machine learning techniques for influenza-A classification |
| title_short | Improving machine learning techniques for influenza-A classification |
| title_sort | improving machine learning techniques for influenza a classification |
| topic | D Bioinformatics |
| url | https://fount.aucegypt.edu/etds/1209 https://fount.aucegypt.edu/context/etds/article/2208/viewcontent/ImprovingInfluenzaAClassification.pdf |
| work_keys_str_mv | AT shaltoutnerminashraf improvingmachinelearningtechniquesforinfluenzaaclassification |
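
The description in this record says that the most informative sequence positions were selected with the information gain algorithm before classification. The record does not show the thesis's implementation, so the following is only a minimal sketch of how per-position information gain could be computed for aligned sequences. The function names (`entropy`, `information_gain`, `top_positions`) and the four toy sequences with their host labels are hypothetical and invented for illustration; they are not taken from the thesis.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(column, labels):
    """Reduction in label entropy after splitting on one sequence position."""
    total = len(labels)
    groups = {}
    for symbol, label in zip(column, labels):
        groups.setdefault(symbol, []).append(label)
    # Weighted entropy of the labels within each nucleotide group.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_positions(sequences, labels, k):
    """Return the k highest-gain position indices and all per-position gains.

    Assumes the sequences are already aligned and of equal length.
    """
    length = len(sequences[0])
    gains = [
        information_gain([seq[i] for seq in sequences], labels)
        for i in range(length)
    ]
    ranked = sorted(range(length), key=lambda i: gains[i], reverse=True)
    return ranked[:k], gains


# Hypothetical toy data: four aligned nucleotide sequences with host labels.
sequences = ["ACGT", "ACGA", "TCGT", "TCGA"]
labels = ["human", "human", "avian", "avian"]

positions, gains = top_positions(sequences, labels, k=1)
print(positions, gains)
```

In this toy set only the first position separates the two hosts, so it is the one selected. A classifier such as a decision tree or neural network would then be trained on the selected columns only, which is the kind of preprocessing the description credits for the reported speed improvements.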