Feature engineered embeddings for machine learning on molecular data

Mini Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2022.

Bibliographic Details
Other Authors: De Waal, Alta
Format: Thesis
Language: English
Published: University of Pretoria 2023
_version_ 1867613638340116480
access_status_str Open Access
author2 De Waal, Alta
author_browse De Waal, Alta
author_facet De Waal, Alta
collection Thesis
dc_rights_str_mv © 2022 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
description Mini Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2022.
format Thesis
id oai:repository.up.ac.za:2263/89279
institution University of Pretoria (South Africa)
language English
last_indexed 2026-06-10T12:39:19.648Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from UPSpace — University of Pretoria Institutional Repository
publishDate 2023
publishDateRange 2023
publishDateSort 2023
publisher University of Pretoria
publisherStr University of Pretoria
record_format dspace
source_str UPSpace — University of Pretoria Institutional Repository
spelling oai:repository.up.ac.za:2263/89279 Feature engineered embeddings for machine learning on molecular data De Waal, Alta u17029008@tuks.co.za Jardim, Claudio UCTD Machine learning Data science Statistics Biology Molecules Embeddings Mini Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2022. The classification of molecules is of particular importance to the drug discovery process and several other use cases. Data in this domain can be partitioned into structural and sequence/text data. Several techniques such as deep learning are able to classify molecules and predict their functions using both types of data. Molecular structure and encoded chemical information are sufficient to classify a characteristic of a molecule. However, the use of a molecule’s structural information typically requires large amounts of computational power with deep learning models that take a long time to train. In this study, we present a different approach to molecule classification that addresses the limitations of other techniques. This approach uses natural language processing techniques in the form of count vectorisation, term frequency-inverse document frequency, word2vec and latent Dirichlet allocation to feature engineer molecular text data. Through this approach we aim to make a robust and explainable embedding that is fast to implement and solely dependent on chemical (text) data such as the sequence of a protein. Further, we investigate the usefulness of these explainable embeddings for machine learning models, for representing a corpus of data in vector space and for protein-protein interaction prediction using embedding similarity. We apply the techniques on three different types of molecular text data: FASTA sequence data, Simplified Molecular Input Line Entry Specification data and Protein Data Bank data. We show that these embeddings provide excellent performance for classification and protein-protein binding prediction.
Statistics MSc (Advanced Data Analytics) Unrestricted 2023-02-08T06:50:28Z 2023-02-08T06:50:28Z 2023-05 2022 Mini Dissertation * A2023 https://repository.up.ac.za/handle/2263/89279 10.25403/UPresearchdata.22043297 en © 2022 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. application/pdf University of Pretoria
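The abstract above describes turning molecular text into embeddings with term frequency-inverse document frequency and comparing them by embedding similarity. A minimal sketch of that idea follows. It is not the dissertation's implementation: the overlapping k-mer tokenisation, the smoothed IDF formula, the function names and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter


def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (assumed tokenisation)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def tfidf_embeddings(sequences, k=3):
    """Return (vocabulary, one TF-IDF vector per sequence)."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*docs))
    n_docs = len(docs)
    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1
    idf = {
        t: math.log((1 + n_docs) / (1 + sum(t in d for d in docs))) + 1.0
        for t in vocab
    }
    vectors = []
    for d in docs:
        total = sum(d.values()) or 1  # guard against sequences shorter than k
        vectors.append([d[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up toy sequences, not real proteins or data from the study.
toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GGSGGSGGSG"]
vocab, vecs = tfidf_embeddings(toy)
print(cosine(vecs[0], vecs[1]))  # sequences sharing most k-mers
print(cosine(vecs[0], vecs[2]))  # sequences sharing no k-mers
```

Because each vector dimension corresponds to a named k-mer, the weights can be inspected directly, which is one way such an embedding could be called explainable.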
spellingShingle UCTD
Machine learning
Data science
Statistics
Biology
Molecules
Embeddings
Feature engineered embeddings for machine learning on molecular data
title Feature engineered embeddings for machine learning on molecular data
title_full Feature engineered embeddings for machine learning on molecular data
title_fullStr Feature engineered embeddings for machine learning on molecular data
title_full_unstemmed Feature engineered embeddings for machine learning on molecular data
title_short Feature engineered embeddings for machine learning on molecular data
title_sort feature engineered embeddings for machine learning on molecular data
topic UCTD
Machine learning
Data science
Statistics
Biology
Molecules
Embeddings
url https://repository.up.ac.za/handle/2263/89279