Full Text Available

Performance analysis of text classification algorithms for PubMed articles

The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary developed by the US National Library of Medicine (NLM) for indexing articles in the PubMed Central (PMC) archive. The annotation process is a complex and time-consuming task that relies on subjective manual assignment of MeSH concepts....

Bibliographic Details
Main Author:	Savvi, Suzana
Other Authors:	Bonenkamp, Koen
Format:	Thesis
Language:	English
Published:	Department of Statistical Sciences 2022
Subjects:	Statistical Sciences

_version_	1867613941942714368
access_status_str	Open Access
author	Savvi, Suzana
author2	Bonenkamp, Koen
author_browse	Bonenkamp, Koen Savvi, Suzana
author_facet	Bonenkamp, Koen Savvi, Suzana
author_sort	Savvi, Suzana
collection	Thesis
description	The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary developed by the US National Library of Medicine (NLM) for indexing articles in the PubMed Central (PMC) archive. The annotation process is a complex and time-consuming task that relies on subjective manual assignment of MeSH concepts. Automating such tasks with machine learning may provide a more efficient and less ambiguous way of organising biomedical literature. This research presents a case study comparing the performance of several machine learning algorithms (Topic Modelling, Random Forest, Logistic Regression, Support Vector Classifiers, Multinomial Naive Bayes, Convolutional Neural Network and Long Short-Term Memory (LSTM)) in reproducing manually assigned MeSH annotations. Records for this study were retrieved from PubMed using the E-utilities API to the Entrez system of databases at the NCBI (National Center for Biotechnology Information). The MeSH vocabulary is organised in a hierarchical structure, and article abstracts labelled with a single MeSH term from the top second two layers were selected for training the machine learning models. Various strategies for multiclass text classification were considered. One was a Chi-square test for feature selection, which identified words relevant to each MeSH label. A second approach used Named Entity Recognition (NER) to extract entities from the unstructured text, and a third relied on word embeddings able to capture latent knowledge from the literature. At the start of the study, text was tokenised using the Term Frequency-Inverse Document Frequency (Tf-idf) technique and topic modelling was performed to ascertain the correlation between assigned topics (an unsupervised learning task) and MeSH terms in PubMed. Findings revealed that the degree of coupling was low, although significant. Of all the classifier models trained, logistic regression on Tf-idf vectorised entities achieved the highest accuracy. Performance varied across the different MeSH categories. In conclusion, automated curation of articles by abstract may be possible for those target classes that are classified reliably and reproducibly.
format	Thesis
id	oai:open.uct.ac.za:11427/36059
institution	University of Cape Town (South Africa)
language	eng
last_indexed	2026-06-10T12:44:09.393Z
license_str	Not specified — see source repository
provenance_str_mv	Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate	2022
publishDateRange	2022
publishDateSort	2022
publisher	Department of Statistical Sciences
publisherStr	Department of Statistical Sciences
record_format	dspace
source_str	UCTD — University of Cape Town Open Access Repository
spelling	oai:open.uct.ac.za:11427/36059 Performance analysis of text classification algorithms for PubMed articles Savvi, Suzana Bonenkamp, Koen Little, Francesca Statistical Sciences 2022-03-14T05:21:47Z 2022-03-14T05:21:47Z 2021 2022-03-14T05:18:11Z Master Thesis Masters MSc http://hdl.handle.net/11427/36059 eng application/pdf Department of Statistical Sciences Faculty of Science
spellingShingle	Statistical Sciences Savvi, Suzana Performance analysis of text classification algorithms for PubMed articles
thesis_degree_str	Master's
title	Performance analysis of text classification algorithms for PubMed articles
title_full	Performance analysis of text classification algorithms for PubMed articles
title_fullStr	Performance analysis of text classification algorithms for PubMed articles
title_full_unstemmed	Performance analysis of text classification algorithms for PubMed articles
title_short	Performance analysis of text classification algorithms for PubMed articles
title_sort	performance analysis of text classification algorithms for pubmed articles
topic	Statistical Sciences
url	http://hdl.handle.net/11427/36059
work_keys_str_mv	AT savvisuzana performanceanalysisoftextclassificationalgorithmsforpubmedarticles
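
Illustrative note on the description: the Chi-square feature selection mentioned in the abstract can be sketched as below. This is a minimal illustration and not the thesis's implementation. The function name `chi_square_scores`, the toy abstracts and the two labels are all hypothetical, and the statistic is the standard 2x2 document-frequency form, N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)), which may differ from the variant the author used.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each word by a 2x2 chi-square statistic against one label.

    Hypothetical helper for illustration; whitespace tokenisation only.
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    # Document frequency of each word inside and outside the target class.
    df_pos, df_neg = Counter(), Counter()
    for text, y in zip(docs, labels):
        words = set(text.lower().split())
        (df_pos if y == target else df_neg).update(words)

    scores = {}
    for word in set(df_pos) | set(df_neg):
        a = df_pos[word]   # target-class documents containing the word
        b = df_neg[word]   # other documents containing the word
        c = n_pos - a      # target-class documents without the word
        d = n_neg - b      # other documents without the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A word present in every document (or none) carries no signal.
        scores[word] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Made-up toy abstracts and labels, purely for illustration.
docs = [
    "tumor growth in lung tissue",
    "lung tumor resection outcomes",
    "virus replication in host cells",
    "host immune response to virus",
]
labels = ["Neoplasms", "Neoplasms", "Virus Diseases", "Virus Diseases"]

scores = chi_square_scores(docs, labels, "Neoplasms")
for word, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:5]:
    print(f"{word:12s} {score:.2f}")
```

The statistic is symmetric, so a word that appears only outside the target class (here "virus") scores as highly as one that appears only inside it; a practical pipeline would typically also check the direction of the association before keeping a word as a feature for a label.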

Similar Items