Hierarchical text classification with transformer-based language models

Thesis (MSc)--Stellenbosch University, 2024.

Bibliographic Details
Main Author: Du Toit, Jaco
Other Authors: Dunaiski, Marcel
Format: Thesis
Language: en_ZA
Published: Stellenbosch : Stellenbosch University 2024
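Subjects: Natural language processing (Computer science) -- Data processing; Hierarchical text classification; Computational linguistics; UCTD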
_version_ 1867613752040357888
access_status_str Open Access
author Du Toit, Jaco
author2 Dunaiski, Marcel
author_browse Du Toit, Jaco
Dunaiski, Marcel
author_facet Dunaiski, Marcel
Du Toit, Jaco
author_sort Du Toit, Jaco
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MSc)--Stellenbosch University, 2024.
format Thesis
id oai:scholar.sun.ac.za:10019.1/130642
institution Stellenbosch University (South Africa)
language en_ZA
en_ZA
last_indexed 2026-06-10T12:41:07.950Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2024
publishDateRange 2024
publishDateSort 2024
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/130642 Hierarchical text classification with transformer-based language models Du Toit, Jaco Dunaiski, Marcel Stellenbosch University. Faculty of Science. Dept. of Computer Science. Natural language processing (Computer science) -- Data processing Hierarchical text classification Computational linguistics UCTD Thesis (MSc)--Stellenbosch University, 2024. ENGLISH ABSTRACT: Hierarchical text classification (HTC) is a natural language processing (NLP) task which has the objective of classifying text documents into a set of classes from a structured class hierarchy. For example, news articles can be classified into a hierarchical class set which comprises broad categories such as “Politics” and “Sport” in higher-levels with associated finer-grained categories such as “Europe” and “Cycling” in lower-levels. In recent years many different NLP approaches have been significantly improved through the use of transformer-based pre-trained language models (PLMs). PLMs are typically trained on large amounts of textual data through self-supervised tasks such that they acquire language understanding capabilities which can be used to solve various NLP tasks, including HTC. In this thesis, we propose three new approaches for leveraging transformer-based PLMs to improve classification performance on HTC tasks. Our first approach formulates how hierarchy-aware prompts can be applied to discriminative language models such that it allows HTC tasks to scale to problems with very large hierarchical class structures. Our second approach uses label-wise attention mechanisms to obtain label-specific document representations which are used to fine-tune PLMs for HTC tasks. Furthermore, we propose a label-wise attention mechanism which splits the attention mechanisms into the different levels of the class hierarchy and leverages the predictions of all ancestor levels during the prediction of classes at a particular level.
The third approach combines features extracted from a PLM and a topic model to train a classifier which comprises convolutional layers followed by a label-wise attention mechanism. We evaluate all three approaches comprehensively and show that our first two proposed approaches obtain state-of-the-art performances on three HTC benchmark datasets. Our results show that the use of prompts and label-wise attention mechanisms to fine-tune PLMs are very effective techniques for classifying text documents into hierarchical class sets. Furthermore, we show that these techniques are able to effectively leverage the language understanding capabilities of PLMs and incorporate the hierarchical class structure information to improve classification performance. We also introduce three new HTC benchmark datasets which comprise the titles and abstracts of research publications from the Web of Science publication database with associated categories. The first two datasets use journal- and citation-based classification schemas respectively, while the third dataset combines these classifications with the aim of removing documents and classes which do not have a clear overlap between the two schemas. We show that this results in a more consistent classification of the publications. Finally, we perform experiments on these three datasets with the best-performing approaches proposed in this thesis to provide a baseline for future research. AFRIKAANS SUMMARY: Hierarchical text classification (HTC) is a natural language processing (NLP) task that aims to classify text documents into a class set drawn from a structured class hierarchy. For example, news articles can be classified into a hierarchical class set consisting of broad categories such as “Politics” and “Sport” at higher levels, with associated finer-grained categories such as “Europe” and “Cycling” at lower levels.
In recent years, many different methods for NLP tasks have been considerably improved through the use of transformer-based pre-trained language models. Language models are typically trained on large amounts of text data by means of self-supervised tasks, so that they acquire language understanding capabilities that can be used to solve various NLP tasks, including HTC. In this thesis we propose three new methods for using transformer-based language models to improve classification for HTC tasks. Our first method formulates how a hierarchy-aware prompt can be applied to discriminative language models, so that HTC tasks can scale to problems with very large hierarchical class structures. Our second method uses label-wise attention mechanisms to obtain label-specific document representations, which are used to fine-tune language models for HTC tasks. Furthermore, we propose a label-wise attention mechanism that splits the attention mechanisms across the different levels of the class hierarchy and uses the predictions of all higher levels when predicting classes at a particular level. The third method combines document representations extracted from a language model and a topic model to train a classification model that contains convolutional layers, followed by a label-wise attention mechanism. We evaluate all three methods and show that our first two methods outperform recently proposed methods on three HTC benchmark datasets. Our results show that using a hierarchy-aware prompt and label-wise attention mechanisms to fine-tune language models are very effective techniques for classifying text documents into hierarchical class sets. Furthermore, we show that these techniques are able to use the language understanding capability of language models effectively and to use the hierarchical class structure information to improve classification.
We also introduce three new HTC benchmark datasets containing the titles and abstracts of research publications from the Web of Science publication database with associated categories. The first two datasets use journal-based and citation-based classification schemas respectively, while the third dataset combines these classifications with the aim of removing documents and classes that do not have a clear overlap between the two schemas. We show that our proposed method leads to a more consistent classification of the publications. Finally, we evaluate the best-performing methods proposed in this thesis on the three new datasets to provide a baseline for future research. Masters 2024-02-12T08:42:29Z 2024-04-27T01:02:36Z 2024-02-12T08:42:29Z 2024-04-27T01:02:36Z 2024-03 Thesis https://scholar.sun.ac.za/handle/10019.1/130642 en_ZA en_ZA Stellenbosch University xvi, 128 pages : illustrations (some color) application/pdf Stellenbosch : Stellenbosch University
spellingShingle Natural language processing (Computer science) -- Data processing
Hierarchical text classification
Computational linguistics
UCTD
Du Toit, Jaco
Hierarchical text classification with transformer-based language models
title Hierarchical text classification with transformer-based language models
title_full Hierarchical text classification with transformer-based language models
title_fullStr Hierarchical text classification with transformer-based language models
title_full_unstemmed Hierarchical text classification with transformer-based language models
title_short Hierarchical text classification with transformer-based language models
title_sort hierarchical text classification with transformer based language models
topic Natural language processing (Computer science) -- Data processing
Hierarchical text classification
Computational linguistics
UCTD
url https://scholar.sun.ac.za/handle/10019.1/130642
work_keys_str_mv AT dutoitjaco hierarchicaltextclassificationwithtransformerbasedlanguagemodels