A framework for evaluating semi-structured hierarchical data using language models

Thesis (MEng)--Stellenbosch University, 2026.

Bibliographic Details
Main Author:	Du Plessis, Stephan Visser
Other Authors:	Van Vuuren, J. H.
Format:	Thesis
Language:	English
Published:	Stellenbosch : Stellenbosch University 2026

_version_	1867613843957481472
access_status_str	Open Access
author	Du Plessis, Stephan Visser
author2	Van Vuuren, J. H.
author_browse	Du Plessis, Stephan Visser Van Vuuren, J. H.
author_facet	Van Vuuren, J. H. Du Plessis, Stephan Visser
author_sort	Du Plessis, Stephan Visser
collection	Thesis
dc_rights_str_mv	Stellenbosch University
description	Thesis (MEng)--Stellenbosch University, 2026.
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/135772
institution	Stellenbosch University (South Africa)
language	English
last_indexed	2026-06-10T12:42:35.472Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2026
publishDateRange	2026
publishDateSort	2026
publisher	Stellenbosch : Stellenbosch University
publisherStr	Stellenbosch : Stellenbosch University
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/135772 A framework for evaluating semi-structured hierarchical data using language models Du Plessis, Stephan Visser Van Vuuren, J. H. Nel, G. S. Stellenbosch University. Faculty of Engineering. Dept. of Industrial Engineering. Thesis (MEng)--Stellenbosch University, 2026. Du Plessis, S. V. 2026. A framework for evaluating semi-structured hierarchical data using language models. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/413a765e-7ac0-4753-81a4-d243d167136c Recent advances in large language models have intensified interest in applying them to tasks such as extractive question answering over semi-structured hierarchical data represented in markup languages. The conventional practice of linearising markup to plain text involves removing structural information that is integral to interpretation, and existing structure-aware approaches are often supported by ad hoc, task-specific pipelines. Limited guidance is available on how heterogeneous sources should be transformed into structure-preserving representations, how alternative model adaptation strategies should be compared, and how the impact of structural information should be quantified in a reproducible manner. This has resulted in fragmented methodologies for the development and assessment of language model-based systems for semi-structured data. A generic framework is proposed in this thesis for the processing and evaluation of semi-structured hierarchical data by language models in the context of extractive question answering. 
The framework is specified as a modular architecture comprising a data preparation component for transforming heterogeneous markup and tabular sources into a canonical markup-rich representation and, where required, synthesising labelled question answering pairs; a model training component for configuring and adapting pre-trained models; and a performance evaluation component for computing text-based and structure-aware metrics as well as organising structured experimental comparisons. The framework is intended to provide a principled basis on which markup-aware question answering systems may be developed and analysed across application domains. A proof-of-concept instantiation of the framework is implemented and subjected to verification and validation. Verification is conducted by applying the instantiation to a web-based HTML question answering benchmark, confirming that performance comparable with reported baselines is attained and that discarding structural information in favour of text-only input leads to measurable degradation. The practical utility and robustness of the framework are then assessed by carrying out various case studies involving semi-structured tables, combined tabular and textual sources, and synthetic relational data. Across these studies, configurations that exploit markup structure consistently yield higher scores in respect of standard evaluation metrics, thereby supporting the contention that structural information of semi-structured documents constitutes a primary signal for language model-based extractive question answering. Masters 2026-04-10T06:32:36Z 2026-04-10T06:32:36Z 2026-03 Thesis https://scholar.sun.ac.za/handle/10019.1/135772 en Stellenbosch University 247 pages : ill. application/pdf Stellenbosch : Stellenbosch University
spellingShingle	Du Plessis, Stephan Visser A framework for evaluating semi-structured hierarchical data using language models
title	A framework for evaluating semi-structured hierarchical data using language models
title_full	A framework for evaluating semi-structured hierarchical data using language models
title_fullStr	A framework for evaluating semi-structured hierarchical data using language models
title_full_unstemmed	A framework for evaluating semi-structured hierarchical data using language models
title_short	A framework for evaluating semi-structured hierarchical data using language models
title_sort	framework for evaluating semi structured hierarchical data using language models
url	https://scholar.sun.ac.za/handle/10019.1/135772
work_keys_str_mv	AT duplessisstephanvisser aframeworkforevaluatingsemistructuredhierarchicaldatausinglanguagemodels AT duplessisstephanvisser frameworkforevaluatingsemistructuredhierarchicaldatausinglanguagemodels

Similar Items