A framework for evaluating semi-structured hierarchical data using language models

Thesis (MEng)--Stellenbosch University, 2026.

Bibliographic Details
Main Author: Du Plessis, Stephan Visser
Other Authors: Van Vuuren, J. H.
Format: Thesis
Language: English
Published: Stellenbosch : Stellenbosch University 2026
_version_ 1867613843957481472
access_status_str Open Access
author Du Plessis, Stephan Visser
author2 Van Vuuren, J. H.
author_browse Du Plessis, Stephan Visser
Van Vuuren, J. H.
author_facet Van Vuuren, J. H.
Du Plessis, Stephan Visser
author_sort Du Plessis, Stephan Visser
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MEng)--Stellenbosch University, 2026.
format Thesis
id oai:scholar.sun.ac.za:10019.1/135772
institution Stellenbosch University (South Africa)
language English
last_indexed 2026-06-10T12:42:35.472Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2026
publishDateRange 2026
publishDateSort 2026
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/135772 A framework for evaluating semi-structured hierarchical data using language models Du Plessis, Stephan Visser Van Vuuren, J. H. Nel, G. S. Stellenbosch University. Faculty of Engineering. Dept. of Industrial Engineering. Thesis (MEng)--Stellenbosch University, 2026. Du Plessis, S. V. 2026. A framework for evaluating semi-structured hierarchical data using language models. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/413a765e-7ac0-4753-81a4-d243d167136c Recent advances in large language models have intensified interest in applying them to tasks such as extractive question answering over semi-structured hierarchical data represented in markup languages. The conventional practice of linearising markup to plain text involves removing structural information that is integral to interpretation, and existing structure-aware approaches are often supported by ad hoc, task-specific pipelines. Limited guidance is available on how heterogeneous sources should be transformed into structure-preserving representations, how alternative model adaptation strategies should be compared, and how the impact of structural information should be quantified in a reproducible manner. This has resulted in fragmented methodologies for the development and assessment of language model-based systems for semi-structured data. A generic framework is proposed in this thesis for the processing and evaluation of semi-structured hierarchical data by language models in the context of extractive question answering. 
The framework is specified as a modular architecture comprising a data preparation component for transforming heterogeneous markup and tabular sources into a canonical markup-rich representation and, where required, synthesising labelled question answering pairs; a model training component for configuring and adapting pre-trained models; and a performance evaluation component for computing text-based and structure-aware metrics as well as organising structured experimental comparisons. The framework is intended to provide a principled basis on which markup-aware question answering systems may be developed and analysed across application domains. A proof-of-concept instantiation of the framework is implemented and subjected to verification and validation. Verification is conducted by applying the instantiation to a web-based HTML question answering benchmark, confirming that performance comparable with reported baselines is attained and that discarding structural information in favour of text-only input leads to measurable degradation. The practical utility and robustness of the framework are then assessed by carrying out various case studies involving semi-structured tables, combined tabular and textual sources, and synthetic relational data. Across these studies, configurations that exploit markup structure consistently yield higher scores in respect of standard evaluation metrics, thereby supporting the contention that structural information of semi-structured documents constitutes a primary signal for language model-based extractive question answering. Masters 2026-04-10T06:32:36Z 2026-04-10T06:32:36Z 2026-03 Thesis https://scholar.sun.ac.za/handle/10019.1/135772 en Stellenbosch University 247 pages : ill. application/pdf Stellenbosch : Stellenbosch University
spellingShingle Du Plessis, Stephan Visser
A framework for evaluating semi-structured hierarchical data using language models
title A framework for evaluating semi-structured hierarchical data using language models
title_full A framework for evaluating semi-structured hierarchical data using language models
title_fullStr A framework for evaluating semi-structured hierarchical data using language models
title_full_unstemmed A framework for evaluating semi-structured hierarchical data using language models
title_short A framework for evaluating semi-structured hierarchical data using language models
title_sort framework for evaluating semi structured hierarchical data using language models
url https://scholar.sun.ac.za/handle/10019.1/135772
work_keys_str_mv AT duplessisstephanvisser aframeworkforevaluatingsemistructuredhierarchicaldatausinglanguagemodels
AT duplessisstephanvisser frameworkforevaluatingsemistructuredhierarchicaldatausinglanguagemodels