Thesis (MEng)--Stellenbosch University, 2026.
| Main Author: | Du Plessis, Stephan Visser |
|---|---|
| Other Authors: | Van Vuuren, J. H. |
| Format: | Thesis |
| Language: | English |
| Published: | Stellenbosch : Stellenbosch University, 2026 |
| _version_ | 1867613843957481472 |
|---|---|
| access_status_str | Open Access |
| author | Du Plessis, Stephan Visser |
| author2 | Van Vuuren, J. H. |
| author_browse | Du Plessis, Stephan Visser Van Vuuren, J. H. |
| author_facet | Van Vuuren, J. H. Du Plessis, Stephan Visser |
| author_sort | Du Plessis, Stephan Visser |
| collection | Thesis |
| dc_rights_str_mv | Stellenbosch University |
| description | Thesis (MEng)--Stellenbosch University, 2026. |
| format | Thesis |
| id | oai:scholar.sun.ac.za:10019.1/135772 |
| institution | Stellenbosch University (South Africa) |
| language | English |
| last_indexed | 2026-06-10T12:42:35.472Z |
| license_str | Other — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository |
| publishDate | 2026 |
| publishDateRange | 2026 |
| publishDateSort | 2026 |
| publisher | Stellenbosch : Stellenbosch University |
| publisherStr | Stellenbosch : Stellenbosch University |
| record_format | dspace |
| source_str | SUNScholar — Stellenbosch University Repository |
| spelling | oai:scholar.sun.ac.za:10019.1/135772 A framework for evaluating semi-structured hierarchical data using language models Du Plessis, Stephan Visser Van Vuuren, J. H. Nel, G. S. Stellenbosch University. Faculty of Engineering. Dept. of Industrial Engineering. Thesis (MEng)--Stellenbosch University, 2026. Du Plessis, S. V. 2026. A framework for evaluating semi-structured hierarchical data using language models. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/413a765e-7ac0-4753-81a4-d243d167136c Recent advances in large language models have intensified interest in applying them to tasks such as extractive question answering over semi-structured hierarchical data represented in markup languages. The conventional practice of linearising markup to plain text involves removing structural information that is integral to interpretation, and existing structure-aware approaches are often supported by ad hoc, task-specific pipelines. Limited guidance is available on how heterogeneous sources should be transformed into structure-preserving representations, how alternative model adaptation strategies should be compared, and how the impact of structural information should be quantified in a reproducible manner. This has resulted in fragmented methodologies for the development and assessment of language model-based systems for semi-structured data. A generic framework is proposed in this thesis for the processing and evaluation of semi-structured hierarchical data by language models in the context of extractive question answering. 
The framework is specified as a modular architecture comprising a data preparation component for transforming heterogeneous markup and tabular sources into a canonical markup-rich representation and, where required, synthesising labelled question answering pairs; a model training component for configuring and adapting pre-trained models; and a performance evaluation component for computing text-based and structure-aware metrics as well as organising structured experimental comparisons. The framework is intended to provide a principled basis on which markup-aware question answering systems may be developed and analysed across application domains. A proof-of-concept instantiation of the framework is implemented and subjected to verification and validation. Verification is conducted by applying the instantiation to a web-based HTML question answering benchmark, confirming that performance comparable with reported baselines is attained and that discarding structural information in favour of text-only input leads to measurable degradation. The practical utility and robustness of the framework are then assessed by carrying out various case studies involving semi-structured tables, combined tabular and textual sources, and synthetic relational data. Across these studies, configurations that exploit markup structure consistently yield higher scores in respect of standard evaluation metrics, thereby supporting the contention that structural information of semi-structured documents constitutes a primary signal for language model-based extractive question answering. Masters 2026-04-10T06:32:36Z 2026-04-10T06:32:36Z 2026-03 Thesis https://scholar.sun.ac.za/handle/10019.1/135772 en Stellenbosch University 247 pages : ill. application/pdf Stellenbosch : Stellenbosch University |
| spellingShingle | Du Plessis, Stephan Visser A framework for evaluating semi-structured hierarchical data using language models |
| title | A framework for evaluating semi-structured hierarchical data using language models |
| title_full | A framework for evaluating semi-structured hierarchical data using language models |
| title_fullStr | A framework for evaluating semi-structured hierarchical data using language models |
| title_full_unstemmed | A framework for evaluating semi-structured hierarchical data using language models |
| title_short | A framework for evaluating semi-structured hierarchical data using language models |
| title_sort | framework for evaluating semi structured hierarchical data using language models |
| url | https://scholar.sun.ac.za/handle/10019.1/135772 |
| work_keys_str_mv | AT duplessisstephanvisser aframeworkforevaluatingsemistructuredhierarchicaldatausinglanguagemodels AT duplessisstephanvisser frameworkforevaluatingsemistructuredhierarchicaldatausinglanguagemodels |