Thesis (MEng)--Stellenbosch University, 2026.
| Main Author: | Sims-Handcock, Chad Calvin |
|---|---|
| Other Authors: | Theart, Rensu |
| Format: | Thesis |
| Language: | English |
| Published: | Stellenbosch : Stellenbosch University, 2026 |
| _version_ | 1867613911359946752 |
|---|---|
| access_status_str | Open Access |
| author | Sims-Handcock, Chad Calvin |
| author2 | Theart, Rensu |
| author_browse | Sims-Handcock, Chad Calvin Theart, Rensu |
| author_facet | Theart, Rensu Sims-Handcock, Chad Calvin |
| author_sort | Sims-Handcock, Chad Calvin |
| collection | Thesis |
| dc_rights_str_mv | Stellenbosch University |
| description | Thesis (MEng)--Stellenbosch University, 2026. |
| format | Thesis |
| id | oai:scholar.sun.ac.za:10019.1/135845 |
| institution | Stellenbosch University (South Africa) |
| language | English |
| last_indexed | 2026-06-10T12:43:40.048Z |
| license_str | Other — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository |
| publishDate | 2026 |
| publishDateRange | 2026 |
| publishDateSort | 2026 |
| publisher | Stellenbosch : Stellenbosch University |
| publisherStr | Stellenbosch : Stellenbosch University |
| record_format | dspace |
| source_str | SUNScholar — Stellenbosch University Repository |
| spelling | oai:scholar.sun.ac.za:10019.1/135845 Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation Sims-Handcock, Chad Calvin Theart, Rensu Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Thesis (MEng)--Stellenbosch University, 2026. Sims-Handcock, C. C. 2026. Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/6c37757d-0f52-4835-9999-59bdab30e11c The increasing reliance on cloud environments for data-driven applications has created a critical tension between operational efficiency and regulatory compliance. Organisations require high-quality, representative data for effective software testing, but traditional Test Data Management (TDM) methods are often resource-intensive and tend to produce low-quality, overly curated test datasets. The use of production data for testing poses significant legal and ethical risks due to the presence of Personally Identifiable Information (PII). Synthetic data – artificially generated data that statistically mimics real-world data without compromising PII – offers a compelling solution. However, generating realistic, high-quality tabular data is a non-trivial task, particularly when the source data is incomplete and messy. This thesis addresses these challenges by developing, implementing, and rigorously evaluating an end-to-end AI-driven pipeline for generating high-quality synthetic tabular data. The pipeline is modular and cloud-native, incorporating robust preprocessing techniques and Missing Value Imputation (MVI) as foundational steps. A formal evaluation framework was developed to assess the quality of synthetic data based on three core dimensions: Fidelity – measuring statistical similarities between synthetic and real data; Utility – measuring performance in downstream machine learning tasks; and Privacy – measuring empirical privacy risks. The experimental results reveal that the quality of the final synthetic data is highly dependent on the initial imputation step, with the Mice Forest algorithm significantly outperforming naive row deletion. In a comparative analysis of generative models, the Tabular Variational Autoencoder (TVAE) emerged as the leading generalist model, achieving the highest fidelity and classification utility. Gaussian Copula, however, demonstrated task-specific excellence in regression tasks where preserving explicit correlation is essential. Importantly, this research provides strong empirical evidence that a trade-off between fidelity and privacy is not inevitable; it is possible to achieve both high fidelity and low privacy risk simultaneously. This thesis establishes a robust foundation for a comprehensive framework designed to enable efficient testing and promote data collaboration through high-quality, privacy-preserving synthetic data. It provides a validated mechanism for generating superior test data, mitigating the high costs and compliance risks associated with traditional TDM. Masters 2026-04-13T09:58:20Z 2026-04-13T09:58:20Z 2026-03 Thesis https://scholar.sun.ac.za/handle/10019.1/135845 en Stellenbosch University 171 pages application/pdf Stellenbosch : Stellenbosch University |
| spellingShingle | Sims-Handcock, Chad Calvin Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation |
| title | Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation |
| title_full | Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation |
| title_fullStr | Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation |
| title_full_unstemmed | Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation |
| title_short | Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation |
| title_sort | design and evaluation of an ai driven pipeline for synthetic tabular data generation |
| url | https://scholar.sun.ac.za/handle/10019.1/135845 |
| work_keys_str_mv | AT simshandcockchadcalvin designandevaluationofanaidrivenpipelineforsynthetictabulardatageneration |