Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation

Thesis (MEng)--Stellenbosch University, 2026.

Bibliographic Details
Main Author:	Sims-Handcock, Chad Calvin
Other Authors:	Theart, Rensu
Format:	Thesis
Language:	English
Published:	Stellenbosch : Stellenbosch University 2026

_version_	1867613911359946752
access_status_str	Open Access
author	Sims-Handcock, Chad Calvin
author2	Theart, Rensu
author_browse	Sims-Handcock, Chad Calvin Theart, Rensu
author_facet	Theart, Rensu Sims-Handcock, Chad Calvin
author_sort	Sims-Handcock, Chad Calvin
collection	Thesis
dc_rights_str_mv	Stellenbosch University
description	Thesis (MEng)--Stellenbosch University, 2026.
format	Thesis
id	oai:scholar.sun.ac.za:10019.1/135845
institution	Stellenbosch University (South Africa)
language	English
last_indexed	2026-06-10T12:43:40.048Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate	2026
publishDateRange	2026
publishDateSort	2026
publisher	Stellenbosch : Stellenbosch University
publisherStr	Stellenbosch : Stellenbosch University
record_format	dspace
source_str	SUNScholar — Stellenbosch University Repository
spelling	oai:scholar.sun.ac.za:10019.1/135845 Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation Sims-Handcock, Chad Calvin Theart, Rensu Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Thesis (MEng)--Stellenbosch University, 2026. Sims-Handcock, C. C. 2026. Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/6c37757d-0f52-4835-9999-59bdab30e11c The increasing reliance on cloud environments for data-driven applications has created a critical tension between operational efficiency and regulatory compliance. Organisations require high-quality, representative data for effective software testing, but traditional Test Data Management (TDM) methods are often resource-intensive and tend to produce low-quality, overly curated test datasets. The use of production data for testing poses significant legal and ethical risks due to the presence of Personally Identifiable Information (PII). Synthetic data – artificially generated data that statistically mimics real-world data without compromising PII – offers a compelling solution. However, generating realistic, high-quality tabular data is a non-trivial task, particularly when the source data is incomplete and messy. This thesis addresses these challenges by developing, implementing, and rigorously evaluating an end-to-end AI-driven pipeline for generating high-quality synthetic tabular data. The pipeline is modular and cloud-native, incorporating robust preprocessing techniques and Missing Value Imputation (MVI) as foundational steps. 
A formal evaluation framework was developed to assess the quality of synthetic data based on three core dimensions: Fidelity – measuring statistical similarities between synthetic and real data; Utility – measuring performance in downstream machine learning tasks; and Privacy – measuring empirical privacy risks. The experimental results reveal that the quality of the final synthetic data is highly dependent on the initial imputation step, with the Mice Forest algorithm significantly outperforming naive row deletion. In a comparative analysis of generative models, the Tabular Variational Autoencoder (TVAE) emerged as the leading generalist model, achieving the highest fidelity and classification utility. Gaussian Copula, however, demonstrated task-specific excellence in regression tasks where preserving explicit correlation is essential. Importantly, this research provides strong empirical evidence that a trade-off between fidelity and privacy is not inevitable; it is possible to achieve both high fidelity and low privacy risk simultaneously. This thesis establishes a robust foundation for a comprehensive framework designed to enable efficient testing and promote data collaboration through high-quality, privacy-preserving synthetic data. It provides a validated mechanism for generating superior test data, mitigating the high costs and compliance risks associated with traditional TDM. Masters 2026-04-13T09:58:20Z 2026-04-13T09:58:20Z 2026-03 Thesis https://scholar.sun.ac.za/handle/10019.1/135845 en Stellenbosch University 171 pages application/pdf Stellenbosch : Stellenbosch University
spellingShingle	Sims-Handcock, Chad Calvin Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation
title	Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation
title_full	Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation
title_fullStr	Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation
title_full_unstemmed	Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation
title_short	Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation
title_sort	design and evaluation of an ai driven pipeline for synthetic tabular data generation
url	https://scholar.sun.ac.za/handle/10019.1/135845
work_keys_str_mv	AT simshandcockchadcalvin designandevaluationofanaidrivenpipelineforsynthetictabulardatageneration

Similar Items