Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation

Thesis (MEng)--Stellenbosch University, 2026.

Bibliographic Details
Main Author: Sims-Handcock, Chad Calvin
Other Authors: Theart, Rensu
Format: Thesis
Language: English
Published: Stellenbosch : Stellenbosch University 2026
_version_ 1867613911359946752
access_status_str Open Access
author Sims-Handcock, Chad Calvin
author2 Theart, Rensu
author_browse Sims-Handcock, Chad Calvin
Theart, Rensu
author_facet Theart, Rensu
Sims-Handcock, Chad Calvin
author_sort Sims-Handcock, Chad Calvin
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MEng)--Stellenbosch University, 2026.
format Thesis
id oai:scholar.sun.ac.za:10019.1/135845
institution Stellenbosch University (South Africa)
language English
last_indexed 2026-06-10T12:43:40.048Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2026
publishDateRange 2026
publishDateSort 2026
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/135845 Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation Sims-Handcock, Chad Calvin Theart, Rensu Stellenbosch University. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Thesis (MEng)--Stellenbosch University, 2026. Sims-Handcock, C. C. 2026. Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation. Unpublished masters thesis. Stellenbosch: Stellenbosch University [online]. Available: https://scholar.sun.ac.za/items/6c37757d-0f52-4835-9999-59bdab30e11c The increasing reliance on cloud environments for data-driven applications has created a critical tension between operational efficiency and regulatory compliance. Organisations require high-quality, representative data for effective software testing, but traditional Test Data Management (TDM) methods are often resource-intensive and tend to produce low-quality, overly curated test datasets. The use of production data for testing poses significant legal and ethical risks due to the presence of Personally Identifiable Information (PII). Synthetic data – artificially generated data that statistically mimics real-world data without compromising PII – offers a compelling solution. However, generating realistic, high-quality tabular data is a non-trivial task, particularly when the source data is incomplete and messy. This thesis addresses these challenges by developing, implementing, and rigorously evaluating an end-to-end AI-driven pipeline for generating high-quality synthetic tabular data. The pipeline is modular and cloud-native, incorporating robust preprocessing techniques and Missing Value Imputation (MVI) as foundational steps. 
A formal evaluation framework was developed to assess the quality of synthetic data based on three core dimensions: Fidelity – measuring statistical similarities between synthetic and real data; Utility – measuring performance in downstream machine learning tasks; and Privacy – measuring empirical privacy risks. The experimental results reveal that the quality of the final synthetic data is highly dependent on the initial imputation step, with the Mice Forest algorithm significantly outperforming naive row deletion. In a comparative analysis of generative models, the Tabular Variational Autoencoder (TVAE) emerged as the leading generalist model, achieving the highest fidelity and classification utility. Gaussian Copula, however, demonstrated task-specific excellence in regression tasks where preserving explicit correlation is essential. Importantly, this research provides strong empirical evidence that a trade-off between fidelity and privacy is not inevitable; it is possible to achieve both high fidelity and low privacy risk simultaneously. This thesis establishes a robust foundation for a comprehensive framework designed to enable efficient testing and promote data collaboration through high-quality, privacy-preserving synthetic data. It provides a validated mechanism for generating superior test data, mitigating the high costs and compliance risks associated with traditional TDM. Masters 2026-04-13T09:58:20Z 2026-04-13T09:58:20Z 2026-03 Thesis https://scholar.sun.ac.za/handle/10019.1/135845 en Stellenbosch University 171 pages application/pdf Stellenbosch : Stellenbosch University
spellingShingle Sims-Handcock, Chad Calvin
Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation
title Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation
title_full Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation
title_fullStr Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation
title_full_unstemmed Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation
title_short Design and Evaluation of an AI-Driven Pipeline for Synthetic Tabular Data Generation
title_sort design and evaluation of an ai driven pipeline for synthetic tabular data generation
url https://scholar.sun.ac.za/handle/10019.1/135845
work_keys_str_mv AT simshandcockchadcalvin designandevaluationofanaidrivenpipelineforsynthetictabulardatageneration