Automated feature synthesis on big data using cloud computing resources

Bibliographic Details
Main Author:	Saker, Vanessa
Other Authors:	Berman, Sonia
Format:	Thesis
Language:	English
Published:	University of Cape Town 2020
Subjects:	Computer Science Data Analytics Cloud Computing Big Data

_version_	1867613248991264769
access_status_str	Open Access
author	Saker, Vanessa
author2	Berman, Sonia
author_browse	Berman, Sonia Saker, Vanessa
author_facet	Berman, Sonia Saker, Vanessa
author_sort	Saker, Vanessa
collection	Thesis
description	The data analytics process has many time-consuming steps. Combining data that sits in a relational database warehouse into a single relation, while aggregating important information in a meaningful way and preserving relationships across relations, is complex and time-consuming. This step is exceptionally important as many machine learning algorithms require a single file as input (e.g. supervised and unsupervised learning, feature representation and feature learning, etc.). An analyst is required to manually combine relations while generating new, more impactful information points from data during the feature synthesis phase of the feature engineering process that precedes machine learning. Furthermore, the entire process is complicated by Big Data factors such as processing power and distributed data storage. There is an open-source package, Featuretools, that uses an innovative algorithm called Deep Feature Synthesis to accelerate the feature engineering step. However, when working with Big Data, there are two major limitations. The first is the curse of modularity: Featuretools stores data in memory to process it, so large datasets require a processing unit with a large memory. Secondly, the package is dependent on data stored in a Pandas DataFrame. This makes the use of Featuretools with Big Data tools such as Apache Spark a challenge. This dissertation aims to examine the viability and effectiveness of using Featuretools for feature synthesis with Big Data on the cloud computing platform AWS. Exploring the impact of generated features is a critical first step in solving any data analytics problem. If this can be automated in a distributed Big Data environment with a reasonable investment of time and funds, data analytics exercises will benefit considerably. In this dissertation, a framework for automated feature synthesis with Big Data is proposed and an experiment is conducted to examine its viability.
Using this framework, an infrastructure was built to support the process of feature synthesis on AWS that made use of S3 storage buckets, Elastic Compute Cloud (EC2) services, and an Elastic MapReduce cluster. A dataset of 95 million customers, 34 thousand fraud cases and 5.5 million transactions across three different relations was then loaded into the distributed relational database on the platform. The infrastructure was used to show how the dataset could be prepared to represent a business problem, and Featuretools was used to generate a single feature matrix suitable for inclusion in a machine learning pipeline. The results show that the approach was viable. The feature matrix contained 75 features generated from 12 input variables, and the process was time-efficient, with a total end-to-end run time of 3.5 hours and a cost of approximately R 814 (approximately $52). The framework can be applied to a different set of data and allows analysts to experiment on a small section of the data until a final feature set is decided. They can then easily scale the feature matrix to the full dataset. This ability to automate feature synthesis, iterate and scale up will save time in the analytics process while providing a richer feature set for better machine learning results.
format	Thesis
id	oai:open.uct.ac.za:11427/32452
institution	University of Cape Town (South Africa)
language	eng
last_indexed	2026-06-10T12:33:08.525Z
license_str	Not specified — see source repository
provenance_str_mv	Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate	2020
publishDateRange	2020
publishDateSort	2020
publisher	University of Cape Town
publisherStr	University of Cape Town
record_format	dspace
source_str	UCTD — University of Cape Town Open Access Repository
spelling	oai:open.uct.ac.za:11427/32452 Automated feature synthesis on big data using cloud computing resources Saker, Vanessa Berman, Sonia Computer Science Data Analytics Cloud Computing Big Data 2020-12-30T10:17:56Z 2020-12-30T10:17:56Z 2020 Master Thesis Masters MSc http://hdl.handle.net/11427/32452 eng application/pdf University of Cape Town Department of Statistical Sciences Faculty of Science
spellingShingle	Computer Science Data Analytics Cloud Computing Big Data Saker, Vanessa Automated feature synthesis on big data using cloud computing resources
thesis_degree_str	Master's
title	Automated feature synthesis on big data using cloud computing resources
title_full	Automated feature synthesis on big data using cloud computing resources
title_fullStr	Automated feature synthesis on big data using cloud computing resources
title_full_unstemmed	Automated feature synthesis on big data using cloud computing resources
title_short	Automated feature synthesis on big data using cloud computing resources
title_sort	automated feature synthesis on big data using cloud computing resources
topic	Computer Science Data Analytics Cloud Computing Big Data
url	http://hdl.handle.net/11427/32452
work_keys_str_mv	AT sakervanessa automatedfeaturesynthesisonbigdatausingcloudcomputingresources

Similar Items