Full Text Available

Dataset selection for aggregate model implementation in predictive data mining

Thesis (PhD)--University of Pretoria, 2010.

Bibliographic Details
Other Authors:	Engelbrecht, Andries P.
Format:	Thesis
Published:	University of Pretoria 2013
Subjects:	Dataset partitioning; Data mining; Bias reduction; Predictive modeling; Classification; Model aggregation; Ensemble classifiers; OVA classification; pVn classification; Dataset selection; Feature selection; Variable selection; Large datasets; Variance reduction; Dataset sampling; UCTD

_version_	1867613530795016192
access_status_str	Open Access
author2	Engelbrecht, Andries P.
author_browse	Engelbrecht, Andries P.
author_facet	Engelbrecht, Andries P.
collection	Thesis
dc_rights_str_mv	© 2010, University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
description	Thesis (PhD)--University of Pretoria, 2010.
format	Thesis
id	oai:repository.up.ac.za:2263/29486
institution	University of Pretoria (South Africa)
last_indexed	2026-06-10T12:37:37.270Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from UPSpace — University of Pretoria Institutional Repository
publishDate	2013
publishDateRange	2013
publishDateSort	2013
publisher	University of Pretoria
publisherStr	University of Pretoria
record_format	dspace
source_str	UPSpace — University of Pretoria Institutional Repository
spelling	oai:repository.up.ac.za:2263/29486 Dataset selection for aggregate model implementation in predictive data mining Engelbrecht, Andries P. plutu@cs.up.ac.za Lutu, Patricia Elizabeth Nalwoga Dataset partitioning Data mining Bias reduction Predictive modeling Classification Model aggregation Ensemble classifiers OVA classification pVn classification Dataset selection Feature selection Variable selection Large datasets Variance reduction Dataset sampling UCTD Thesis (PhD)--University of Pretoria, 2010. Data mining has become a commonly used method for the analysis of organisational data, for purposes of summarizing data in useful ways and identifying non-trivial patterns and relationships in the data. Given the large volumes of data that are collected by business, government, non-government and scientific research organizations, a major challenge for data mining researchers and practitioners is how to select relevant data for analysis in sufficient quantities, in order to meet the objectives of a data mining task. This thesis addresses the problem of dataset selection for predictive data mining. Dataset selection was studied in the context of aggregate modeling for classification. The central argument of this thesis is that, for predictive data mining, it is possible to systematically select many dataset samples and employ different approaches (different from current practice) to feature selection, training dataset selection, and model construction. When a large amount of information in a large dataset is utilised in the modeling process, the resulting models will have a high level of predictive performance and should be more reliable. Aggregate classification models, also known as ensemble classifiers, have been shown to provide a high level of predictive accuracy on small datasets. Such models are known to achieve a reduction in the bias and variance components of the prediction error of a model.
The research for this thesis was aimed at the design of aggregate models and the selection of training datasets from large amounts of available data. The objectives for the model design and dataset selection were to reduce the bias and variance components of the prediction error for the aggregate models. Design science research was adopted as the paradigm for the research. Large datasets obtained from the UCI KDD Archive were used in the experiments. Two classification algorithms, See5 for classification tree modeling and K-Nearest Neighbour, were used in the experiments. The two methods of aggregate modeling that were studied are One-Vs-All (OVA) and positive-Vs-negative (pVn) modeling. While OVA is an existing method that has been used for small datasets, pVn is a new method of aggregate modeling, proposed in this thesis. Methods for feature selection from large datasets, and methods for training dataset selection from large datasets, for OVA and pVn aggregate modeling, were studied. The experiments on feature selection revealed that the use of many samples, robust measures of correlation, and validation procedures results in the reliable selection of relevant features for classification. A new algorithm for feature subset search, based on the decision rule-based approach to heuristic search, was designed and the performance of this algorithm was compared to two existing algorithms for feature subset search. The experimental results revealed that the new algorithm makes better decisions for feature subset search. The information provided by a confusion matrix was used as a basis for the design of OVA and pVn base models, which are combined into one aggregate model. A new construct called a confusion graph was used in conjunction with new algorithms for the design of pVn base models. A new algorithm for combining base model predictions and resolving conflicting predictions was designed and implemented.
Experiments to study the performance of the OVA and pVn aggregate models revealed that the aggregate models provide a high level of predictive accuracy compared to single models. Finally, theoretical models to depict the relationships between the factors that influence feature selection and training dataset selection for aggregate models are proposed, based on the experimental results. Computer Science unrestricted 2013-09-07T15:45:36Z 2010-11-15 2013-09-07T15:45:36Z 2010-09-02 2010-11-15 2010-11-15 Thesis Lutu, PEN 2010, Dataset selection for aggregate model implementation in predictive data mining, PhD thesis, University of Pretoria, Pretoria, viewed yymmdd < http://hdl.handle.net/2263/29486 > D10/775/gm http://hdl.handle.net/2263/29486 http://upetd.up.ac.za/thesis/available/etd-11152010-203041/ © 2010, University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. application/pdf University of Pretoria
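The One-Vs-All (OVA) aggregation mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: the function names, the toy data, the shared training set, and the conflict-resolution rule (highest score wins, ties broken by class-name order) are all assumptions made for illustration. Each base model here is a small K-Nearest Neighbour scorer for one class against the rest.

```python
import math

def knn_positive_score(train, query, positive_class, k=3):
    """Binary base model (hypothetical): fraction of the k nearest
    training points whose label equals positive_class."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(1 for _, label in nearest if label == positive_class) / k

def ova_predict(train, query, k=3):
    """Aggregate model: one class-vs-rest base model per class."""
    classes = sorted({label for _, label in train})
    scores = {c: knn_positive_score(train, query, c, k) for c in classes}
    # Assumed conflict resolution: the class with the highest score wins;
    # max() returns the first maximum, so ties fall to class-name order.
    best = max(classes, key=lambda c: scores[c])
    return best, scores

# Toy training data (made up): ((feature_1, feature_2), class_label)
TRAIN = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"), ((1.1, 0.9), "b"), ((0.9, 1.1), "b"),
    ((2.0, 0.0), "c"), ((2.1, 0.1), "c"), ((1.9, 0.2), "c"),
]

if __name__ == "__main__":
    label, scores = ova_predict(TRAIN, (0.05, 0.1))
    print(label, scores)
```

In this sketch every base model sees the same training set, so the result matches ordinary K-Nearest Neighbour voting. The abstract describes selecting training datasets and features separately for the base models, which a fuller implementation would do before scoring.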
spellingShingle	Dataset partitioning Data mining Bias reduction Predictive modeling Classification Model aggregation Ensemble classifiers OVA classification pVn classification Dataset selection Feature selection Variable selection Large datasets Variance reduction Dataset sampling UCTD Dataset selection for aggregate model implementation in predictive data mining
title	Dataset selection for aggregate model implementation in predictive data mining
title_full	Dataset selection for aggregate model implementation in predictive data mining
title_fullStr	Dataset selection for aggregate model implementation in predictive data mining
title_full_unstemmed	Dataset selection for aggregate model implementation in predictive data mining
title_short	Dataset selection for aggregate model implementation in predictive data mining
title_sort	dataset selection for aggregate model implementation in predictive data mining
topic	Dataset partitioning; Data mining; Bias reduction; Predictive modeling; Classification; Model aggregation; Ensemble classifiers; OVA classification; pVn classification; Dataset selection; Feature selection; Variable selection; Large datasets; Variance reduction; Dataset sampling; UCTD
url	http://hdl.handle.net/2263/29486 http://upetd.up.ac.za/thesis/available/etd-11152010-203041/
