The variable selection problem and the application of the roc curve for binary outcome variables

Dissertation (MSc)--University of Pretoria, 2008.

Bibliographic Details
Other Authors: Groeneveld, Hendrik T.
Format: Thesis
Published: University of Pretoria 2013
Subjects: Logistic regression, Parameter planning, UCTD
_version_ 1867613476164206592
access_status_str Open Access
author2 Groeneveld, Hendrik T.
author_browse Groeneveld, Hendrik T.
author_facet Groeneveld, Hendrik T.
collection Thesis
dc_rights_str_mv © University of Pretoria 2006E892 /
description Dissertation (MSc)--University of Pretoria, 2008.
format Thesis
id oai:repository.up.ac.za:2263/27133
institution University of Pretoria (South Africa)
last_indexed 2026-06-10T12:36:45.136Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from UPSpace — University of Pretoria Institutional Repository
publishDate 2013
publishDateRange 2013
publishDateSort 2013
publisher University of Pretoria
publisherStr University of Pretoria
record_format dspace
source_str UPSpace — University of Pretoria Institutional Repository
spelling oai:repository.up.ac.za:2263/27133 The variable selection problem and the application of the roc curve for binary outcome variables Groeneveld, Hendrik T. Van der Merwe, A.J. jmatshego@lantic.net Matshego, James Moeng Logistic regression Parameter planning UCTD Dissertation (MSc)--University of Pretoria, 2008.
Variable selection refers to the problem of selecting the input variables that are most predictive of a given outcome. Variable selection problems arise in all machine learning tasks, supervised or unsupervised, including classification, regression, time series prediction, and two-class or multi-class problems, and they pose challenges of varying difficulty. Variable selection is related to the problems of input dimensionality reduction and of parameter planning, but it has practical and theoretical challenges of its own. From the practical point of view, eliminating variables may reduce the cost of producing the outcome and increase its speed, whereas reducing the dimensionality of the input space does not address these problems. Theoretical challenges include estimating with what confidence one can state that a variable is relevant to the concept when it is useful to the outcome, and providing a theoretical understanding of the stability of selected variable subsets. As the probability cut-point increases in value, it becomes more likely that an observation is classified as a non-event by the selected variables.
The mathematical statement of the problem is not widely agreed upon and may depend on the application. One typically distinguishes: i) the problem of discovering all the variables relevant to the outcome variable and determining how relevant they are and how they are related to each other; ii) the problem of finding a minimum subset of variables that is useful to the outcome variable.
Logistic regression is an increasingly popular statistical technique used to model the probability of a discrete binary outcome. Logistic regression applies maximum likelihood estimation after transforming the outcome variable into a logit variable. In this way, logistic regression estimates the probability of a certain event. When properly applied, logistic regression analyses yield powerful insight into which variables are more or less likely to predict the event outcome in a population of interest. These models also show the extent to which changes in the values of the variables may increase or decrease the predicted probability of the event outcome. Variable selection, in all its facets, is similarly important with logistic regression.
The receiver operating characteristic (ROC) curve is a graphic display that gives a measure of the predictive accuracy of a logistic regression model. It is a measure of classification performance; the area under the ROC curve (AUC) is a scalar measure gauging one facet of performance. Another measure of the predictive accuracy of a logistic regression model is a classification table. It uses the model to classify observations as events if their estimated probability is greater than or equal to a given probability cut-point; otherwise observations are classified as non-events. This technique, as it appears in the literature, is also studied in this thesis.
In this thesis the issue of variable selection, both for continuous and binary outcome variables, is investigated as it appears in the statistical literature. It is clear that this topic has been widely researched and still remains a feature of modern research. The last word certainly hasn’t been spoken.
Statistics unrestricted 2013-09-07T10:40:44Z 2008-08-11 2013-09-07T10:40:44Z 2007-09-06 2008-08-11 2008-08-11 Dissertation a 2006E892 /ag http://hdl.handle.net/2263/27133 http://upetd.up.ac.za/thesis/available/etd-08112008-104847/ © University of Pretoria 2006E892 / application/pdf University of Pretoria
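The classification table and ROC curve described in the abstract above can be illustrated with a minimal Python sketch. This is not code from the thesis: the function names, the toy outcomes, and the estimated probabilities are hypothetical, and the probabilities are assumed to come from an already fitted logistic regression model. An observation is classified as an event when its estimated probability is greater than or equal to the cut-point, and the AUC is approximated with the trapezoidal rule over the empirical ROC points.

```python
from typing import Dict, List, Tuple


def classification_table(y_true: List[int], p_hat: List[float],
                         cutpoint: float) -> Dict[str, int]:
    """Count true/false positives and negatives at one probability cut-point.

    An observation is classified as an event (1) when p_hat >= cutpoint,
    otherwise as a non-event (0).
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(y_true, p_hat):
        predicted_event = p >= cutpoint
        if predicted_event and y == 1:
            counts["tp"] += 1
        elif predicted_event and y == 0:
            counts["fp"] += 1
        elif not predicted_event and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def roc_points(y_true: List[int], p_hat: List[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs.

    One point per distinct estimated probability used as the cut-point,
    from the highest cut-point to the lowest, starting at (0, 0).
    Assumes y_true contains at least one event and one non-event.
    """
    n_events = sum(y_true)
    n_non_events = len(y_true) - n_events
    points = [(0.0, 0.0)]
    for cutpoint in sorted(set(p_hat), reverse=True):
        t = classification_table(y_true, p_hat, cutpoint)
        points.append((t["fp"] / n_non_events, t["tp"] / n_events))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy data: observed outcomes and estimated event probabilities.
y_true = [1, 1, 0, 1, 0, 0]
p_hat = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(classification_table(y_true, p_hat, 0.5))
print(classification_table(y_true, p_hat, 0.85))
print(auc(roc_points(y_true, p_hat)))
```

Raising the cut-point from 0.5 to 0.85 in this toy data moves observations from the event column to the non-event column, which is the behaviour the abstract describes for increasing cut-points.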
spellingShingle Logistic regression
Parameter planning
UCTD
The variable selection problem and the application of the roc curve for binary outcome variables
title The variable selection problem and the application of the roc curve for binary outcome variables
title_sort variable selection problem and the application of the roc curve for binary outcome variables
topic Logistic regression
Parameter planning
UCTD
url http://hdl.handle.net/2263/27133
http://upetd.up.ac.za/thesis/available/etd-08112008-104847/