Linear regression techniques for identifying influential data and applications in commercial data analysis

Recent literature contains many publications on techniques for identifying extreme data points (outliers) and influential observations or groups in sample data sets. This thesis begins by reviewing the statistics and distributional properties of the standard techniques, viz. the standardized residua...

Bibliographic Details
Main Author: Jacobs, Michael Kalman
Other Authors: Troskie, Cas
Format: Thesis
Language: English
Published: School of Economics 2023
Subjects: Influential data
_version_ 1867613289494609920
access_status_str Open Access
author Jacobs, Michael Kalman
author2 Troskie, Cas
author_browse Jacobs, Michael Kalman
Troskie, Cas
author_facet Troskie, Cas
Jacobs, Michael Kalman
author_sort Jacobs, Michael Kalman
collection Thesis
description Recent literature contains many publications on techniques for identifying extreme data points (outliers) and influential observations or groups in sample data sets. This thesis begins by reviewing the statistics and distributional properties of the standard techniques, viz. the standardized residual as a test for outliers, and Cook's distance as a measure of influence. An outlier test which is distributionally neater than the standardized residual is proposed. In practical applications, ordinary least squares regression is often inappropriate, and the use of biased estimators may be preferable. In this thesis, the existing theory is extended to several alternative regression techniques. Ridge regression and generalized inverse regression are suitable techniques when the cross-product matrix is ill-conditioned. Restricted least squares regression, with exact or stochastic prior information, is used in many econometric applications. Models with selected variables are used to eliminate design faults or to reduce computational effort. New statistics are developed for all these techniques, the distributional results are proved, and computational formulae are developed. Computational problems may arise in the actual use of the various techniques, and these are investigated. Computer programs written in BASIC and suitable for microcomputer use are presented, making the techniques accessible to virtually any commercial environment. The performance of the various techniques is examined, using a controlled simulation study and a number of practical data sets drawn from several areas of South African commerce. This is, as far as can be ascertained, the first extensive practical South African study on the effects of influential data. It is shown that the presence of outliers or influential data can bias the results of any study significantly.
It is recommended that no data analysis should be attempted without a preliminary scan for outliers and influential observations. The techniques presented can be used advantageously even in data sets where the ultimate analysis does not involve linear regression. It is shown that influential data are not merely of nuisance value in the analysis but may contain valuable information in their own right.
format Thesis
id oai:open.uct.ac.za:11427/38914
institution University of Cape Town (South Africa)
language eng
last_indexed 2026-06-10T12:33:45.686Z
license_str Not specified — see source repository
provenance_str_mv Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate 2023
publishDateRange 2023
publishDateSort 2023
publisher School of Economics
publisherStr School of Economics
record_format dspace
source_str UCTD — University of Cape Town Open Access Repository
spelling oai:open.uct.ac.za:11427/38914 Linear regression techniques for identifying influential data and applications in commercial data analysis Jacobs, Michael Kalman Troskie, Cas Influential data Recent literature contains many publications on techniques for identifying extreme data points (outliers) and influential observations or groups in sample data sets. This thesis begins by reviewing the statistics and distributional properties of the standard techniques, viz. the standardized residual as a test for outliers, and Cook's distance as a measure of influence. An outlier test which is distributionally neater than the standardized residual is proposed. In practical applications, ordinary least squares regression is often inappropriate, and the use of biased estimators may be preferable. In this thesis, the existing theory is extended to several alternative regression techniques. Ridge regression and generalized inverse regression are suitable techniques when the cross-product matrix is ill-conditioned. Restricted least squares regression, with exact or stochastic prior information, is used in many econometric applications. Models with selected variables are used to eliminate design faults or to reduce computational effort. New statistics are developed for all these techniques, the distributional results are proved, and computational formulae are developed. Computational problems may arise in the actual use of the various techniques, and these are investigated. Computer programs written in BASIC and suitable for microcomputer use are presented, making the techniques accessible to virtually any commercial environment. The performance of the various techniques is examined, using a controlled simulation study and a number of practical data sets drawn from several areas of South African commerce. This is, as far as can be ascertained, the first extensive practical South African study on the effects of influential data.
It is shown that the presence of outliers or influential data can bias the results of any study significantly. It is recommended that no data analysis should be attempted without a preliminary scan for outliers and influential observations. The techniques presented can be used advantageously even in data sets where the ultimate analysis does not involve linear regression. It is shown that influential data are not merely of nuisance value in the analysis but may contain valuable information in their own right. 2023-09-27T13:58:18Z 2023-09-27T13:58:18Z 1983 2023-09-27T13:53:04Z Doctoral Thesis Doctoral PhD http://hdl.handle.net/11427/38914 eng application/pdf School of Economics Faculty of Commerce
spellingShingle Influential data
Jacobs, Michael Kalman
Linear regression techniques for identifying influential data and applications in commercial data analysis
thesis_degree_str Doctoral
title Linear regression techniques for identifying influential data and applications in commercial data analysis
title_full Linear regression techniques for identifying influential data and applications in commercial data analysis
title_fullStr Linear regression techniques for identifying influential data and applications in commercial data analysis
title_full_unstemmed Linear regression techniques for identifying influential data and applications in commercial data analysis
title_short Linear regression techniques for identifying influential data and applications in commercial data analysis
title_sort linear regression techniques for identifying influential data and applications in commercial data analysis
topic Influential data
url http://hdl.handle.net/11427/38914
work_keys_str_mv AT jacobsmichaelkalman linearregressiontechniquesforidentifyinginfluentialdataandapplicationsincommercialdataanalysis