Full Text Available

Linear regression techniques for identifying influential data and applications in commercial data analysis

Recent literature contains many publications on techniques for identifying extreme data points (outliers) and influential observations or groups in sample data sets. This thesis begins by reviewing the statistics and distributional properties of the standard techniques, viz. the standardized residua...

Bibliographic Details
Main Author:	Jacobs, Michael Kalman
Other Authors:	Troskie, Cas
Format:	Thesis
Language:	English
Published:	School of Economics 2023
Subjects:	Influential data

_version_	1867613289494609920
access_status_str	Open Access
author	Jacobs, Michael Kalman
author2	Troskie, Cas
author_browse	Jacobs, Michael Kalman Troskie, Cas
author_facet	Troskie, Cas Jacobs, Michael Kalman
author_sort	Jacobs, Michael Kalman
collection	Thesis
description	Recent literature contains many publications on techniques for identifying extreme data points (outliers) and influential observations or groups in sample data sets. This thesis begins by reviewing the statistics and distributional properties of the standard techniques, viz. the standardized residual as a test for outliers, and Cook's distance as a measure of influence. An outlier test which is distributionally neater than the standardized residual is proposed. In practical applications, ordinary least squares regression is often inappropriate, and the use of biased estimators may be preferable. In this thesis, the existing theory is extended to several alternative regression techniques. Ridge regression and generalized inverse regression are suitable techniques when the cross-product matrix is ill-conditioned. Restricted least squares regression, with exact or stochastic prior information, is used in many econometric applications. Models with selected variables are used to eliminate design faults or to reduce computational effort. New statistics are developed for all these techniques, the distributional results are proved, and computational formulae are developed. Computational problems may arise in the actual use of the various techniques, and these are investigated. Computer programs written in BASIC and suitable for microcomputer use are presented, making the techniques accessible to virtually any commercial environment. The performance of the various techniques is examined, using a controlled simulation study and a number of practical data sets drawn from several areas of South African commerce. This is, as far as can be ascertained, the first extensive practical South African study on the effects of influential data. It is shown that the presence of outliers or influential data can bias the results of any study significantly.
It is recommended that no data analysis should be attempted without a preliminary scan for outliers and influential observations. The techniques presented can be used advantageously even in data sets where the ultimate analysis does not involve linear regression. It is shown that influential data are not merely of nuisance value in the analysis but may contain valuable information in their own right.
format	Thesis
id	oai:open.uct.ac.za:11427/38914
institution	University of Cape Town (South Africa)
language	eng
last_indexed	2026-06-10T12:33:45.686Z
license_str	Not specified — see source repository
provenance_str_mv	Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate	2023
publishDateRange	2023
publishDateSort	2023
publisher	School of Economics
publisherStr	School of Economics
record_format	dspace
source_str	UCTD — University of Cape Town Open Access Repository
spelling	oai:open.uct.ac.za:11427/38914 Linear regression techniques for identifying influential data and applications in commercial data analysis Jacobs, Michael Kalman Troskie, Cas Influential data Recent literature contains many publications on techniques for identifying extreme data points (outliers) and influential observations or groups in sample data sets. This thesis begins by reviewing the statistics and distributional properties of the standard techniques, viz. the standardized residual as a test for outliers, and Cook's distance as a measure of influence. An outlier test which is distributionally neater than the standardized residual is proposed. In practical applications, ordinary least squares regression is often inappropriate, and the use of biased estimators may be preferable. In this thesis, the existing theory is extended to several alternative regression techniques. Ridge regression and generalized inverse regression are suitable techniques when the cross-product matrix is ill-conditioned. Restricted least squares regression, with exact or stochastic prior information, is used in many econometric applications. Models with selected variables are used to eliminate design faults or to reduce computational effort. New statistics are developed for all these techniques, the distributional results are proved, and computational formulae are developed. Computational problems may arise in the actual use of the various techniques, and these are investigated. Computer programs written in BASIC and suitable for microcomputer use are presented, making the techniques accessible to virtually any commercial environment. The performance of the various techniques is examined, using a controlled simulation study and a number of practical data sets drawn from several areas of South African commerce. This is, as far as can be ascertained, the first extensive practical South African study on the effects of influential data.
It is shown that the presence of outliers or influential data can bias the results of any study significantly. It is recommended that no data analysis should be attempted without a preliminary scan for outliers and influential observations. The techniques presented can be used advantageously even in data sets where the ultimate analysis does not involve linear regression. It is shown that influential data are not merely of nuisance value in the analysis but may contain valuable information in their own right. 2023-09-27T13:58:18Z 2023-09-27T13:58:18Z 1983 2023-09-27T13:53:04Z Doctoral Thesis Doctoral PhD http://hdl.handle.net/11427/38914 eng application/pdf School of Economics Faculty of Commerce
spellingShingle	Influential data Jacobs, Michael Kalman Linear regression techniques for identifying influential data and applications in commercial data analysis
thesis_degree_str	Doctoral
title	Linear regression techniques for identifying influential data and applications in commercial data analysis
title_full	Linear regression techniques for identifying influential data and applications in commercial data analysis
title_fullStr	Linear regression techniques for identifying influential data and applications in commercial data analysis
title_full_unstemmed	Linear regression techniques for identifying influential data and applications in commercial data analysis
title_short	Linear regression techniques for identifying influential data and applications in commercial data analysis
title_sort	linear regression techniques for identifying influential data and applications in commercial data analysis
topic	Influential data
url	http://hdl.handle.net/11427/38914
work_keys_str_mv	AT jacobsmichaelkalman linearregressiontechniquesforidentifyinginfluentialdataandapplicationsincommercialdataanalysis

Similar Items