Evaluation of statistical analyses for the identification of surrogates and indicators using historical plant data from a water reclamation plant

Thesis (MEng)--Stellenbosch University, 2017.

Bibliographic Details
Main Author: Coomans, Cornelius Johannes
Other Authors: Auret, Lidia
Format: Thesis
Language: en_ZA
Published: Stellenbosch : Stellenbosch University 2017
_version_ 1867613913284083712
access_status_str Open Access
collection Thesis
dc_rights_str_mv Stellenbosch University
id oai:scholar.sun.ac.za:10019.1/101158
institution Stellenbosch University (South Africa)
last_indexed 2026-06-10T12:43:41.995Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/101158
Evaluation of statistical analyses for the identification of surrogates and indicators using historical plant data from a water reclamation plant
Coomans, Cornelius Johannes
Auret, Lidia
Burger, A. J.
Swartz, C. D.
Stellenbosch University. Faculty of Engineering. Dept. of Process Engineering.
Water reuse -- Statistical methods
Surrogate and indicator variables
Statistics
Surrogate-based optimization
Multivariate statistics
Monitoring techniques
UCTD
Thesis (MEng)--Stellenbosch University, 2017.
ENGLISH SUMMARY: The lag time associated with water quality monitoring at water reclamation plants (WRPs) is a major hurdle to implementing potable water reclamation in areas suffering from water shortages. The application of advanced monitoring techniques, which rely in part on surrogate and indicator variables, is one way of reducing this lag time. The aim of this study was to evaluate statistical analyses that could be used to identify variable relationships, which in turn could be used to develop surrogate and indicator variables, following the data-driven approach. The plant data used in this study were obtained from an existing WRP that has been operational for more than five years without undergoing any major changes to its treatment and operational procedures. An initial assessment found that the data contained large amounts of missing values. The assessment also identified the periods during which the plant was operating under ‘normal’ conditions. Several time periods were removed because abnormal events occurred during them. Pre-processing the data consisted of outlier removal (three-sigma rule and Hampel filter), noise reduction (moving average filter) and missing data replacement (linear interpolation).
The statistical analyses, Pearson’s and Spearman’s correlation, principal component analysis (PCA), linear discriminant analysis (LDA) and partial least squares (PLS) regression, were then incorporated into models for identifying variable relationships. The performance of the different statistical analyses was measured using statistical metrics such as R² for correlation, visualisation of separation for PCA, classification error for LDA, and both R² and mean squared error (MSE) for the PLS models. The bivariate correlations provided the most concise results, whilst the LDA models could not be effectively assessed due to a change in the behaviour of the training and testing data. The PLS models performed poorly and did not produce any significant results. Expert process knowledge was also used to determine which of the variable relationships identified by the models could be regarded as valuable contributions and which ought to be regarded as trivial. Overall, the bivariate correlations were found to be effective for detecting relationships between variables. PCA was a valuable tool that provided insight into the potential use of multivariate analyses. LDA and PLS regression may require further testing before a definitive ruling can be made regarding their usefulness for identifying variable relationships from unprocessed historical plant data. Although historical data could be used to identify variable relationships using bivariate correlations, they are not recommended for multivariate statistical analyses. A planned sampling campaign could be much more effective for data collection than using historical data, although the cost associated with such a campaign must be taken into consideration.
AFRIKAANS SUMMARY: The time lag associated with water quality monitoring at water reclamation plants (WRPs) is a major obstacle to the implementation of potable water reclamation in areas suffering from water shortages. Applying advanced monitoring techniques that rely in part on surrogate and indicator variables is one way to reduce this time lag. The aim of this study was to evaluate statistical analyses that can be used to identify, on the basis of the data-driven approach, variable relationships that can be applied to the development of surrogate and indicator variables. The plant data used for this research were obtained from an existing WRP that has already been operational for five years without undergoing any major changes to its treatment and operating procedures. An initial assessment found that the data contained large amounts of missing values. The assessment also identified the data periods during which the plant was operated under ‘normal’ conditions. Several periods were removed because abnormal events occurred during them. Pre-processing of the data began with outlier removal (three-sigma rule and Hampel filter), noise reduction (moving average filter) and missing data replacement (linear interpolation). The statistical analyses, Pearson’s and Spearman’s correlation, principal component analysis (PCA), linear discriminant analysis (LDA) and partial least squares (PLS) regression, were used in models for identifying variable relationships. The performance of the statistical analyses was measured using statistical metrics such as R² for correlation, visualisation of separation for PCA, classification error for LDA, and both R² and mean squared error for the PLS models. The bivariate correlations showed the most concise results, whereas the LDA models could not be assessed effectively owing to a change in the behaviour of the training and testing data. The PLS models performed poorly and did not deliver any noteworthy results.
Expert process knowledge was also used to determine which of the variable relationships identified by the models could be regarded as valuable contributions and which ought to be regarded as insignificant. In general, the bivariate correlations were found to be effective for identifying relationships between variables. PCA was a valuable tool that provided insight into the potential use of multivariate analysis techniques. LDA and PLS regression may require further testing before a final decision can be made on their usefulness for identifying variable relationships from unprocessed historical plant data. Although historical data could be used to identify variable relationships by means of bivariate correlations, they are not recommended for multivariate statistical analyses. A planned sampling campaign can be far more effective for data collection than the use of historical data, although the cost associated with such a campaign must be taken into account.
2017-02-14T13:30:49Z
2017-03-29T12:15:06Z
2017-03
Thesis
http://hdl.handle.net/10019.1/101158
en_ZA
Stellenbosch University
xlv, 125 pages
application/pdf
Stellenbosch : Stellenbosch University
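The pre-processing steps named in the summary (three-sigma rule, Hampel filter, moving average filter, linear interpolation) can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the window sizes, the threshold of three, and the choice to mark outliers as missing (`None`) so that interpolation fills them later are all assumptions made for illustration.

```python
from statistics import mean, median, pstdev


def three_sigma(values, n_sigmas=3.0):
    """Mark points further than n_sigmas standard deviations from the mean as missing."""
    present = [v for v in values if v is not None]
    mu, sigma = mean(present), pstdev(present)
    return [
        None if v is None or abs(v - mu) > n_sigmas * sigma else v
        for v in values
    ]


def hampel(values, half_window=3, n_sigmas=3.0):
    """Mark points far from the local median (in scaled-MAD units) as missing."""
    out = list(values)
    for i, v in enumerate(values):
        if v is None:
            continue
        lo, hi = max(0, i - half_window), min(len(values), i + half_window + 1)
        window = [w for w in values[lo:hi] if w is not None]
        med = median(window)
        # 1.4826 scales the MAD to estimate the standard deviation of Gaussian data.
        mad = 1.4826 * median([abs(w - med) for w in window])
        if abs(v - med) > n_sigmas * mad:
            out[i] = None
    return out


def moving_average(values, half_window=1):
    """Centred moving average over the available neighbours; gaps stay gaps."""
    out = []
    for i, v in enumerate(values):
        if v is None:
            out.append(None)
            continue
        lo, hi = max(0, i - half_window), min(len(values), i + half_window + 1)
        out.append(mean([w for w in values[lo:hi] if w is not None]))
    return out


def interpolate(values):
    """Fill interior gaps linearly between the nearest known neighbours."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            out[i] = values[a] + (values[b] - values[a]) * (i - a) / (b - a)
    return out


def preprocess(values):
    """Outlier removal, then noise reduction, then missing-data replacement."""
    cleaned = hampel(three_sigma(values))
    return interpolate(moving_average(cleaned))


# Hypothetical sensor readings with one gap (None) and one spike (100.0).
series = [10.0, 10.2, 9.9, 10.1, None, 10.0, 9.8, 10.3, 100.0, 10.1,
          9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 10.1]
result = preprocess(series)
```

The two outlier rules complement each other in this sketch: the three-sigma rule uses the global mean and standard deviation, which a large spike can inflate, whereas the Hampel filter uses a local median and is less sensitive to the outliers it is trying to detect. Gaps at the very start or end of a series are left unfilled, since linear interpolation needs a known value on both sides.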
topic Water reuse -- Statistical methods
Surrogate and indicator variables
Statistics
Surrogate-based optimization
Multivariate statistics
Monitoring techniques
UCTD
url http://hdl.handle.net/10019.1/101158