Modelling highly imbalanced credit card fraud detection data using statistical learning

Bibliographic Details
Main Author:	Moodley, Revesa
Other Authors:	Britz, Stefan
Format:	Thesis
Language:	English
Published:	Department of Statistical Sciences 2024
Subjects:	Statistical Sciences

_version_	1867613146089259008
access_status_str	Open Access
author	Moodley, Revesa
author2	Britz, Stefan
author_browse	Britz, Stefan Moodley, Revesa
author_facet	Britz, Stefan Moodley, Revesa
author_sort	Moodley, Revesa
collection	Thesis
description	Credit card fraud is a major concern for businesses worldwide, yielding losses of up to $67 billion per year in major banks and institutions. Machine learning techniques used to detect fraudulent transactions face several challenges when dealing with highly imbalanced data, which is often the case with fraud detection. Whilst different sampling techniques are generally used to reduce the imbalance, few studies have focussed on the effect that the level of imbalance has on the predictive capabilities of various statistical learning techniques. This study investigates the effect of three factors on model performance: 1) sampling technique, 2) supervised learning method, and 3) prevalence rate, also known as imbalance ratio (IR), which refers to the proportion of majority class samples compared to that of the minority class. Three sampling techniques are utilised in the study: Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE), and Random Undersampling (RUS). These methods are used to create varying levels of imbalance in the dataset, at prevalence rates of 0.2%, 1%, 10%, 20%, 30%, 40%, and 50%. Six supervised learning models are then used to identify fraudulent transactions: Logistic Regression (LR), C4.5 Decision Trees (DT), Random Forests (RF), XGBoost, and Neural Network (NN) models. Precision, recall, and F2 score are the primary metrics used to assess model performance. The results suggest that the ROS and SMOTE sampling techniques performed best in terms of F2 score. The best supervised learning models are RF and XGBoost. The tree models were generally well suited to the imbalanced dataset, whilst LR performed the worst, even when applying regularisation. Surprisingly, increasing the prevalence rate yielded a decrease in performance. The findings from the experiments can serve as a foundation for selecting the best sampling technique and supervised learning models to utilise with various degrees of dataset imbalance.
format	Thesis
id	oai:open.uct.ac.za:11427/39715
institution	University of Cape Town (South Africa)
language	eng
last_indexed	2026-06-10T12:31:30.019Z
license_str	Not specified — see source repository
provenance_str_mv	Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate	2024
publishDateRange	2024
publishDateSort	2024
publisher	Department of Statistical Sciences
publisherStr	Department of Statistical Sciences
record_format	dspace
source_str	UCTD — University of Cape Town Open Access Repository
spelling	oai:open.uct.ac.za:11427/39715 Modelling highly imbalanced credit card fraud detection data using statistical learning Moodley, Revesa Britz, Stefan Statistical Sciences 2024-05-27T08:46:14Z 2024-05-27T08:46:14Z 2023 2024-05-22T08:11:45Z Thesis / Dissertation Masters MSc http://hdl.handle.net/11427/39715 eng application/pdf Department of Statistical Sciences Faculty of Science
spellingShingle	Statistical Sciences Moodley, Revesa Modelling highly imbalanced credit card fraud detection data using statistical learning
thesis_degree_str	Master's
title	Modelling highly imbalanced credit card fraud detection data using statistical learning
title_full	Modelling highly imbalanced credit card fraud detection data using statistical learning
title_fullStr	Modelling highly imbalanced credit card fraud detection data using statistical learning
title_full_unstemmed	Modelling highly imbalanced credit card fraud detection data using statistical learning
title_short	Modelling highly imbalanced credit card fraud detection data using statistical learning
title_sort	modelling highly imbalanced credit card fraud detection data using statistical learning
topic	Statistical Sciences
url	http://hdl.handle.net/11427/39715
work_keys_str_mv	AT moodleyrevesa modellinghighlyimbalancedcreditcardfrauddetectiondatausingstatisticallearning
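
Illustrative sketch (not taken from the thesis): the description field above mentions resampling a dataset to a chosen prevalence rate and scoring models with F2. The Python below is a minimal, standard-library sketch of those two ideas, assuming that "prevalence rate" means the fraction of fraud (label 1) samples in the resampled data. The function names fbeta, undersample_to_prevalence, and oversample_to_prevalence are hypothetical and are not the author's code; SMOTE is not shown.

```python
import random


def fbeta(precision, recall, beta=2.0):
    """F-beta score; beta=2 (F2) weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def _split(labels, target):
    # Assumption: label 1 is the minority (fraud) class, label 0 the majority.
    if not 0 < target < 1:
        raise ValueError("target prevalence must lie strictly between 0 and 1")
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    return minority, majority


def undersample_to_prevalence(labels, target, seed=0):
    """Random undersampling: keep all minority rows, drop majority rows at random.

    Returns row indices whose minority fraction is approximately `target`.
    """
    rng = random.Random(seed)
    minority, majority = _split(labels, target)
    # n_min / (n_min + n_maj) = target  =>  n_maj = n_min * (1 - target) / target
    n_major = min(len(majority), round(len(minority) * (1 - target) / target))
    kept = minority + rng.sample(majority, n_major)
    rng.shuffle(kept)
    return kept


def oversample_to_prevalence(labels, target, seed=0):
    """Random oversampling: keep all rows, duplicate minority rows at random.

    Returns row indices (with repeats) whose minority fraction is approximately `target`.
    """
    rng = random.Random(seed)
    minority, majority = _split(labels, target)
    # n_min / (n_min + n_maj) = target  =>  n_min = n_maj * target / (1 - target)
    n_minor = round(len(majority) * target / (1 - target))
    extra = max(0, n_minor - len(minority))
    drawn = [rng.choice(minority) for _ in range(extra)]
    kept = majority + minority + drawn
    rng.shuffle(kept)
    return kept


if __name__ == "__main__":
    labels = [1] * 10 + [0] * 990  # toy data with 1% prevalence
    idx = undersample_to_prevalence(labels, target=0.10)
    print(len(idx), sum(labels[i] for i in idx) / len(idx))
    print(fbeta(precision=0.5, recall=1.0))
```

In practice, resampling of this kind is usually applied to the training split only, so that the test data keeps its original class balance.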

Similar Items