A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text

Mini Dissertation (MCom (Statistics))--University of Pretoria, 2020.

Bibliographic Details
Other Authors: De Waal, Alta
Format: Thesis
Language: English
Published: University of Pretoria 2020
_version_ 1867613453144817664
access_status_str Open Access
author2 De Waal, Alta
collection Thesis
dc_rights_str_mv © 2019 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
description Mini Dissertation (MCom (Statistics))--University of Pretoria, 2020.
format Thesis
id oai:repository.up.ac.za:2263/73230
institution University of Pretoria (South Africa)
language English
last_indexed 2026-06-10T12:36:23.211Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from UPSpace — University of Pretoria Institutional Repository
publishDate 2020
publisher University of Pretoria
record_format dspace
source_str UPSpace — University of Pretoria Institutional Repository
spelling oai:repository.up.ac.za:2263/73230
A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text
De Waal, Alta
u13075782@tuks.co.za
Derks, Iena Petronella
UCTD
Statistics
Mini Dissertation (MCom (Statistics))--University of Pretoria, 2020.
With the increase in online social media interactions, the true identity of user profiles becomes increasingly doubtful. Fake profiles are used to engineer perceptions of opinions and also to create online relationships under false pretences. Natural language text (how the user structures a sentence and uses words) provides useful information for discovering expected patterns, given the assumed social profile of the user. We expect, for example, different word use and sentence structures from teenagers than from adults. Sociolinguistics is the study of language in the context of social factors such as age, culture and common interest. Natural language processing (NLP) provides quantitative methods to discover sociolinguistic patterns in text data. Current NLP methods make use of a multinomial naïve Bayes classifier to classify unseen documents into predefined sociolinguistic classes. One property of language that is not captured in binomial or multinomial models is burstiness: once a person uses a word, they are more likely to use that word again. Capturing burstiness therefore requires relaxing the independence assumption between respective counts of the same word. The Poisson distribution family captures this phenomenon; in the field of biostatistics, these distributions are often referred to as contagious distributions (because the counts between contagious diseases are not independent). In this research, we relax this count independence assumption of the naïve Bayes classifier by replacing the baseline multinomial likelihood function with a Poisson likelihood function.
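The replacement of the multinomial likelihood with a Poisson likelihood described above can be sketched roughly as follows. This is a minimal illustration, not the dissertation's implementation: the record does not give the exact likelihood, smoothing, or any document-length normalisation, so the plain Poisson pmf, the additive smoothing constant `alpha`, the function names, and the toy counts and class labels are all assumptions made for the example.

```python
import math
from collections import defaultdict


def fit_poisson_nb(docs, labels, alpha=1.0):
    """Estimate one Poisson rate per (class, word) from word-count vectors.

    docs   : list of equal-length lists of word counts (one per document)
    labels : list of class labels, one per document
    alpha  : smoothing constant (an assumption; keeps every rate positive)
    """
    by_class = defaultdict(list)
    for counts, label in zip(docs, labels):
        by_class[label].append(counts)

    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        log_prior = math.log(n / len(docs))
        # Smoothed mean count per document for each vocabulary word.
        rates = [(sum(col) + alpha) / (n + alpha) for col in zip(*rows)]
        model[label] = (log_prior, rates)
    return model


def log_poisson(x, lam):
    """Log of the Poisson pmf: x*log(lam) - lam - log(x!)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def predict(model, counts):
    """Return the class with the highest log prior plus log likelihood."""
    scores = {
        label: log_prior
        + sum(log_poisson(x, lam) for x, lam in zip(counts, rates))
        for label, (log_prior, rates) in model.items()
    }
    return max(scores, key=scores.get)


# Toy data (hypothetical): three vocabulary words, two classes.
docs = [[3, 0, 1], [4, 1, 0], [0, 3, 2], [1, 4, 3]]
labels = ["teen", "teen", "adult", "adult"]
model = fit_poisson_nb(docs, labels)
print(predict(model, [5, 0, 0]), predict(model, [0, 2, 4]))
```

Each word count is scored independently given the class, as in any naïve Bayes classifier; only the per-word likelihood differs from the multinomial baseline.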
In the second stage of the NLP pipeline, we use the top words identified in each class to explore the conditional dependencies between these words. For this purpose, an unsupervised Bayesian network is trained on a Bag-of-Words vectorisation of the top words. The output of the second stage is an exploration of the sociolinguistic patterns among different groups of people. The proposed methodology is applied to two data sets. In both cases, the contagious naïve Bayes classifier achieved the best results and we were able to extract word dependency structures from the Bayesian network learning. The methods developed in this research have the potential to aid security institutions, forensic investigations, and market researchers in identifying valuable sociolinguistic features associated with social groups of interest.
Center for Artificial Intelligence (CAIR)
Statistics
MCom (Statistics)
Unrestricted
2020-02-12T06:50:23Z
2020-02-12T06:50:23Z
2020-04-15
2020
Mini Dissertation
Derks, IP 2020, A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text, MCom Mini-dissertation, University of Pretoria, Pretoria
A2020
http://hdl.handle.net/2263/73230
en
© 2019 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
application/pdf
University of Pretoria
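As a rough illustration of the second stage described in the abstract, the sketch below builds a binary Bag-of-Words view of a few top words and ranks word pairs by mutual information. This is a stand-in, not Bayesian network structure learning: pairwise mutual information is only one ingredient that tree-based structure learners such as Chow-Liu rely on, and the record does not say which learning algorithm the dissertation used. The word list, the toy documents, and the function names are hypothetical.

```python
import math
from itertools import combinations


def mutual_information(docs, a, b):
    """Mutual information (in nats) between the presence of words a and b.

    docs is a list of token sets, i.e. a binary Bag-of-Words view.
    """
    n = len(docs)
    mi = 0.0
    for has_a in (True, False):
        for has_b in (True, False):
            joint = sum((a in d) == has_a and (b in d) == has_b for d in docs) / n
            p_a = sum((a in d) == has_a for d in docs) / n
            p_b = sum((b in d) == has_b for d in docs) / n
            if joint > 0:  # zero-probability cells contribute nothing
                mi += joint * math.log(joint / (p_a * p_b))
    return mi


def rank_word_pairs(docs, words):
    """Return word pairs sorted from most to least dependent."""
    pairs = combinations(words, 2)
    return sorted(pairs, key=lambda p: mutual_information(docs, *p), reverse=True)


# Toy data (hypothetical): each document reduced to the top words it contains.
docs = [
    {"lol", "omg"},
    {"lol", "omg", "school"},
    {"meeting"},
    {"meeting", "school"},
]
top_words = ["lol", "omg", "school"]
for pair in rank_word_pairs(docs, top_words):
    print(pair, round(mutual_information(docs, *pair), 3))
```

Pairs with high mutual information are candidates for an edge in a learned dependency structure; a full Bayesian network learner would additionally orient edges and score whole graphs.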
title A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text
topic UCTD
Statistics
url http://hdl.handle.net/2263/73230