A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text

Mini Dissertation (MCom (Statistics))--University of Pretoria, 2020.

Bibliographic Details
Other Authors:	De Waal, Alta
Format:	Thesis
Language:	English
Published:	University of Pretoria 2020
Subjects:	UCTD Statistics

_version_	1867613453144817664
access_status_str	Open Access
author2	De Waal, Alta
author_browse	De Waal, Alta
author_facet	De Waal, Alta
collection	Thesis
dc_rights_str_mv	© 2019 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
description	Mini Dissertation (MCom (Statistics))--University of Pretoria, 2020.
format	Thesis
id	oai:repository.up.ac.za:2263/73230
institution	University of Pretoria (South Africa)
language	English
last_indexed	2026-06-10T12:36:23.211Z
license_str	Other — see source repository
provenance_str_mv	Harvested via OAI-PMH from UPSpace — University of Pretoria Institutional Repository
publishDate	2020
publishDateRange	2020
publishDateSort	2020
publisher	University of Pretoria
publisherStr	University of Pretoria
record_format	dspace
source_str	UPSpace — University of Pretoria Institutional Repository
spelling	oai:repository.up.ac.za:2263/73230 A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text De Waal, Alta u13075782@tuks.co.za Derks, Iena Petronella UCTD Statistics Mini Dissertation (MCom (Statistics))--University of Pretoria, 2020. With the increase in online social media interactions, the true identity of user profiles becomes increasingly doubtful. Fake profiles are used to engineer perceptions of opinions and also to create online relationships under false pretences. Natural language text -- how the user structures a sentence and uses words -- provides useful information to discover expected patterns, given the assumed social profile of the user. We expect, for example, different word use and sentence structures from teenagers than from adults. Sociolinguistics is the study of language in the context of social factors such as age, culture and common interest. Natural language processing (NLP) provides quantitative methods to discover sociolinguistic patterns in text data. Current NLP methods make use of a multinomial naïve Bayes classifier to classify unseen documents into predefined sociolinguistic classes. One property of language that is not captured in binomial or multinomial models is that of burstiness. Burstiness is the phenomenon that once a person uses a word, they are more likely to use that word again. Thus, the independence assumption between respective counts of the same word is relaxed. The Poisson distribution family captures this phenomenon; in the field of biostatistics, its members are often referred to as contagious distributions (because the counts of contagious disease cases are not independent). In this research, we relax this count independence assumption of the naïve Bayes classifier by replacing the baseline multinomial likelihood function with a Poisson likelihood function. 
In the second stage of the NLP pipeline, we use the top words identified in each class to explore the conditional dependencies between these words. For this purpose, an unsupervised Bayesian network is trained on a Bag-of-Words vectorisation of the top words. The output of the second stage is an exploration of the sociolinguistic patterns among different groups of people. The proposed methodology is applied to two data sets. In both cases, the contagious naïve Bayes classifier achieved the best results and we were able to extract word dependency structures from the Bayesian network learning. The methods developed in this research have the potential to aid security institutions, forensic investigations, and market researchers in identifying valuable sociolinguistic features associated with social groups of interest. Center for Artificial Intelligence (CAIR) Statistics MCom (Statistics) Unrestricted 2020-02-12T06:50:23Z 2020-02-12T06:50:23Z 2020-04-15 2020 Mini Dissertation Derks, IP 2020, A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text, MCom Mini-dissertation, University of Pretoria, Pretoria A2020 http://hdl.handle.net/2263/73230 en © 2019 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. application/pdf University of Pretoria
spellingShingle	UCTD Statistics A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text
title	A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text
title_full	A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text
title_fullStr	A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text
title_full_unstemmed	A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text
title_short	A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text
title_sort	two stage contagious naive bayes classifier for detecting sociolinguistic features in text
topic	UCTD Statistics
url	http://hdl.handle.net/2263/73230
