Mini Dissertation (MCom (Statistics))--University of Pretoria, 2020.
| Other Authors: | De Waal, Alta |
|---|---|
| Format: | Thesis |
| Language: | English |
| Published: | University of Pretoria, 2020 |
| Subjects: | UCTD Statistics |
| _version_ | 1867613453144817664 |
|---|---|
| access_status_str | Open Access |
| author2 | De Waal, Alta |
| author_browse | De Waal, Alta |
| author_facet | De Waal, Alta |
| collection | Thesis |
| dc_rights_str_mv | © 2019 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. |
| description | Mini Dissertation (MCom (Statistics))--University of Pretoria, 2020. |
| format | Thesis |
| id | oai:repository.up.ac.za:2263/73230 |
| institution | University of Pretoria (South Africa) |
| language | English |
| last_indexed | 2026-06-10T12:36:23.211Z |
| license_str | Other — see source repository |
| provenance_str_mv | Harvested via OAI-PMH from UPSpace — University of Pretoria Institutional Repository |
| publishDate | 2020 |
| publishDateRange | 2020 |
| publishDateSort | 2020 |
| publisher | University of Pretoria |
| publisherStr | University of Pretoria |
| record_format | dspace |
| source_str | UPSpace — University of Pretoria Institutional Repository |
| spelling | oai:repository.up.ac.za:2263/73230 A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text De Waal, Alta u13075782@tuks.co.za Derks, Iena Petronella UCTD Statistics Mini Dissertation (MCom (Statistics))--University of Pretoria, 2020. With the increase in online social media interactions, the true identity of user profiles becomes increasingly doubtful. Fake profiles are used to engineer perceptions of opinions and also to create online relationships under false pretence. Natural language text -- how the user structures a sentence and uses words -- provides useful information to discover expected patterns, given the assumed social profile of the user. We expect, for example, different word use and sentence structures from teenagers than from adults. Sociolinguistics is the study of language in the context of social factors such as age, culture and common interest. Natural language processing (NLP) provides quantitative methods to discover sociolinguistic patterns in text data. Current NLP methods make use of a multinomial naïve Bayes classifier to classify unseen documents into predefined sociolinguistic classes. One property of language that is not captured in binomial or multinomial models is burstiness. Burstiness describes the phenomenon that if a person uses a word, they are more likely to use that word again. Thus, the independence assumption between respective counts of the same word is relaxed. The Poisson distribution family captures this phenomenon, and in the field of biostatistics its members are often referred to as contagious distributions (because the counts of contagious diseases are not independent). In this research, we relax this count independence assumption of the naïve Bayes classifier by replacing the baseline multinomial likelihood function with a Poisson likelihood function. In the second stage of the NLP pipeline, we use the top words identified in each class to explore the conditional dependencies between these words. For this purpose, an unsupervised Bayesian network is trained on a Bag-of-Words vectorisation of the top words. The output of the second stage is an exploration of the sociolinguistic patterns among different groups of people. The proposed methodology is applied to two data sets. In both cases, the contagious naïve Bayes classifier achieved the best results and we were able to extract word dependency structures from the Bayesian network learning. The methods developed in this research have the potential to aid security institutions, forensic investigations, and market researchers in identifying valuable sociolinguistic features associated with social groups of interest. Center for Artificial Intelligence (CAIR) Statistics MCom (Statistics) Unrestricted 2020-02-12T06:50:23Z 2020-02-12T06:50:23Z 2020-04-15 2020 Mini Dissertation Derks, IP 2020, A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text, MCom Mini-dissertation, University of Pretoria, Pretoria A2020 http://hdl.handle.net/2263/73230 en © 2019 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. application/pdf University of Pretoria |
| spellingShingle | UCTD Statistics A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text |
| title | A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text |
| title_full | A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text |
| title_fullStr | A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text |
| title_full_unstemmed | A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text |
| title_short | A two-stage contagious naïve Bayes classifier for detecting sociolinguistic features in text |
| title_sort | two stage contagious naive bayes classifier for detecting sociolinguistic features in text |
| topic | UCTD Statistics |
| url | http://hdl.handle.net/2263/73230 |
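
The abstract describes replacing the multinomial likelihood of a naïve Bayes classifier with a Poisson likelihood. The following is a minimal sketch of that idea only, not the dissertation's implementation: it uses a plain per-word Poisson rate for each class, whereas the "contagious" member of the Poisson family used in the work may differ. The function names (`fit_poisson_nb`, `predict_poisson_nb`), the smoothing constant `alpha`, the toy documents and the class labels are all hypothetical, and document length is not normalised.

```python
import math
from collections import Counter, defaultdict


def fit_poisson_nb(docs, labels, alpha=1.0):
    """Estimate class priors and one Poisson rate per (class, word).

    docs   -- list of token lists
    labels -- list of class labels, same length as docs
    alpha  -- hypothetical smoothing constant that keeps every rate > 0
    """
    vocab = sorted({w for d in docs for w in d})
    by_class = defaultdict(list)
    for d, y in zip(docs, labels):
        by_class[y].append(Counter(d))

    model = {"vocab": vocab, "log_prior": {}, "rate": {}}
    for y, counters in by_class.items():
        model["log_prior"][y] = math.log(len(counters) / len(docs))
        totals = Counter()
        for c in counters:
            totals.update(c)
        # Smoothed mean count of each word per document in class y.
        model["rate"][y] = {
            w: (totals[w] + alpha) / (len(counters) + 1.0) for w in vocab
        }
    return model


def predict_poisson_nb(model, doc):
    """Return (best_label, log_scores) for one tokenised document."""
    counts = Counter(doc)
    scores = {}
    for y, log_prior in model["log_prior"].items():
        s = log_prior
        for w in model["vocab"]:  # words outside the vocabulary are ignored
            lam = model["rate"][y][w]
            k = counts.get(w, 0)
            # Poisson log-pmf: k*log(lam) - lam - log(k!)
            s += k * math.log(lam) - lam - math.lgamma(k + 1)
        scores[y] = s
    return max(scores, key=scores.get), scores


# Toy data, invented for illustration only.
train_docs = [
    ["lol", "omg", "lol"],
    ["omg", "lol", "cool"],
    ["regards", "meeting", "regards"],
    ["meeting", "report", "regards"],
]
train_labels = ["teen", "teen", "adult", "adult"]

model = fit_poisson_nb(train_docs, train_labels)
label, scores = predict_poisson_nb(model, ["lol", "lol", "omg"])
print(label)
```

Unlike the multinomial likelihood, each word count here is scored against its own class-specific rate, so words that are absent from a document also contribute to the score through the `-lam` term.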