Evaluation and development of conceptual document similarity metrics with content-based recommender applications

Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2010.

Bibliographic Details
Main Author: Gouws, Stephan
Other Authors: Van Rooyen, G-J.
Format: Thesis
Language: English
Published: Stellenbosch : University of Stellenbosch 2010
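Subjects: Document similarity; Wikipedia; Spreading activation; Information retrieval; Dissertations -- Electronic engineering; Theses -- Electronic engineering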
access_status_str Open Access
author Gouws, Stephan
author2 Van Rooyen, G-J.
collection Thesis
dc_rights_str_mv University of Stellenbosch
description Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2010.
format Thesis
id oai:scholar.sun.ac.za:10019.1/5363
institution Stellenbosch University (South Africa)
language English
last_indexed 2026-06-10T12:45:22.846Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2010
publisher Stellenbosch : University of Stellenbosch
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/5363 Evaluation and development of conceptual document similarity metrics with content-based recommender applications Gouws, Stephan Van Rooyen, G-J. Engelbrecht, H. A. University of Stellenbosch. Faculty of Engineering. Dept. of Electrical and Electronic Engineering. Document similarity Wikipedia Spreading activation Information retrieval Dissertations -- Electronic engineering Theses -- Electronic engineering Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2010.
ENGLISH ABSTRACT: The World Wide Web brought with it an unprecedented level of information overload. Computers are very effective at processing and clustering numerical and binary data; however, the conceptual clustering of natural-language data is considerably harder to automate. Most past approaches rely on simple keyword-matching techniques or probabilistic methods to measure semantic relatedness. However, these approaches do not always accurately capture conceptual relatedness as measured by humans. In this thesis we propose and evaluate the use of novel Spreading Activation (SA) techniques for computing semantic relatedness, by modelling the article hyperlink structure of Wikipedia as an associative network structure for knowledge representation. The SA technique is adapted and several problems are addressed for it to function over the Wikipedia hyperlink structure. Inter-concept and inter-document similarity metrics are developed that make use of SA to compute the conceptual similarity between two concepts and between two natural-language documents. We evaluate these approaches over two document similarity datasets and achieve results which compare favourably with the state of the art. Furthermore, document preprocessing techniques are evaluated in terms of the performance gain they can bring to the well-known cosine document similarity metric and the Normalised Compression Distance (NCD) metric.
Results indicate that a near two-fold increase in accuracy can be achieved for NCD by applying simple preprocessing techniques. Nonetheless, the cosine similarity metric still significantly outperforms NCD. Finally, we show that using our Wikipedia-based method to augment the cosine vector space model provides results superior to either method in isolation. Combining the two methods raises the Pearson correlation to ρ = 0.72 over the Lee (2005) document similarity dataset, which matches the reported result for the state-of-the-art Explicit Semantic Analysis (ESA) technique while requiring less than 10% of the Wikipedia database that ESA requires. As a use case for document similarity techniques, a purely content-based news-article recommender system is designed and implemented for a large online media company. This system is used to gather additional human-generated relevance ratings, which we use to evaluate the performance of three state-of-the-art document similarity metrics for providing content-based document recommendations.
AFRIKAANSE OPSOMMING: Die Wêreldwye-Web het ’n vlak van inligting-oorbelading tot gevolg gehad soos nog nooit tevore. Rekenaars is baie effektief met die verwerking en groepering van numeriese en binêre data, maar die konsepsuele groepering van natuurlike-taal data is aansienlik moeiliker om te outomatiseer. Tradisioneel berus sulke algoritmes op eenvoudige sleutelwoordherkenningstegnieke of waarskynlikheidsmetodes om semantiese verwantskappe te bereken, maar hierdie benaderings modelleer nie konsepsuele verwantskappe, soos gemeet deur die mens, baie akkuraat nie. In hierdie tesis stel ons die gebruik van ’n nuwe aktiverings-verspreidingstrategie (AV) voor waarmee inter-konsep verwantskappe bereken kan word, deur die artikel skakelstruktuur van Wikipedia te modelleer as ’n assosiatiewe netwerk.
Die AV tegniek word aangepas om te funksioneer oor die Wikipedia skakelstruktuur, en verskeie probleme wat hiermee gepaard gaan word aangespreek. Inter-konsep en inter-dokument verwantskapsmaatstawwe word ontwikkel wat gebruik maak van AV om die konsepsuele verwantskap tussen twee konsepte en twee natuurlike-taal dokumente te bereken. Ons evalueer hierdie benadering oor twee dokument-verwantskap datastelle en die resultate vergelyk goed met die van ander toonaangewende metodes. Verder word teks-voorverwerkingstegnieke ondersoek in terme van die moontlike verbetering wat dit tot gevolg kan hê op die werksverrigting van die bekende kosinus vektorruimtemaatstaf en die genormaliseerde kompressie-afstandmaatstaf (GKA). Resultate dui daarop dat GKA se akkuraatheid byna verdubbel kan word deur gebruik te maak van eenvoudige voorverwerkingstegnieke, maar dat die kosinus vektorruimtemaatstaf steeds aansienlike beter resultate lewer. Laastens wys ons dat die Wikipedia-gebasseerde metode gebruik kan word om die vektorruimtemaatstaf aan te vul tot ’n gekombineerde maatstaf wat beter resultate lewer as enige van die twee metodes afsonderlik. Deur die twee metodes te kombineer lei tot ’n verhoogde korrelasie van Pearson ρ = 0.72 oor die Lee dokument-verwantskap datastel. Dit is gelyk aan die gerapporteerde resultaat vir Explicit Semantic Analysis (ESA), die huidige beste Wikipedia-gebasseerde tegniek. Ons benadering benodig egter minder as 10% van die Wikipedia databasis wat benodig word vir ESA. As ’n toetstoepassing vir dokument-verwantskaptegnieke ontwerp en implementeer ons ’n stelsel vir ’n aanlyn media-maatskappy wat nuusartikels aanbeveel vir gebruikers, slegs op grond van die artikels se inhoud. Joernaliste wat die stelsel gebruik ken ’n punt toe aan elke aanbeveling en ons gebruik hierdie data om die akkuraatheid van drie toonaangewende maatstawwe vir dokument-verwantskap te evalueer in die konteks van inhoud-gebasseerde nuus-artikel aanbevelings.
2010-11-15T14:45:05Z 2010-12-15T10:38:03Z 2010-11-15T14:45:05Z 2010-12-15T10:38:03Z 2010-12 Thesis http://hdl.handle.net/10019.1/5363 en University of Stellenbosch 105 p. : ill. application/pdf Stellenbosch : University of Stellenbosch
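Illustrative note on the abstract above: it describes spreading activation over the Wikipedia article hyperlink graph. The sketch below shows the general idea only, on a tiny hand-made link graph. It is not the thesis's algorithm: the function name `spread_activation`, the `decay`, `threshold` and `max_hops` parameters, the even fan-out split and the toy graph are all assumptions chosen for illustration.

```python
from collections import defaultdict


def spread_activation(graph, source, decay=0.5, threshold=0.01, max_hops=3):
    """Spread activation outward from `source` over a directed link graph.

    graph: dict mapping a node name to the list of nodes it links to.
    Returns a dict mapping each reached node to its accumulated activation.
    All parameter names and defaults are hypothetical.
    """
    activation = defaultdict(float)
    activation[source] = 1.0
    frontier = {source: 1.0}
    for _ in range(max_hops):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            # Decay the energy and split it evenly across the out-links.
            share = energy * decay / len(neighbours)
            if share < threshold:
                continue  # too weak to propagate any further
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)


# Toy stand-in for an article link graph (not real Wikipedia data).
toy_links = {
    "Cat": ["Mammal", "Pet"],
    "Dog": ["Mammal", "Pet"],
    "Mammal": ["Animal"],
    "Pet": [],
    "Animal": [],
}
cat_activation = spread_activation(toy_links, "Cat")
```

Two concepts could then be compared through the overlap of the activation they spread, for example by comparing `spread_activation(toy_links, "Cat")` with `spread_activation(toy_links, "Dog")`. How the thesis actually derives its similarity metrics from the activations is not stated in this record.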
title Evaluation and development of conceptual document similarity metrics with content-based recommender applications
topic Document similarity
Wikipedia
Spreading activation
Information retrieval
Dissertations -- Electronic engineering
Theses -- Electronic engineering
url http://hdl.handle.net/10019.1/5363
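Illustrative note on the two baseline metrics named in the abstract, cosine similarity and the Normalised Compression Distance (NCD). The sketch below uses the textbook definitions, with NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The whitespace tokenisation, the raw term counts and the choice of `zlib` as compressor are assumptions made here. The record does not say which weighting, preprocessing or compressor the thesis used.

```python
import math
import zlib
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between raw term-count vectors of two texts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def ncd(doc_a, doc_b):
    """Normalised Compression Distance, with zlib as the (assumed) compressor."""
    x = doc_a.encode("utf-8")
    y = doc_b.encode("utf-8")
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


# Made-up example texts: higher cosine means more similar, lower NCD means more similar.
text_a = "the cat sat on the mat " * 20
text_b = "stock markets fell sharply on monday " * 20
cosine_ab = cosine_similarity(text_a, text_b)
ncd_ab = ncd(text_a, text_b)
```

Note that the two scores run in opposite directions: cosine is a similarity, NCD is a distance, so any comparison or combination of them needs one of the two inverted first.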