Combining tree kernels and text embeddings for plagiarism detection

Thesis (MSc)--Stellenbosch University, 2018.

Bibliographic Details
Main Author: Thom, Jacobus Daniel
Other Authors: Van der Merwe, A. B.
Format: Thesis
Language: en_ZA
Published: Stellenbosch : Stellenbosch University 2018
Subjects: Text embeddings; Plagiarism -- Detection; Tree kernels; Syntactic structures; Semantic structures
_version_ 1867613774666530816
access_status_str Open Access
author Thom, Jacobus Daniel
author2 Van der Merwe, A. B.
author_browse Thom, Jacobus Daniel
Van der Merwe, A. B.
author_facet Van der Merwe, A. B.
Thom, Jacobus Daniel
author_sort Thom, Jacobus Daniel
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MSc)--Stellenbosch University, 2018.
format Thesis
id oai:scholar.sun.ac.za:10019.1/103550
institution Stellenbosch University (South Africa)
language en_ZA
last_indexed 2026-06-10T12:41:29.531Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2018
publishDateRange 2018
publishDateSort 2018
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/103550 Combining tree kernels and text embeddings for plagiarism detection Thom, Jacobus Daniel Van der Merwe, A. B. Kroon, R. S. (Steve) Stellenbosch University. Faculty of Science. Dept. of Mathematical Sciences (Computer Science) Text embeddings Plagiarism -- Detection Tree kernels Syntactic structures Semantic structures Thesis (MSc)--Stellenbosch University, 2018. ENGLISH ABSTRACT : The internet allows for vast amounts of information to be accessed with ease. Consequently, it becomes much easier to plagiarize any of this information as well. Most plagiarism detection techniques rely on n-grams to find similarities between suspicious documents and possible sources. N-grams, due to their simplicity, do not make full use of all the syntactic and semantic information contained in sentences. We therefore investigated two methods, namely tree kernels applied to the parse trees of sentences and text embeddings, to utilize more syntactic and semantic information respectively. A plagiarism detector was developed using these techniques and its effectiveness was tested on the PAN 2009 and 2011 external plagiarism corpora. The detector achieved results that were on par with the state of the art for both PAN 2009 and PAN 2011. This indicates that the combination of tree kernel and text embedding techniques is a viable method of plagiarism detection. AFRIKAANS SUMMARY : The internet makes it easy to obtain large amounts of information. Consequently, it also becomes much easier to plagiarize any of this information. Most plagiarism detection techniques rely on n-grams to find similarities between suspicious documents and possible sources. Because n-grams are fairly simple, they do not make full use of all the syntactic and semantic information that sentences contain.
We therefore investigate two methods, namely tree kernels, applied to the parse trees of sentences, and text embeddings, to use more syntactic and semantic information respectively. A plagiarism detector was developed using these two techniques and its effectiveness was tested on the PAN 2009 and 2011 external plagiarism corpora. The detector achieved results that were comparable with the best for both PAN 2009 and PAN 2011. This indicates that the combination of tree kernel and text embedding techniques is a reasonable method of plagiarism detection. 2018-02-20T18:20:45Z 2018-04-09T07:00:11Z 2018-02-20T18:20:45Z 2018-04-09T07:00:11Z 2018-03 Thesis http://hdl.handle.net/10019.1/103550 en_ZA Stellenbosch University xii, 73 pages : illustrations (some colour) application/pdf Stellenbosch : Stellenbosch University
spellingShingle Text embeddings
Plagiarism -- Detection
Tree kernels
Syntactic structures
Semantic structures
Thom, Jacobus Daniel
Combining tree kernels and text embeddings for plagiarism detection
title Combining tree kernels and text embeddings for plagiarism detection
title_full Combining tree kernels and text embeddings for plagiarism detection
title_fullStr Combining tree kernels and text embeddings for plagiarism detection
title_full_unstemmed Combining tree kernels and text embeddings for plagiarism detection
title_short Combining tree kernels and text embeddings for plagiarism detection
title_sort combining tree kernels and text embeddings for plagiarism detection
topic Text embeddings
Plagiarism -- Detection
Tree kernels
Syntactic structures
Semantic structures
url http://hdl.handle.net/10019.1/103550
work_keys_str_mv AT thomjacobusdaniel combiningtreekernelsandtextembeddingsforplagiarismdetection