Full Text Available

Note: Clicking the button above will open the full text document at the original institutional repository in a new window.

An investigation into performance-related issues of regular expression matching

Thesis (MSc) -- Stellenbosch University, 2022.

Saved in:
Bibliographic Details
Main Author: Van Litsenborgh, Pieter Steyn
Other Authors: Van der Merwe, Brink
Format: Thesis
Language:en_ZA
Published: Stellenbosch : Stellenbosch University 2022
Subjects:
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1867614112968605696
access_status_str Open Access
author Van Litsenborgh, Pieter Steyn
author2 Van der Merwe, Brink
author_browse Van Litsenborgh, Pieter Steyn
Van der Merwe, Brink
author_facet Van der Merwe, Brink
Van Litsenborgh, Pieter Steyn
author_sort Van Litsenborgh, Pieter Steyn
collection Thesis
dc_rights_str_mv Stellenbosch University
description Thesis (MSc) -- Stellenbosch University, 2022.
format Thesis
id oai:scholar.sun.ac.za:10019.1/126045
institution Stellenbosch University (South Africa)
language en_ZA
last_indexed 2026-06-10T12:46:52.458Z
license_str Other — see source repository
provenance_str_mv Harvested via OAI-PMH from SUNScholar — Stellenbosch University Repository
publishDate 2022
publishDateRange 2022
publishDateSort 2022
publisher Stellenbosch : Stellenbosch University
publisherStr Stellenbosch : Stellenbosch University
record_format dspace
source_str SUNScholar — Stellenbosch University Repository
spelling oai:scholar.sun.ac.za:10019.1/126045 An investigation into performance-related issues of regular expression matching Van Litsenborgh, Pieter Steyn Van der Merwe, Brink Bester, Willem Stellenbosch University. Faculty of Science. Dept. of Computer Science. Regular expressions (Computer science) Pattern recognition systems Programming languages (Electronic computers) UCTD Thesis (MSc) -- Stellenbosch University, 2022. ENGLISH ABSTRACT: Regular expressions (regexes) constitute a concise, powerful, and useful pattern matching language for strings. They are widely supported in programming languages, text processing programs, and (advanced) text editors. In general, regex matching is performed efficiently, but in some cases, a vulnerable regex (and an exploit string) can cause the matching procedure to take exponential time in the length of the input string. As regexes are frequently used in userfacing circumstances, vulnerable regexes open up the underlying application to a potential denial of service vector. Several regex engines choose to implement an algorithm that will always match an input string in linear time, but can only support a subset of modern regex constructs. We show how to implement a construct known as Regular Expressions with Lookahead (REwLA), and investigate various state complexity results when converting REwLA to equivalent deterministic finite automata (DFA). The relationship between REwLA with atomic operations and REwLA without is investigated, and an algorithm for translating a REwLA with atomic operations to a REwLA without is provided. Vulnerable regexes only exist when the regex engine implements a backtracking algorithm. We extend non-deterministic finite a utomata (NFA) and regexes by adding memoization to these formalisms. Furthermore, we generalise the concept of ambiguity in order to be applicable to memoized extensions. These extensions are aimed at improving the matching time of backtracking regex engines. AFRIKAANS OPSOMMING: Regulêre uitdrukkings (regekse) vorm ’n bondige, kragtige en bruikbare taal vir die patroon-ooreenstemming van stringe. Hulle word ondersteun in ’n groot hoeveelheid van programmeertale, teksverwerkingsprogramme en (gevorderde) teksredigeerders. Oor die algemeen word regeks-passing doeltreffend uitgevoer, maar in sommige gevalle kan ’n kwesbare regeks (en ’n uitbuit-string) veroorsaak dat die ooreenstemming prosedure eksponensiële tyd in die lengte van die invoerstring neem. Aangesien regekse gereeld gebruik word in omstandighede waar die gebruiker die hoof akteur is, maak dit kwesbare regekse oop tot ’n moontlike ontkenning van diens aanval. Verskeie regeks-enjins kies om ’n algoritme te implementeer wat altyd ’n invoerstring in lineêre tyd sal pas, maar kan slegs ’n deelversameling van moderne regeks-konstruksies ondersteun. Ons wys hoe om ’n konstruksie, bekend as “Regular Expressions with Lookahead” (REwLA) te implementeer en ondersoek verskeie toestandskompleksiteitsresultate wanneer REwLA na ekwivalente deterministiese eindige outomatiese (DFA) omgeskakel word. Die verband tussen REwLA met atomiese operatore en REwLA daarsonder word ondersoek en ’n algoritme vir die vertaling van REwLA met atomiese operatore na REwLA daarsonder word verskaf. Kwesbare regekse bestaan slegs wanneer die regex-enjin ’n terugspooralgoritme implementeer. Ons brei nie-deterministiese eindige outomatiese (NFA) en regekse uit deur memoisering by hierdie formalisme te voeg. Verder veralgemeen ons die konsep van dubbelsinnigheid om van toepassing te wees op gememoriseerde uitbreidings. Hierdie uitbreidings is daarop gemik om die passingstyd van terugsporing van regeks-enjins te verbeter. Masters 2022-11-24T03:24:11Z 2023-01-16T12:47:16Z 2022-11-24T03:24:11Z 2023-01-16T12:47:16Z 2022-11 Thesis http://hdl.handle.net/10019.1/126045 en_ZA Stellenbosch University vii, 96 pages : illustrations application/pdf Stellenbosch : Stellenbosch University
spellingShingle Regular expressions (Computer science)
Pattern recognition systems
Programming languages (Electronic computers)
UCTD
Van Litsenborgh, Pieter Steyn
An investigation into performance-related issues of regular expression matching
title An investigation into performance-related issues of regular expression matching
title_full An investigation into performance-related issues of regular expression matching
title_fullStr An investigation into performance-related issues of regular expression matching
title_full_unstemmed An investigation into performance-related issues of regular expression matching
title_short An investigation into performance-related issues of regular expression matching
title_sort investigation into performance related issues of regular expression matching
topic Regular expressions (Computer science)
Pattern recognition systems
Programming languages (Electronic computers)
UCTD
url http://hdl.handle.net/10019.1/126045
work_keys_str_mv AT vanlitsenborghpietersteyn aninvestigationintoperformancerelatedissuesofregularexpressionmatching
AT vanlitsenborghpietersteyn investigationintoperformancerelatedissuesofregularexpressionmatching