KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation and Streaming Offset Bundling for Julia NLP

Bibliographic Details
Published in: Journal of Open Source Software
Format: Online Article RSS Article
Published: 2026
Subjects: Computer Science & Information Science; Computer Science & IT; Engineering & Technology
_version_ 1864030191032991748
collection WordPress RSS
FRELIP Feed Integration
container_title Journal of Open Source Software
discipline_display Engineering & Technology
discipline_facet Engineering & Technology
format Online Article
RSS Article
genre Journal Article
id rss_article:9040
institution FRELIP
journal_source_facet Journal of Open Source Software
publishDate 2026
publishDateSort 2026
record_format rss_article
sub_discipline_display Computer Science & IT
sub_discipline_facet Computer Science & IT
subject_display Computer Science & Information Science
Computer Science & IT
Engineering & Technology
subject_facet Computer Science & Information Science
Computer Science & IT
Engineering & Technology
title KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation and Streaming Offset Bundling for Julia NLP
topic Computer Science & Information Science
Computer Science & IT
Engineering & Technology
url https://joss.theoj.org/papers/10.21105/joss.09348