UAX29 3.0.0

dotnet add package UAX29 --version 3.0.0

This package tokenizes (splits) text into words, sentences and graphemes, following the Unicode text segmentation standard (UAX #29) for Unicode version 15.0.0.

Why tokenize?

Any time our code operates on individual words, we are tokenizing. Often, we do it ad hoc, such as splitting on spaces, which gives inconsistent results. The Unicode standard is better: it is multi-lingual, and handles punctuation, special characters, etc.
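To make the "ad hoc" problem concrete, here is a minimal sketch using only the .NET standard library (no UAX29 calls). It shows how splitting on spaces leaves punctuation glued to the words; the sample sentence is just an illustration.

```csharp
using System;

var text = "Hello, world. How are you?";

// Ad hoc tokenizing: split on a single space character.
var parts = text.Split(' ');

foreach (var part in parts)
{
    Console.WriteLine(part);
}

// Prints:
// Hello,
// world.
// How
// are
// you?
// The punctuation stays attached to the neighboring word.
```

A search for the word "world" would miss the token "world." here, which is the kind of inconsistency a standard segmentation avoids.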

Example

⚠️ This documentation on main refers to v3, which is not yet published on NuGet. See v2 documentation until then.

dotnet add package UAX29
using UAX29;
using System.Text;

var example = "Hello, 🌏 world. 你好，世界.";

// The tokenizer can split words, graphemes or sentences.
// It operates on strings, UTF-8 bytes, and streams.

var words = Split.Words(example);

// Iterate over the tokens
foreach (var word in words)
{
    // word is ReadOnlySpan<char>
    // If you need it back as a string:
    Console.WriteLine(word.ToString());
}

/*
Hello
,

🌏

world
.

你
好
，
世
界
.
*/

var utf8bytes = Encoding.UTF8.GetBytes(example);
var graphemes = Split.Graphemes(utf8bytes);

// Iterate over the tokens
foreach (var grapheme in graphemes)
{
    // grapheme is a ReadOnlySpan<byte> of UTF-8 bytes
    // If you need it back as a string:
    var s = Encoding.UTF8.GetString(grapheme);
    Console.WriteLine(s);
}

/*
H
e
l
l
o
,

🌏

w
o
r
l
d
.

你
好
，
世
界
.
*/

There are also optional extension methods in the spirit of string.Split:

using UAX29.Extensions;

example.SplitWords();

Data types

For UTF-8 bytes, pass byte[], Span<byte> or Stream; the resulting tokens will be ReadOnlySpan<byte>.

For strings/chars, pass string, char[], Span<char> or TextReader/StreamReader; the resulting tokens will be ReadOnlySpan<char>.

If you have Memory<byte|char>, pass Memory.Span.
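As a rough sketch of the input types described above, the following passes a Memory<char> (via its Span) and a Stream of UTF-8 bytes to Split.Words. It relies only on what the text states: that these input types are accepted and that tokens come back as spans. The sample text and variable names are illustrative.

```csharp
using System;
using System.IO;
using System.Text;
using UAX29;

var text = "Hello, world.";

// Memory<char>: pass its Span. Tokens are ReadOnlySpan<char>.
Memory<char> memory = text.ToCharArray();
var fromMemory = 0;
foreach (var word in Split.Words(memory.Span))
{
    fromMemory++;
}

// Stream of UTF-8 bytes. Tokens are ReadOnlySpan<byte>.
using var stream = new MemoryStream(Encoding.UTF8.GetBytes(text));
var fromStream = 0;
foreach (var word in Split.Words(stream))
{
    fromStream++;
}

Console.WriteLine($"{fromMemory} tokens from Memory, {fromStream} tokens from Stream");
```

Both loops should see the same tokens, since the stream carries the same text encoded as UTF-8.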

Conformance

We use the official Unicode test suites.

This is the same spec that is implemented in Lucene's StandardTokenizer.

Performance

When tokenizing words, I get around 120 MB/s on my MacBook M2. For typical text, that's around 30 million tokens/s. See the benchmarks for details.

The tokenizer is implemented as a ref struct, so you should see zero allocations for static text such as byte[] or string/char.

Calling Split.Words returns a lazy enumerator, and will not allocate per-token. There are ToList and ToArray methods for convenience, which will allocate.
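A small sketch of the two styles mentioned above: iterating the lazy enumerator, and materializing the tokens with ToList. The text confirms that ToList exists; the assumption here is that it can be called directly on the result of Split.Words and returns a collection with a Count property.

```csharp
using System;
using UAX29;

var text = "The quick brown fox.";

// Lazy: tokens are produced one at a time as spans over the input.
var count = 0;
foreach (var word in Split.Words(text))
{
    count++;
}

// Convenience: materialize every token up front (this allocates).
// Assumption: ToList is callable on the enumerator returned by Split.Words.
var list = Split.Words(text).ToList();

Console.WriteLine($"{count} tokens counted lazily, {list.Count} tokens in the list");
```

Prefer the foreach form on hot paths, and reach for ToList or ToArray when the tokens need to outlive the loop.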

For Stream or TextReader/StreamReader, a buffer needs to be allocated behind the scenes. You can specify the size when calling Split.Words. You can also optionally pass your own byte[] or char[] to do your own allocation, perhaps with ArrayPool. Or, you can re-use the buffer by calling SetStream on an existing tokenizer, which will avoid re-allocation.

Options

Pass Options.OmitWhitespace if you would like whitespace-only tokens not to be returned (for words only).
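For illustration, a minimal sketch of the option in use. The text names Options.OmitWhitespace but not the call shape, so passing it as the second argument to Split.Words is an assumption here.

```csharp
using System;
using UAX29;

var text = "Hello, world.";

// Default: whitespace-only tokens are returned along with everything else.
var withWhitespace = 0;
foreach (var word in Split.Words(text))
{
    withWhitespace++;
}

// Assumption: the option is passed as the second argument.
var withoutWhitespace = 0;
foreach (var word in Split.Words(text, Options.OmitWhitespace))
{
    Console.WriteLine(word.ToString());
    withoutWhitespace++;
}

Console.WriteLine($"{withWhitespace} tokens by default, {withoutWhitespace} with OmitWhitespace");
```

Punctuation tokens are still returned; only tokens consisting entirely of whitespace are dropped.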

Invalid inputs

The tokenizer expects valid (decodable) UTF-8 bytes or UTF-16 chars as input. We make an effort to ensure that all bytes will be returned even if invalid, i.e. to be lossless in any case, though the resulting tokenization may not be useful. Garbage in, garbage out.
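One way to check the lossless property described above is to sum the lengths of the returned tokens and compare against the input length. This is only a sketch: the byte sequence is a made-up example containing bytes (0xFF, 0xFE) that are never valid in UTF-8, and where the token boundaries fall around them is not specified.

```csharp
using System;
using UAX29;

// "ab", two invalid bytes, a space, then "c".
byte[] input = { (byte)'a', (byte)'b', 0xFF, 0xFE, (byte)' ', (byte)'c' };

// Add up the bytes covered by all tokens.
var total = 0;
foreach (var token in Split.Words(input))
{
    total += token.Length;
}

// If tokenization is lossless, every input byte is covered by some token.
Console.WriteLine(total == input.Length ? "lossless" : "bytes were dropped");
```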

Major version changes

v2 → v3

Renamed methods:

Tokenizer.GetWords(input) → Split.Words(input)

v1 → v2

Renamed package, namespace and methods:

dotnet add package uax29.net → dotnet add package UAX29

using uax29 → using UAX29

Tokenizer.Create(input) → Tokenizer.GetWords(input)

Tokenizer.Create(input, TokenType.Graphemes) → Tokenizer.GetGraphemes(input)

Prior art

clipperhouse/uax29

I previously implemented this for Go.

StringInfo.GetTextElementEnumerator

The .NET standard library has a similar enumerator for graphemes.
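For comparison, here is a short sketch of the standard library enumerator, which works on strings and returns each text element as a newly allocated string. It uses only System.Globalization; the sample text is illustrative.

```csharp
using System;
using System.Globalization;

var text = "Hello, 🌏";

// TextElementEnumerator walks the string one grapheme (text element) at a time.
var enumerator = StringInfo.GetTextElementEnumerator(text);

var count = 0;
while (enumerator.MoveNext())
{
    // GetTextElement returns the current text element as a string.
    Console.WriteLine(enumerator.GetTextElement());
    count++;
}

// The globe emoji is two UTF-16 chars but a single text element.
Console.WriteLine($"{text.Length} chars, {count} text elements");
// Prints: 9 chars, 8 text elements
```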

Other language implementations

Java

JavaScript

Rust

Python

Target frameworks

net8.0 is compatible and is the framework included in the package. net9.0 and net10.0, along with the platform-specific variants of all three (android, browser, ios, maccatalyst, macos, tvos, windows), were computed as compatible.

Dependencies

No dependencies.


Version  Downloads  Last Updated
3.0.0    1,181      7/22/2024
2.2.0    166        7/9/2024
2.1.0    161        7/7/2024
2.0.3    159        6/21/2024