FlashTokenizer 1.0.2

.NET 8.0

dotnet add package FlashTokenizer --version 1.0.2

NuGet\Install-Package FlashTokenizer -Version 1.0.2

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="FlashTokenizer" Version="1.0.2" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

<PackageVersion Include="FlashTokenizer" Version="1.0.2" />

Directory.Packages.props

<PackageReference Include="FlashTokenizer" />

Project file

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.

paket add FlashTokenizer --version 1.0.2

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: FlashTokenizer, 1.0.2"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package FlashTokenizer@1.0.2

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

#addin nuget:?package=FlashTokenizer&version=1.0.2

Install as a Cake Addin

#tool nuget:?package=FlashTokenizer&version=1.0.2

Install as a Cake Tool

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

FlashTokenizer NuGet Package Guide

About FlashTokenizer

FlashTokenizer is a high-performance, production-ready tokenization library for .NET 8 applications. It provides fast implementations of two widely used tokenization algorithms: BERT WordPiece and GPT-2 style BPE (Byte Pair Encoding).

What is Tokenization?

Tokenization is the process of breaking down text into smaller units (tokens) that machine learning models can understand. These tokens can be:

Words or subwords (like "hello", "world")
Token IDs (numerical representations like [101, 7592, 2088, 102])
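
To make the two forms concrete, here is a minimal sketch that turns string tokens into IDs using a hand-written toy vocabulary. It assumes the BERT convention of wrapping a sequence in [CLS] and [SEP] markers, and the handful of entries is chosen only to reproduce the example IDs above. `ToyVocabulary` is a hypothetical helper for illustration, not part of FlashTokenizer; a real tokenizer loads its vocabulary from a file.

```csharp
using System;
using System.Collections.Generic;

// Demo: look up each token and wrap the sequence in [CLS] ... [SEP].
var ids = ToyVocabulary.ToIds(new[] { "hello", "world" });
Console.WriteLine(string.Join(", ", ids)); // prints: 101, 7592, 2088, 102

// Hypothetical helper for illustration only; not part of FlashTokenizer.
static class ToyVocabulary
{
    // Toy vocabulary: just enough entries to reproduce the example IDs.
    private static readonly Dictionary<string, int> Vocab = new()
    {
        ["[UNK]"] = 100,
        ["[CLS]"] = 101,
        ["[SEP]"] = 102,
        ["hello"] = 7592,
        ["world"] = 2088,
    };

    public static List<int> ToIds(IEnumerable<string> tokens)
    {
        var result = new List<int> { Vocab["[CLS]"] };
        foreach (var token in tokens)
        {
            // Tokens missing from the vocabulary fall back to the [UNK] ID.
            result.Add(Vocab.TryGetValue(token, out var id) ? id : Vocab["[UNK]"]);
        }
        result.Add(Vocab["[SEP]"]);
        return result;
    }
}
```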

Why FlashTokenizer?

Performance: Up to 12.7M tokens/sec throughput (FlashBertTokenizerParallel on a 4MB file; see Performance Comparison)
Flexible: 8 different tokenizer classes for various use cases
Optimized: SIMD acceleration, parallel processing, async streaming
Production-Ready: Memory efficient, well-tested, comprehensive documentation
Multi-Language: Supports Chinese, multilingual text processing
Easy Integration: Simple NuGet package, clean APIs

Key Features

BERT WordPiece: Fast subword tokenization with Aho-Corasick tries
GPT-2 BPE: Byte Pair Encoding for transformer models
Parallel Processing: Multi-threaded tokenization for large documents
Async Streaming: Memory-efficient file processing
Bidirectional Fallback: Improved quality with dual-direction tokenization
UTF-8 Optimized: Proper Unicode handling and accent stripping
Configurable: Extensive options for different use cases
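
For readers new to WordPiece, the sketch below shows the basic idea of subword splitting: greedy longest-match against a vocabulary, with continuation pieces marked by a "##" prefix. It is a deliberately naive version that shrinks a substring until it matches; it does not use the Aho-Corasick tries mentioned above and is not FlashTokenizer's implementation. `WordPieceSketch` and the tiny vocabulary are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

var vocab = new HashSet<string> { "token", "##ize", "##r", "##s" };
Console.WriteLine(string.Join(" ", WordPieceSketch.Split("tokenizers", vocab)));
// prints: token ##ize ##r ##s

// Hypothetical helper for illustration only; not part of FlashTokenizer.
static class WordPieceSketch
{
    public static List<string> Split(string word, ISet<string> vocab, string unknown = "[UNK]")
    {
        var pieces = new List<string>();
        int start = 0;
        while (start < word.Length)
        {
            int end = word.Length;
            string piece = "";
            bool found = false;
            // Shrink the candidate from the right until it is in the vocabulary.
            while (end > start)
            {
                piece = word.Substring(start, end - start);
                if (start > 0) piece = "##" + piece; // continuation of a word
                if (vocab.Contains(piece)) { found = true; break; }
                end--;
            }
            // If any part of the word cannot be matched, the whole word is unknown.
            if (!found) return new List<string> { unknown };
            pieces.Add(piece);
            start = end;
        }
        return pieces;
    }
}
```

The naive inner loop re-scans the word for every piece; trie-based matching avoids that repeated work, which is where optimized implementations gain their speed.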

Use Cases

AI/ML Pipelines: Preprocessing for BERT, GPT, and transformer models
Data Processing: Large-scale text analysis and ETL workflows
Search Systems: Text indexing and retrieval applications
NLP Applications: Chatbots, sentiment analysis, text classification
Document Processing: Academic papers, legal documents, content analysis
Multilingual Systems: International text processing workflows

Installation

dotnet add package FlashTokenizer

Or via Package Manager Console in Visual Studio:

Install-Package FlashTokenizer

Quick Start

Basic Usage

using System.Collections.Generic;
using FlashTokenizer;

// Simple string tokenization
var tokenizer = new Tokenizer();
List<string> tokens = tokenizer.Tokenize("Hello, world!");

// BERT WordPiece tokenization (vocab.txt is the model's vocabulary file)
var bertTokenizer = new FlashBertTokenizerOptimized("vocab.txt");
List<int> ids = bertTokenizer.Encode("Hello, world!");

Available Tokenizer Classes

1. `Tokenizer` - Simple String Tokenization

Basic text preprocessing that returns string tokens.

var tokenizer = new Tokenizer(
    doLowerCase: true,           // Convert to lowercase
    tokenizeChineseChars: true   // Add spaces around CJK characters
);

List<string> tokens = tokenizer.Tokenize("Hello, 世界!");
// Output: ["hello", ",", "世", "界", "!"]

Use cases:

Text preprocessing
Simple tokenization without subword splitting
When you need string tokens, not IDs

2. `FlashBertTokenizer` - Standard BERT WordPiece

Basic BERT tokenizer with WordPiece algorithm.

var tokenizer = new FlashBertTokenizer(
    vocabFile: "path/to/vocab.txt",
    doLowerCase: true,
    modelMaxLength: 512,         // Standard BERT length
    tokenizeChineseChars: true
);

// Encode text to token IDs
List<int> ids = tokenizer.Encode("Hello, world!");

// Decode back to text
string text = tokenizer.Decode(ids);

// With explicit parameters
List<int> ids2 = tokenizer.Encode(
    text: "Hello, world!",
    padding: "max_length",      // "max_length" or "longest"
    maxLength: 512
);

Use cases:

Standard BERT tokenization
Small to medium texts
When you need basic WordPiece functionality

3. `FlashBertTokenizerOptimized` - High-Performance BERT

Optimized version with better performance for production use.

var tokenizer = new FlashBertTokenizerOptimized(
    vocabFile: "vocab.txt",
    doLowerCase: true,
    modelMaxLength: -1,          // -1 = unlimited length
    tokenizeChineseChars: true
);

// For large documents, use unlimited length
List<int> ids = tokenizer.Encode(
    text: largeDocument,
    padding: "longest",          // No padding for large docs
    maxLength: -1               // Unlimited
);

Performance tips:

// Requires: using System.Diagnostics; (for Stopwatch)
// Warm up on a short prefix so JIT compilation does not skew the measurement
GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();
var warmupIds = tokenizer.Encode(text.Substring(0, Math.Min(1000, text.Length)));

// Timed run
GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();
var stopwatch = Stopwatch.StartNew();
var ids = tokenizer.Encode(text, "longest", -1);
stopwatch.Stop();

Use cases:

Production applications
Large documents (1KB - 1MB)
Performance-critical scenarios
Recommended for most use cases

4. `FlashBertTokenizerParallel` - Multi-threaded BERT

Parallel processing for very large documents.

var tokenizer = new FlashBertTokenizerParallel(
    vocabFile: "vocab.txt",
    doLowerCase: true,
    modelMaxLength: -1,
    tokenizeChineseChars: true,
    maxDegreeOfParallelism: Environment.ProcessorCount,  // Use all CPU cores
    chunkSize: 256 * 1024       // 256KB chunks
);

List<int> ids = tokenizer.Encode(veryLargeDocument);

// Don't forget to dispose
tokenizer.Dispose();

Configuration:

maxDegreeOfParallelism: Number of threads (default: CPU cores)
chunkSize: Size of text chunks in bytes (default: 256KB)
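
To illustrate what chunked, multi-threaded processing involves, here is a small sketch that cuts a text into chunks at whitespace boundaries (so no word is split) and then maps over the chunks in parallel while preserving order. It is an assumption about the general technique, not FlashTokenizer's actual chunking logic: `ChunkingSketch` is hypothetical, sizes are counted in characters rather than bytes, and word counting stands in for real tokenization.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var chunks = ChunkingSketch.SplitAtWhitespace("the quick brown fox jumps", 10);
foreach (var chunk in chunks) Console.WriteLine($"[{chunk}]");

// Chunks are independent, so they can be processed in parallel; AsOrdered keeps
// the results in document order. Word counting stands in for tokenization.
var wordCounts = chunks
    .AsParallel().AsOrdered().WithDegreeOfParallelism(2)
    .Select(c => c.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length)
    .ToList();
Console.WriteLine(string.Join(", ", wordCounts)); // prints: 2, 2, 1

// Hypothetical helper for illustration only; not part of FlashTokenizer.
static class ChunkingSketch
{
    public static List<string> SplitAtWhitespace(string text, int chunkSize)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        var chunks = new List<string>();
        int start = 0;
        while (start < text.Length)
        {
            int end = Math.Min(start + chunkSize, text.Length);
            if (end < text.Length)
            {
                // Move the cut back to just after the last whitespace in the window.
                int cut = end;
                while (cut > start && !char.IsWhiteSpace(text[cut - 1])) cut--;
                // If the window holds one long word with no whitespace, cut hard at 'end'.
                if (cut > start) end = cut;
            }
            chunks.Add(text.Substring(start, end - start));
            start = end;
        }
        return chunks;
    }
}
```

Smaller chunks spread work more evenly across threads but add per-chunk overhead, which is why the chunk size is worth tuning for your data.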

Use cases:

Very large files (> 1MB)
Multi-core systems
Batch processing

5. `AsyncTokenizerPipeline` - Async File Processing

Asynchronous streaming tokenization for files.

using var pipeline = new AsyncTokenizerPipeline(
    vocabFile: "vocab.txt",
    doLowerCase: true,
    modelMaxLength: -1,
    tokenizeChineseChars: true,
    maxDegreeOfParallelism: Environment.ProcessorCount,
    chunkSize: 128 * 1024,      // 128KB chunks
    bufferSize: 1024 * 1024     // 1MB buffer
);

// Process file directly
List<int> ids = await pipeline.ProcessFileAsync("large_file.txt");

// Process text asynchronously
List<int> ids2 = await pipeline.ProcessTextAsync(largeText);

Use cases:

File processing
Async/await patterns
Streaming scenarios
Memory-efficient processing

6. `FlashBertTokenizerBidirectional` - Robust Fallback

Uses bidirectional heuristic for improved quality.

var tokenizer = new FlashBertTokenizerBidirectional(
    vocabFile: "vocab.txt",
    doLowerCase: true,
    modelMaxLength: -1,
    tokenizeChineseChars: true
);

List<int> ids = tokenizer.Encode(
    text: complexText,
    padding: "longest",
    maxLength: -1
);

How it works:

Tokenizes text both forward and backward
Compares results using heuristics
Selects the better tokenization
Slightly slower but more robust
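
The steps above can be sketched with plain greedy longest-match segmentation run in both directions. The selection rule used here (prefer the result with fewer pieces, ties go to forward) is an assumed heuristic chosen for illustration; this guide does not specify which heuristics FlashTokenizer applies. `BidirectionalSketch` and the toy vocabulary are hypothetical.

```csharp
using System;
using System.Collections.Generic;

var vocab = new HashSet<string> { "play", "plays", "station", "t", "ation" };
Console.WriteLine(string.Join(" ", BidirectionalSketch.Forward("playstation", vocab)));  // plays t ation
Console.WriteLine(string.Join(" ", BidirectionalSketch.Backward("playstation", vocab))); // play station
Console.WriteLine(string.Join(" ", BidirectionalSketch.Best("playstation", vocab)));     // play station

// Hypothetical helper for illustration only; not part of FlashTokenizer.
static class BidirectionalSketch
{
    // Greedy longest match scanning left to right.
    public static List<string> Forward(string text, ISet<string> vocab)
    {
        var pieces = new List<string>();
        int start = 0;
        while (start < text.Length)
        {
            int end = text.Length;
            // Shrink from the right; a single character is always accepted as a fallback.
            while (end - start > 1 && !vocab.Contains(text.Substring(start, end - start))) end--;
            pieces.Add(text.Substring(start, end - start));
            start = end;
        }
        return pieces;
    }

    // Greedy longest match scanning right to left.
    public static List<string> Backward(string text, ISet<string> vocab)
    {
        var pieces = new List<string>();
        int end = text.Length;
        while (end > 0)
        {
            int start = 0;
            // Shrink from the left; a single character is always accepted as a fallback.
            while (end - start > 1 && !vocab.Contains(text.Substring(start, end - start))) start++;
            pieces.Add(text.Substring(start, end - start));
            end = start;
        }
        pieces.Reverse(); // restore reading order
        return pieces;
    }

    // Assumed heuristic: prefer the segmentation with fewer pieces; ties go to forward.
    public static List<string> Best(string text, ISet<string> vocab)
    {
        var forward = Forward(text, vocab);
        var backward = Backward(text, vocab);
        return backward.Count < forward.Count ? backward : forward;
    }
}
```

Running both passes roughly doubles the matching work, which is consistent with the "slightly slower but more robust" trade-off described above.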

Use cases:

Complex or ambiguous text
Quality-critical applications
When standard tokenization produces poor results

7. `BpeTokenizer` - GPT-2 Style BPE

Byte Pair Encoding for GPT-2 style models.

var tokenizer = new BpeTokenizer(
    vocabJsonPath: "vocab.json",
    mergesPath: "merges.txt"
);

List<int> ids = tokenizer.Encode("The quick brown fox jumps over the lazy dog");
string text = tokenizer.Decode(ids);
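
As background on what the merges file drives, the sketch below applies a ranked merge table to a single word: start from individual characters and repeatedly merge the adjacent pair with the best (lowest) rank until no listed pair remains. This is a simplified illustration under stated assumptions: it works on characters rather than GPT-2's byte-level alphabet, merges one pair occurrence per step, and uses a made-up three-entry merge table. `BpeSketch` is hypothetical and not part of FlashTokenizer.

```csharp
using System;
using System.Collections.Generic;

// Toy merge table, highest priority first (the role merges.txt plays for a real model).
var merges = new List<(string, string)> { ("l", "o"), ("lo", "w"), ("e", "r") };
Console.WriteLine(string.Join(" ", BpeSketch.Apply("lower", merges)));
// prints: low er

// Hypothetical helper for illustration only; not part of FlashTokenizer.
static class BpeSketch
{
    public static List<string> Apply(string word, IReadOnlyList<(string, string)> merges)
    {
        // Map each pair to its priority: a lower rank is merged earlier.
        var rank = new Dictionary<(string, string), int>();
        for (int i = 0; i < merges.Count; i++) rank[merges[i]] = i;

        // Start with one symbol per character.
        var symbols = new List<string>();
        foreach (char c in word) symbols.Add(c.ToString());

        while (symbols.Count > 1)
        {
            // Find the adjacent pair with the lowest rank (leftmost on ties).
            int bestIndex = -1, bestRank = int.MaxValue;
            for (int i = 0; i < symbols.Count - 1; i++)
            {
                if (rank.TryGetValue((symbols[i], symbols[i + 1]), out int r) && r < bestRank)
                {
                    bestRank = r;
                    bestIndex = i;
                }
            }
            if (bestIndex < 0) break; // no listed pair left to merge

            // Merge the pair into a single symbol.
            symbols[bestIndex] += symbols[bestIndex + 1];
            symbols.RemoveAt(bestIndex + 1);
        }
        return symbols;
    }
}
```

Each resulting symbol would then be looked up in the vocabulary (vocab.json for a GPT-2 style model) to obtain its token ID.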

Use cases:

GPT-2, GPT-3 style models
BPE-based tokenization
Non-BERT models

8. `FlashTokenizer` - Unified Facade

High-level facade that auto-selects the appropriate tokenizer.

// BERT WordPiece
var bertTokenizer = new FlashTokenizer(new TokenizerOptions
{
    VocabPath = "vocab.txt",
    DoLowerCase = true,
    ModelMaxLength = -1,        // Unlimited
    EnableBidirectional = false,
    Type = TokenizerType.Bert
});

// BPE
var bpeTokenizer = new FlashTokenizer(new TokenizerOptions
{
    Type = TokenizerType.BPE,
    BpeVocabJsonPath = "vocab.json",
    BpeMergesPath = "merges.txt"
});

// Enable bidirectional fallback
var robustTokenizer = new FlashTokenizer(new TokenizerOptions
{
    VocabPath = "vocab.txt",
    DoLowerCase = true,
    ModelMaxLength = -1,
    EnableBidirectional = true,  // More robust
    Type = TokenizerType.Bert
});

Performance Guidelines

Choosing the Right Tokenizer

Text Size	Recommended Class	Reason
< 1KB	`Tokenizer`, `FlashBertTokenizer`	Simple, low overhead
1KB - 100KB	`FlashBertTokenizerOptimized`	Best single-thread performance
100KB - 10MB	`FlashBertTokenizerParallel`	Multi-threading helps
> 10MB	`AsyncTokenizerPipeline`	Memory-efficient streaming
Any size + quality	`FlashBertTokenizerBidirectional`	Most robust

Performance Best Practices

1. Use Unlimited Length for Large Documents

// ✅ Good - unlimited length
var unlimitedTokenizer = new FlashBertTokenizerOptimized("vocab.txt", true, -1);
var allIds = unlimitedTokenizer.Encode(text, "longest", -1);

// ❌ Bad - causes early stopping
var cappedTokenizer = new FlashBertTokenizerOptimized("vocab.txt", true, 512);
var truncatedIds = cappedTokenizer.Encode(text);  // Stops at 512 tokens

2. Proper Padding for Your Use Case

// For large documents (no padding needed)
var documentIds = tokenizer.Encode(text, "longest", -1);

// For fixed-size batches
var batchIds = tokenizer.Encode(text, "max_length", 512);
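
As a rough picture of what a fixed-length result looks like, the sketch below pads a short ID list up to a target length and cuts a long one down to it. It assumes padding ID 0 (the usual [PAD] entry in BERT vocabularies) and naive truncation from the right; how FlashTokenizer itself pads or truncates is not described in this guide. `PaddingSketch` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

var ids = new List<int> { 101, 7592, 2088, 102 };
Console.WriteLine(string.Join(", ", PaddingSketch.ToFixedLength(ids, 6))); // prints: 101, 7592, 2088, 102, 0, 0
Console.WriteLine(string.Join(", ", PaddingSketch.ToFixedLength(ids, 3))); // prints: 101, 7592, 2088

// Hypothetical helper for illustration only; not part of FlashTokenizer.
static class PaddingSketch
{
    public static List<int> ToFixedLength(IReadOnlyList<int> ids, int maxLength, int padId = 0)
    {
        var result = new List<int>(maxLength);
        for (int i = 0; i < maxLength; i++)
        {
            // Copy real IDs while they last, then fill the rest with the pad ID.
            result.Add(i < ids.Count ? ids[i] : padId);
        }
        return result;
    }
}
```

Fixed-length output is what lets several sequences be stacked into one rectangular batch; for a single large document the extra padding is wasted work.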

3. Warmup for Benchmarking

// Requires: using System.Diagnostics; (for Stopwatch)
// Warm up the JIT and settle the GC before measuring
GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();
var warmup = tokenizer.Encode("warmup text");

// Actual measurement
var sw = Stopwatch.StartNew();
var ids = tokenizer.Encode(actualText);
sw.Stop();

4. Parallel Processing Configuration

var tokenizer = new FlashBertTokenizerParallel(
    "vocab.txt", true, -1, true,
    Environment.ProcessorCount,  // Match CPU cores
    256 * 1024                  // Tune chunk size for your data
);

5. Memory Management

// Dispose parallel tokenizers
using var tokenizer = new FlashBertTokenizerParallel(...);

// Or manually
var tokenizer = new FlashBertTokenizerParallel(...);
try 
{
    var ids = tokenizer.Encode(text);
}
finally 
{
    tokenizer.Dispose();
}

Common Usage Patterns

Pattern 1: Simple Application

using FlashTokenizer;

class SimpleApp
{
    private static readonly FlashBertTokenizerOptimized _tokenizer = 
        new("vocab.txt", true, -1);
    
    public List<int> TokenizeText(string text)
    {
        return _tokenizer.Encode(text, "longest", -1);
    }
}

Pattern 2: Batch Processing

public async Task<List<List<int>>> ProcessFiles(string[] filePaths)
{
    using var pipeline = new AsyncTokenizerPipeline(
        "vocab.txt", true, -1, true,
        Environment.ProcessorCount, 128 * 1024, 1024 * 1024);
    
    var results = new List<List<int>>();
    foreach (var filePath in filePaths)
    {
        var ids = await pipeline.ProcessFileAsync(filePath);
        results.Add(ids);
    }
    return results;
}

Pattern 3: Configuration-Driven

public class TokenizerFactory
{
    public static ITokenizer Create(string configType, string vocabPath)
    {
        return configType.ToLowerInvariant() switch  // culture-independent comparison
        {
            "fast" => new FlashBertTokenizerOptimized(vocabPath, true, -1),
            "parallel" => new FlashBertTokenizerParallel(vocabPath, true, -1, true, 
                Environment.ProcessorCount, 256 * 1024),
            "robust" => new FlashBertTokenizerBidirectional(vocabPath, true, -1),
            _ => new FlashBertTokenizer(vocabPath, true, -1)
        };
    }
}

Pattern 4: Quality vs Performance Trade-off

// Create each tokenizer once; constructing one per call reloads the vocabulary every time
private static readonly FlashBertTokenizerOptimized _fastTokenizer =
    new("vocab.txt", true, -1);
private static readonly FlashBertTokenizerBidirectional _robustTokenizer =
    new("vocab.txt", true, -1);

public List<int> TokenizeWithFallback(string text)
{
    // Try fast tokenizer first
    var ids = _fastTokenizer.Encode(text, "longest", -1);
    
    // If result seems poor, use bidirectional.
    // ShouldUseBidirectional is a quality check you supply yourself.
    if (ShouldUseBidirectional(text, ids))
    {
        ids = _robustTokenizer.Encode(text, "longest", -1);
    }
    
    return ids;
}

Troubleshooting

Common Issues

1. Incomplete Tokenization

// Problem: Early stopping due to max length
var truncatedIds = tokenizer.Encode(text);  // Uses default max length

// Solution: Explicit unlimited
var allIds = tokenizer.Encode(text, "longest", -1);

2. Memory Issues

// Problem: Not disposing parallel tokenizers
var tokenizer = new FlashBertTokenizerParallel(...);
// Memory leak!

// Solution: Use using statement
using var tokenizer = new FlashBertTokenizerParallel(...);

3. Circular Dependency (NuGet)

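Error NU1108: Cycle detected
FlashTokenizer -> FlashTokenizer (>= 1.0.1)

Cause: your own project has the same name as the package, so NuGet resolves the package reference back to the project itself.

Solution: Rename your project to something other than "FlashTokenizer".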

Performance Comparison

Expected performance on a 4MB file (~759K tokens):

Tokenizer	Time	Throughput	Memory	Use Case
`FlashBertTokenizer`	~200ms	~3.8M tokens/sec	~500MB	Standard
`FlashBertTokenizerOptimized`	~110ms	~6.9M tokens/sec	~740MB	Recommended
`FlashBertTokenizerParallel`	~60ms	~12.7M tokens/sec	~800MB	Large files
`AsyncTokenizerPipeline`	~80ms	~9.5M tokens/sec	~600MB	File processing
`FlashBertTokenizerBidirectional`	~150ms	~5.1M tokens/sec	~750MB	Quality-first

Results may vary based on hardware and text complexity.
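
The throughput column follows directly from the token count and the elapsed time. A minimal sketch of that arithmetic, useful for checking your own measurements against the table (`ThroughputSketch` is a hypothetical helper name):

```csharp
using System;
using System.Globalization;

// 759K tokens in 60 ms, the FlashBertTokenizerParallel row above.
double perSecond = ThroughputSketch.TokensPerSecond(759_000, 60);
Console.WriteLine((perSecond / 1e6).ToString("F2", CultureInfo.InvariantCulture) + "M tokens/sec");

// Hypothetical helper for illustration only; not part of FlashTokenizer.
static class ThroughputSketch
{
    // tokens per second = token count / elapsed seconds
    public static double TokensPerSecond(long tokenCount, double elapsedMilliseconds)
        => tokenCount / (elapsedMilliseconds / 1000.0);
}
```

For the 60 ms row this gives about 12.65M tokens/sec, which the table rounds to ~12.7M.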

Advanced Configuration

Custom Chunk Sizes

// For memory-constrained environments
var tokenizer = new FlashBertTokenizerParallel(
    "vocab.txt", true, -1, true,
    maxDegreeOfParallelism: 2,    // Fewer threads
    chunkSize: 64 * 1024         // Smaller chunks
);

// For high-memory systems
var tokenizer = new FlashBertTokenizerParallel(
    "vocab.txt", true, -1, true,
    maxDegreeOfParallelism: Environment.ProcessorCount * 2,
    chunkSize: 1024 * 1024       // 1MB chunks
);

Custom Buffer Sizes

using var pipeline = new AsyncTokenizerPipeline(
    "vocab.txt", true, -1, true,
    Environment.ProcessorCount,
    chunkSize: 256 * 1024,
    bufferSize: 4 * 1024 * 1024  // 4MB buffer for large files
);

Integration Examples

ASP.NET Core Service

public void ConfigureServices(IServiceCollection services)
{
    services.AddSingleton<ITokenizer>(provider =>
        new FlashBertTokenizerOptimized("vocab.txt", true, -1));
}

[ApiController]
public class TokenizerController : ControllerBase
{
    private readonly ITokenizer _tokenizer;
    
    public TokenizerController(ITokenizer tokenizer)
    {
        _tokenizer = tokenizer;
    }
    
    [HttpPost("tokenize")]
    public ActionResult<List<int>> Tokenize([FromBody] string text)
    {
        var ids = _tokenizer.Encode(text);
        return Ok(ids);
    }
}

Console Application

class Program
{
    static async Task Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("Usage: app <vocab_path> <input_file>");
            return;
        }
        
        string vocabPath = args[0];
        string inputFile = args[1];
        
        using var pipeline = new AsyncTokenizerPipeline(
            vocabPath, true, -1, true,
            Environment.ProcessorCount, 128 * 1024, 1024 * 1024);
        
        var stopwatch = Stopwatch.StartNew();
        var ids = await pipeline.ProcessFileAsync(inputFile);
        stopwatch.Stop();
        
        Console.WriteLine($"Tokenized {ids.Count:N0} tokens in {stopwatch.Elapsed.TotalMilliseconds:F2}ms");
        Console.WriteLine($"Throughput: {ids.Count / stopwatch.Elapsed.TotalSeconds:F0} tokens/sec");
    }
}

Documentation

GitHub Repository: FlashTokenizer on GitHub

Getting Help

Issues: Report bugs on GitHub Issues
Feature Requests: Suggest improvements via GitHub Discussions
Documentation: Check this guide and README.md on GitHub
Community: Join discussions in the repository

Contributing

We welcome contributions! Please see our contributing guidelines in the repository.

Performance Benchmarks

Real-world performance results:

Hardware: Modern multi-core CPU
Test File: 4MB document (~759K tokens)
Best Result: 60ms processing time (12.7M tokens/sec)

License

FlashTokenizer is released under the MIT License. See the LICENSE file in the repository for details.

Changelog

Version 1.0.1

Initial NuGet release
Complete BERT WordPiece implementation
GPT-2 BPE support
Parallel and async processing
Comprehensive documentation
Production-ready performance

Product	Compatible and additional computed target framework versions.
.NET	net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

net8.0
- No dependencies.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
1.0.2	231	10/2/2025
1.0.1	224	10/1/2025
