ErgoX.TokenX.HuggingFace.Runtime.iOS 0.22.1



ErgoX TokenX ML NLP Tokenizers

.NET bindings for HuggingFace Tokenizers with comprehensive testing and multi-platform support.

Why ErgoX.TokenX?

TL;DR: Microsoft.ML.Tokenizers requires manual configuration per model. ErgoX.TokenX provides a one-line AutoTokenizer.Load() that is compatible with the HuggingFace ecosystem, verified byte-for-byte against Python by 2,500+ cross-language tests.

The Problem with Microsoft.ML.Tokenizers

While Microsoft.ML.Tokenizers offers exceptional raw performance for GPT models (see benchmarks), working with it reveals significant friction:

🔍 HuggingFace Tokenizer Structure

When you download a HuggingFace tokenizer (AutoTokenizer.from_pretrained in Python), you typically get:

model-name/
├── tokenizer.json          # Serialized tokenizer (model, pre-tokenizer, normalizer, post-processor)
├── tokenizer_config.json   # Metadata (model type, special tokens, casing, padding, truncation)
├── vocab.json              # BPE-based tokenizers (GPT-2, RoBERTa)
├── merges.txt              # BPE merge rules
└── special_tokens_map.json # Maps [CLS], [SEP], [PAD], [BOS], [EOS], etc.

HuggingFace's Python library merges these automatically. In Microsoft.ML.Tokenizers, you must be explicit — manually loading vocabulary files, configuring special tokens, and handling model-specific quirks.
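
To make the file layout above concrete, here is a minimal Python sketch that reports which of the listed assets are present in a downloaded model directory. The file names come from the listing; the function name `inventory` is hypothetical, and no file is treated as mandatory because the set that ships varies by model.

```python
from pathlib import Path

# File names taken from the directory listing above. Which ones a given
# model ships varies, so this only reports presence and enforces nothing.
TOKENIZER_FILES = (
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "special_tokens_map.json",
)


def inventory(model_dir):
    """Return {file name: True/False} for each known tokenizer asset."""
    root = Path(model_dir)
    return {name: (root / name).is_file() for name in TOKENIZER_FILES}
```

A loader that merges these files automatically would start from an inventory like this and then decide which sources to combine.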

⚠️ Pain Points Found
  1. Missing Special Tokens: Special tokens required by a model are not configured automatically; each one has to be set up by hand.
  2. No AutoTokenizer: There is no equivalent of Python's AutoTokenizer pattern, so .NET lagged behind the rapidly growing HuggingFace ecosystem.
  3. No Chat Templates: Instruction-tuned models (Llama, Mistral, Qwen) require chat templates — outside tokenizer scope in Microsoft.ML, but essential for real-world use.
  4. Limited Pre/Post-Processing: Advanced tokenization pipelines (preprocessing, postprocessing) are difficult to work with.
  5. Complex Overflow Handling: Token overflow scenarios and advanced use cases require significant boilerplate.

The ErgoX.TokenX Solution

Simple approach: HuggingFace's Rust implementation of Tokenizers is exposed to C# through C FFI bindings rather than reimplemented.

What You Get
  • One-Line Loading: AutoTokenizer.Load("model-name") — works like Python
  • 2,500+ Tests Verified: Byte-for-byte parity with Python HuggingFace Transformers
  • SHA256 Hash Verification: Every token output is checked against the Python reference (2,500+ test cases)
  • Chat Templates: Built-in support for HuggingFace chat templates
  • Multi-Modal Support: Whisper (ASR), CLIP (vision), LayoutLM (documents), TrOCR (OCR)
  • Advanced Features:
    • Token offsets for NER/question answering
    • Truncation and padding strategies
    • Attention masks, type IDs, special token handling
    • Pre-tokenization and post-processing pipelines
  • Production-Proven: Internally used since May 2025 without revisiting alternatives

// Microsoft.ML.Tokenizers - Manual configuration required
var vocab = File.ReadAllText("vocab.json");
var merges = File.ReadAllText("merges.txt");
var tokenizer = /* ...manual setup... */;

// ErgoX.TokenX - One line
var tokenizer = AutoTokenizer.Load("bert-base-uncased");

⏱️ When My Observations May Be Outdated

This project was developed internally in May 2025 for key internal implementations. Microsoft.ML.Tokenizers may have evolved since then; I have not revisited the alternatives because ErgoX.TokenX has met all my needs.

If you're choosing today:

  • High-throughput GPT-only services → Consider Microsoft.ML (accept manual config)
  • HuggingFace ecosystem compatibility → Use ErgoX.TokenX (for productivity)
  • Multi-modal models (Whisper, CLIP, etc.) → Use ErgoX.TokenX (only option)

Features

  • Cross-platform - Linux, Windows, macOS (x64 & ARM64)
  • Extensive test coverage across Linux, Windows, and macOS
  • Rust FFI bindings - High-performance C bindings layer
  • CI/CD integration - Automated testing and releases
  • Test reports - Published with every release
  • Code coverage - Tracked via Codecov
  • Sequence decoder combinator - Compose native decoders from .NET

Quick Start

Installation

Option 1: NuGet Package

dotnet add package ErgoX.TokenX.HuggingFace

The package includes pre-built native libraries for all supported platforms (Windows, Linux, macOS x64/ARM64).

Option 2: Manual Installation from Releases

Download the latest release from GitHub Releases:

  • Windows x64: tokenizers-c-win-x64.zip
  • Linux x64: tokenizers-c-linux-x64.tar.gz
  • macOS x64: tokenizers-c-osx-x64.tar.gz
  • macOS ARM64: tokenizers-c-osx-arm64.tar.gz

Extract and place native libraries in your project:

YourProject/
└── runtimes/
    ├── win-x64/native/tokenx_bridge.dll
    ├── linux-x64/native/libtokenx_bridge.so
    └── osx-x64/native/libtokenx_bridge.dylib
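
As a rough illustration of how the runtimes/ layout maps to a host machine, the Python sketch below picks the expected native library path for a given operating system and CPU. The library names and folders are taken from the tree above; the osx-arm64 entry is an assumption extrapolated from the release archive list, and `native_library_path` is a hypothetical helper, not part of the package.

```python
import platform

# Paths follow the runtimes/ tree above. The osx-arm64 entry is an
# assumption based on the release archive list, not on the tree itself.
NATIVE_LIBS = {
    ("windows", "x64"): "runtimes/win-x64/native/tokenx_bridge.dll",
    ("linux", "x64"): "runtimes/linux-x64/native/libtokenx_bridge.so",
    ("darwin", "x64"): "runtimes/osx-x64/native/libtokenx_bridge.dylib",
    ("darwin", "arm64"): "runtimes/osx-arm64/native/libtokenx_bridge.dylib",
}

# platform.machine() spells the same CPU differently per operating system.
_ARCH_ALIASES = {"amd64": "x64", "x86_64": "x64", "aarch64": "arm64"}


def native_library_path(system=None, machine=None):
    """Return the expected library path for the given or current host."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    arch = _ARCH_ALIASES.get(machine, machine)
    try:
        return NATIVE_LIBS[(system, arch)]
    except KeyError:
        raise LookupError(f"no native library listed for {system}/{arch}") from None
```

Calling `native_library_path()` with no arguments inspects the current machine; passing explicit values is useful when preparing a layout for another platform.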

Basic Usage

HuggingFace Tokenizers

using ErgoX.TokenX.HuggingFace;

// Load tokenizer automatically (like Python's AutoTokenizer)
using var tokenizer = AutoTokenizer.Load("bert-base-uncased");

// Encode text
var encoding = tokenizer.Tokenizer.Encode("Hello, world!", addSpecialTokens: true);
Console.WriteLine($"Tokens: {string.Join(", ", encoding.Tokens)}");
Console.WriteLine($"IDs: {string.Join(", ", encoding.Ids)}");

// Decode
var decoded = tokenizer.Tokenizer.Decode(encoding.Ids, skipSpecialTokens: true);
Console.WriteLine($"Decoded: {decoded}");

Note: For OpenAI GPT models, consider using Microsoft.ML.Tokenizers, which provides the TiktokenTokenizer class.

Running Examples

The repository includes ready-to-run examples with pre-configured models:

# HuggingFace comprehensive quickstart (16 examples)
cd examples/HuggingFace/Quickstart
dotnet run

# Other examples (require model downloads)
dotnet run --project examples/HuggingFace/AllMiniLmL6V2Console
dotnet run --project examples/HuggingFace/E5SmallV2Console
dotnet run --project examples/HuggingFace/AutoTokenizerPipelineExplorer

Quickstart Examples Overview

📗 HuggingFace Tokenizer Quickstart

16 comprehensive examples demonstrating:

  • Basic tokenization (WordPiece, Unigram, BPE)
  • Padding and truncation strategies
  • Text pair encoding for classification
  • Attention masks, type IDs, offset mapping
  • Chat template rendering with Llama 3
  • Vocabulary access, special tokens, batch processing
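
For readers new to padding and attention masks, the following Python sketch shows the idea independently of any tokenizer library. It assumes right-padding with a pad ID of 0 and uses made-up token IDs; the actual defaults depend on each model's tokenizer configuration, and `pad_batch` is a hypothetical name.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Right-pad (and truncate) ID sequences; return (ids, attention_masks)."""
    if max_length is None:
        # Default to the longest sequence in the batch.
        max_length = max((len(ids) for ids in batch_ids), default=0)
    padded, masks = [], []
    for ids in batch_ids:
        kept = list(ids[:max_length])            # truncate if too long
        fill = max_length - len(kept)            # number of pad positions
        padded.append(kept + [pad_id] * fill)
        masks.append([1] * len(kept) + [0] * fill)  # 1 = real token, 0 = padding
    return padded, masks


# Illustrative IDs only, not real tokenizer output.
ids, masks = pad_batch([[101, 7592, 102], [101, 102]])
# ids   == [[101, 7592, 102], [101, 102, 0]]
# masks == [[1, 1, 1], [1, 1, 0]]
```

The mask lets a model ignore the padded positions, which is why it is returned alongside the IDs.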

Models included: all-minilm-l6-v2, t5-small, meta-llama-3-8b-instruct

Documentation: Quickstart README | Full Docs

Development

Prerequisites

  • .NET SDK 8.0+
  • Rust 1.70+
  • Visual Studio 2022 (Windows) or equivalent C++ toolchain

Building

# Build Rust library
cd .ext/hf_bridge
cargo build --release

# Copy to .NET runtime folder (Windows x64 shown)
Copy-Item target/release/tokenx_bridge.dll ../../src/HuggingFace/runtimes/win-x64/native/ -Force

# Build .NET project
cd ../..
dotnet build --configuration Release

Testing

# Restore sanitized tokenizer fixtures (skips network downloads)
python tests/Py/Common/restore_test_data.py --force

# Run Rust tests
cd .ext/hf_bridge
cargo test --release

# Run .NET tests (from the repository root)
cd ../..
dotnet test --configuration Release

# Refresh HuggingFace parity fixtures (requires transformers/tokenizers)
python tests/Py/Huggingface/generate_benchmarks.py

Note: Ensure the active Python environment includes the transformers, tokenizers, and huggingface_hub packages so the generators can materialize tokenizer pipelines directly from each model asset.

Running the .NET parity suite now also emits dotnet-benchmark.json alongside the Python fixtures in tests/_testdata_huggingface/<model> so you can inspect the full decoded tokens produced by the managed implementation.
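
The hash-based parity idea can be sketched in a few lines of Python. This is only an illustration: the canonical encoding used here (compact JSON of the ID list) is an assumption, the function names are hypothetical, and the project's fixtures may serialize and hash token output differently.

```python
import hashlib
import json


def token_digest(token_ids):
    """SHA256 over a canonical JSON encoding of the ID list (assumed format)."""
    canonical = json.dumps([int(i) for i in token_ids], separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_parity(reference_ids, candidate_ids):
    """True when both implementations produced identical token IDs."""
    return token_digest(reference_ids) == token_digest(candidate_ids)
```

Comparing digests instead of full ID lists keeps fixtures small while still catching any single differing token.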

Expected Results: 37 passed, 0 skipped, 0 failed

See TESTING-CHECKLIST.md for detailed instructions.

CI/CD

Automated Testing

Every push and pull request triggers:

  1. Rust C Bindings Tests - 20 FFI layer tests on Linux, Windows, macOS
  2. .NET Integration Tests - 185 end-to-end tests on all platforms
  3. Coverage Reports - Uploaded to Codecov

Releases

Create a release by tagging:

git tag c-v0.22.2
git push origin c-v0.22.2

The release workflow will:

  1. ✅ Build binaries for 7 platforms
  2. ✅ Run full test suite (205 tests)
  3. ✅ Package test reports
  4. ✅ Create GitHub Release with:
    • Multi-platform binaries
    • Test reports archive (test-reports.tar.gz)
    • Checksums
    • Release notes with test results
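
Before extracting a downloaded archive, its checksum can be compared against the published value. The Python sketch below assumes the release checksums are SHA256 hex digests, which the release notes do not state explicitly; `sha256_of_file` and `matches_checksum` are hypothetical helper names.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare the file's digest with a published hex digest (SHA256 assumed)."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For example, `matches_checksum("tokenizers-c-linux-x64.tar.gz", published_value)` would return False for a corrupted or tampered download.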

See CI-CD-WORKFLOWS.md for complete documentation.

Test Reports

Every release includes test-reports.tar.gz containing:

  • TRX files - Machine-readable test results
  • HTML reports - Human-readable test results
  • Coverage reports - Code coverage analysis

Download from the Releases page.

Project Structure

TokenX/
├── .ext/
│   └── hf_bridge/                      # HuggingFace native bridge crate (Rust)
├── .github/
│   ├── workflows/                      # CI/CD workflows
│   │   ├── test-c-bindings.yml         # Rust tests + coverage
│   │   ├── test-dotnet.yml             # .NET tests + coverage
│   │   └── release-c-bindings.yml      # Multi-platform release
│   ├── CI-CD-WORKFLOWS.md              # CI/CD documentation
│   └── TESTING-CHECKLIST.md            # Quick reference
├── src/
│   └── HuggingFace/                    # Managed HuggingFace tokenizer bindings and interop
└── tests/
   ├── ErgoX.VecraX.ML.NLP.Tokenizers.HuggingFace.Tests/
   └── ErgoX.VecraX.ML.NLP.Tokenizers.Testing/   # Shared testing infrastructure

Supported Platforms

Platform      Architecture  Status
Linux         x64           ✅ Tested
Windows       x64           ✅ Tested
macOS         x64           ✅ Tested
macOS         ARM64         ✅ Built
iOS           ARM64         ✅ Built
Android       ARM64         ✅ Built
WebAssembly   wasm32        ✅ Built

Note: "Tested" platforms run full .NET test suite in CI. "Built" platforms compile successfully but tests are not run in CI.

Contributing

We welcome contributions. This project maintains strict 1:1 parity with the upstream HuggingFace Tokenizers Rust implementation. Any change that affects tokenization semantics must preserve byte-for-byte parity with the Python/HuggingFace reference unless explicitly documented and reviewed.

Primary contribution patterns

  • Routine tokenizer updates (most common): When the upstream Rust tokenizer is updated but the C FFI surface is unchanged, update the Rust crate version and rebuild the native artifacts:

    1. Fork and create a feature branch.
    2. Bump the tokenizer crate/Cargo dependency and update Cargo.lock/Cargo.toml as needed.
    3. Rebuild the native bridge (.ext/hf_bridge) and copy the outputs to the appropriate runtimes/ folder.
    4. Run the full parity suite and unit tests (see Testing section).
    5. Commit changes and open a PR that references the upstream tokenizer release and lists test results.
  • FFI or ABI changes (infrequent, higher cost): If upstream changes require modifications to the C FFI or introduce new capabilities, you must:

    1. Open an issue describing the required change and proposed approach before implementation.
    2. Fork and create a feature branch.
    3. Update Rust FFI bindings and the managed interop layer. Mirror native structs exactly and add unit tests asserting sizeof/layout parity where possible.
    4. Ensure every GCHandle, pinned buffer, and disposable resource has a verified disposal path.
    5. Add or update end-to-end parity tests, interop smoke tests, and any size/layout assertions.
    6. Update documentation, CHANGELOG, and bump package/native artifact versions.
    7. Submit a PR that includes a clear migration note and the downstream test artifacts demonstrating parity.
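
The sizeof/layout assertions mentioned in the FFI steps above can be illustrated with Python's ctypes, which applies the platform's C layout rules. The struct below is purely hypothetical and does not correspond to any type in the bridge; in the managed layer the equivalent check would compare the .NET struct's marshalled size and field offsets against the native definition.

```python
import ctypes


class EncodingView(ctypes.Structure):
    # Hypothetical struct: a pointer to 32-bit token IDs plus a length.
    _fields_ = [
        ("ids", ctypes.POINTER(ctypes.c_uint32)),
        ("len", ctypes.c_size_t),
    ]


def layout(struct_type):
    """Return (total size, {field name: byte offset}) for a ctypes struct."""
    offsets = {
        name: getattr(struct_type, name).offset
        for name, _ in struct_type._fields_
    }
    return ctypes.sizeof(struct_type), offsets


size, offsets = layout(EncodingView)
# Layout assertions: the pointer comes first, the length follows it directly.
assert offsets["ids"] == 0
assert offsets["len"] == ctypes.sizeof(ctypes.c_void_p)
assert size == ctypes.sizeof(ctypes.c_void_p) + ctypes.sizeof(ctypes.c_size_t)
```

Assertions like these fail at test time when a field is added, reordered, or resized on only one side of the boundary, which is cheaper than debugging memory corruption later.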

Required checks for every PR

  • Run the full test suite locally:
    • Rust tests: cargo test --release
    • .NET tests: dotnet test --configuration Release
    • Parity/fixtures refresh (when applicable): run Python generators only if you are updating parity fixtures and include generated artifacts in the test run.
  • Verify native outputs are placed in runtimes/*/native and referenced artifacts are updated.
  • Ensure Sonar/Analyzer findings are addressed: 0 new security/bug findings and no build warnings (follow the project's coding standards).
  • Add unit tests with >= 80% coverage for new features or changed code paths.
  • Update documentation and any example projects affected.
  • Provide a short PR checklist in the PR description listing the items above and linking to relevant upstream release notes.

Guidelines and best practices

  • Prefer minimal, backward-compatible changes. Maintain byte-for-byte parity unless a breaking change is approved and documented.
  • For large or invasive changes (FFI, memory model, threading), coordinate with maintainers via an issue and include performance/interop tests.
  • Never commit secrets or credentials. Use configuration providers for secrets.
  • Keep changes small and well-tested; include reproducible steps to validate local parity.

Getting help

  • Open an issue for design discussions or if you are unsure whether a change requires FFI edits.
  • Reference the coding standards and acceptance criteria in your PR to speed review.

Please ensure documentation and README sections are updated for any user-visible or behavioral changes introduced by your PR.

License

This project follows the license terms of the HuggingFace Tokenizers library.

Acknowledgments

Built on top of HuggingFace Tokenizers - an incredibly fast and versatile tokenization library.


Maintained by: ErgoX TokenX Team
Last Updated: October 17, 2025
Version: 0.22.1

Version Downloads Last Updated
0.22.1 344 11/1/2025