ErgoX.TokenX.HuggingFace.Runtime.iOS 0.22.1

.NET 8.0

dotnet add package ErgoX.TokenX.HuggingFace.Runtime.iOS --version 0.22.1

NuGet\Install-Package ErgoX.TokenX.HuggingFace.Runtime.iOS -Version 0.22.1

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="ErgoX.TokenX.HuggingFace.Runtime.iOS" Version="0.22.1" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

<PackageVersion Include="ErgoX.TokenX.HuggingFace.Runtime.iOS" Version="0.22.1" />
                    

                            Directory.Packages.props

<PackageReference Include="ErgoX.TokenX.HuggingFace.Runtime.iOS" />
                    

                            Project file

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.

paket add ErgoX.TokenX.HuggingFace.Runtime.iOS --version 0.22.1

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: ErgoX.TokenX.HuggingFace.Runtime.iOS, 0.22.1"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package ErgoX.TokenX.HuggingFace.Runtime.iOS@0.22.1

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

#addin nuget:?package=ErgoX.TokenX.HuggingFace.Runtime.iOS&version=0.22.1
                    

                            Install as a Cake Addin

#tool nuget:?package=ErgoX.TokenX.HuggingFace.Runtime.iOS&version=0.22.1
                    

                            Install as a Cake Tool

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

ErgoX TokenX ML NLP Tokenizers

.NET bindings for HuggingFace Tokenizers with comprehensive testing and multi-platform support.

Why ErgoX.TokenX?

TL;DR: Microsoft.ML.Tokenizers requires manual configuration per model. ErgoX.TokenX provides seamless AutoTokenizer.Load() with HuggingFace ecosystem compatibility — 2,500+ Python-.NET tests verified with byte-to-byte Python parity.

The Problem with Microsoft.ML.Tokenizers

While Microsoft.ML.Tokenizers offers exceptional raw performance for GPT models (see benchmarks), working with it reveals significant friction:

🔍 HuggingFace Tokenizer Structure

When you download a HuggingFace tokenizer (AutoTokenizer.from_pretrained in Python), you typically get:

model-name/
├── tokenizer.json          # Serialized tokenizer (model, pre-tokenizer, normalizer, post-processor)
├── tokenizer_config.json   # Metadata (model type, special tokens, casing, padding, truncation)
├── vocab.json              # BPE-based tokenizers (GPT-2, RoBERTa)
├── merges.txt              # BPE merge rules
└── special_tokens_map.json # Maps [CLS], [SEP], [PAD], [BOS], [EOS], etc.

HuggingFace's Python library merges these automatically. In Microsoft.ML.Tokenizers, you must be explicit — manually loading vocabulary files, configuring special tokens, and handling model-specific quirks.

⚠️ Pain Points Found

Missing Special Tokens: Special tokens required by models are not automatically configured. You need manual attention to get it right.
No AutoTokenizer: The AutoTokenizer(...) pattern was missing, while HuggingFace's ecosystem was rapidly growing. .NET lagged behind.
No Chat Templates: Instruction-tuned models (Llama, Mistral, Qwen) require chat templates — outside tokenizer scope in Microsoft.ML, but essential for real-world use.
Limited Pre/Post-Processing: Advanced tokenization pipelines (preprocessing, postprocessing) are difficult to work with.
Complex Overflow Handling: Token overflow scenarios and advanced use cases require significant boilerplate.

The ErgoX.TokenX Solution

Simple approach: HuggingFace's Rust implementation of Tokenizers is ported via C FFI bindings into C#.

✅ What You Get

One-Line Loading: AutoTokenizer.Load("model-name") — works like Python
2,500+ Tests Verified: Byte-to-byte parity with Python HuggingFace Transformers
SHA256 Hash Verification: Every token output verified against Python reference (2,500+ test cases)
Chat Templates: Built-in support for Huggingface chat templates
Multi-Modal Support: Whisper (ASR), CLIP (vision), LayoutLM (documents), TrOCR (OCR)
Advanced Features:
- Token offsets for NER/question answering
- Truncation and padding strategies
- Attention masks, type IDs, special token handling
- Pre-tokenization and post-processing pipelines
Production-Proven: Internally used since May 2025 without revisiting alternatives

// Microsoft.ML.Tokenizers - Manual configuration required
var vocab = File.ReadAllText("vocab.json");
var merges = File.ReadAllText("merges.txt");
var tokenizer = /* ...manual setup... */;

// ErgoX.TokenX - One line
var tokenizer = AutoTokenizer.Load("bert-base-uncased");

⏱️ When My Observations May Be Outdated

This project was developed internally in May 2025 for key implementations. Microsoft.ML.Tokenizers may have evolved since then, but I never revisited alternatives — ErgoX.TokenX met all my needs.

If you're choosing today:

✅ High-throughput GPT-only services → Consider Microsoft.ML (accept manual config)
✅ HuggingFace ecosystem compatibility → Use ErgoX.TokenX (for productivity)
✅ Multi-modal models (Whisper, CLIP, etc.) → Use ErgoX.TokenX (only option)

Features

✅ Cross-platform - Linux, Windows, macOS (x64 & ARM64)
✅ Extensive test coverage across Linux, Windows, and macOS
✅ Rust FFI bindings - High-performance C bindings layer
✅ CI/CD integration - Automated testing and releases
✅ Test reports - Published with every release
✅ Code coverage - Tracked via Codecov
✅ Sequence decoder combinator - Compose native decoders from .NET

Quick Start

Installation

Option 1: NuGet Package (Recommended)

dotnet add package ErgoX.TokenX.HuggingFace

The package includes pre-built native libraries for all supported platforms (Windows, Linux, macOS x64/ARM64).

Option 2: Manual Installation from Releases

Download the latest release from GitHub Releases:

Windows x64: tokenizers-c-win-x64.zip
Linux x64: tokenizers-c-linux-x64.tar.gz
macOS x64: tokenizers-c-osx-x64.tar.gz
macOS ARM64: tokenizers-c-osx-arm64.tar.gz

Extract and place native libraries in your project:

YourProject/
└── runtimes/
    ├── win-x64/native/tokenx_bridge.dll
    ├── linux-x64/native/libtokenx_bridge.so
    └── osx-x64/native/libtokenx_bridge.dylib

Basic Usage

HuggingFace Tokenizers

using ErgoX.TokenX.HuggingFace;

// Load tokenizer automatically (like Python's AutoTokenizer)
using var tokenizer = AutoTokenizer.Load("bert-base-uncased");

// Encode text
var encoding = tokenizer.Tokenizer.Encode("Hello, world!", addSpecialTokens: true);
Console.WriteLine($"Tokens: {string.Join(", ", encoding.Tokens)}");
Console.WriteLine($"IDs: {string.Join(", ", encoding.Ids)}");

// Decode
var decoded = tokenizer.Tokenizer.Decode(encoding.Ids, skipSpecialTokens: true);
Console.WriteLine($"Decoded: {decoded}");

Note: For OpenAI GPT models, consider using Microsoft.ML.Tokenizers which provides TiktokenTokenizer class.

Running Examples

The repository includes ready-to-run examples with pre-configured models:

# HuggingFace comprehensive quickstart (16 examples)
cd examples/HuggingFace/Quickstart
dotnet run

# Other examples (require model downloads)
dotnet run --project examples/HuggingFace/AllMiniLmL6V2Console
dotnet run --project examples/HuggingFace/E5SmallV2Console
dotnet run --project examples/HuggingFace/AutoTokenizerPipelineExplorer

Quickstart Examples Overview

📗 HuggingFace Tokenizer Quickstart

16 comprehensive examples demonstrating:

Basic tokenization (WordPiece, Unigram, BPE)
Padding and truncation strategies
Text pair encoding for classification
Attention masks, type IDs, offset mapping
Chat template rendering with Llama 3
Vocabulary access, special tokens, batch processing

Models included: all-minilm-l6-v2, t5-small, meta-llama-3-8b-instruct

Documentation: Quickstart README | Full Docs

Development

Prerequisites

.NET SDK 8.0+
Rust 1.70+
Visual Studio 2022 (Windows) or equivalent C++ toolchain

Building

# Build Rust library
cd .ext/hf_bridge
cargo build --release

# Copy to .NET runtime folder
Copy-Item target/release/tokenx_bridge.dll ../src/HuggingFace/runtimes/win-x64/native/ -Force

# Build .NET project
cd ..
cd ..
dotnet build --configuration Release

Testing

# Restore sanitized tokenizer fixtures (skips network downloads)
python tests/Py/Common/restore_test_data.py --force

# Run Rust tests
cd .ext/hf_bridge
cargo test --release

# Run .NET tests
dotnet test --configuration Release

# Refresh HuggingFace parity fixtures (requires transformers/tokenizers)
python tests/Py/Huggingface/generate_benchmarks.py

> Ensure the active Python environment includes the `transformers`, `tokenizers`, and `huggingface_hub` packages so the generators can materialize tokenizer pipelines directly from each model asset.

Running the .NET parity suite now also emits dotnet-benchmark.json alongside the Python fixtures in tests/_testdata_huggingface/<model> so you can inspect the full decoded tokens produced by the managed implementation.

Expected Results: 37 passed, 0 skipped, 0 failed

See TESTING-CHECKLIST.md for detailed instructions.

CI/CD

Automated Testing

Every push and pull request triggers:

Rust C Bindings Tests - 20 FFI layer tests on Linux, Windows, macOS
.NET Integration Tests - 185 end-to-end tests on all platforms
Coverage Reports - Uploaded to Codecov

Releases

Create a release by tagging:

git tag c-v0.22.2
git push origin c-v0.22.2

The release workflow will:

✅ Build binaries for 7 platforms
✅ Run full test suite (205 tests)
✅ Package test reports
✅ Create GitHub Release with:
- Multi-platform binaries
- Test reports archive (test-reports.tar.gz)
- Checksums
- Release notes with test results

See CI-CD-WORKFLOWS.md for complete documentation.

Test Reports

Every release includes test-reports.tar.gz containing:

TRX files - Machine-readable test results
HTML reports - Human-readable test results
Coverage reports - Code coverage analysis

Download from the Releases page.

Project Structure

TokenX/
├── .ext/
│   └── hf_bridge/                      # HuggingFace native bridge crate (Rust)
├── .github/
│   ├── workflows/                      # CI/CD workflows
│   │   ├── test-c-bindings.yml         # Rust tests + coverage
│   │   ├── test-dotnet.yml             # .NET tests + coverage
│   │   └── release-c-bindings.yml      # Multi-platform release
│   ├── CI-CD-WORKFLOWS.md              # CI/CD documentation
│   └── TESTING-CHECKLIST.md            # Quick reference
├── src/
│   └── HuggingFace/                    # Managed HuggingFace tokenizer bindings and interop
└── tests/
   ├── ErgoX.VecraX.ML.NLP.Tokenizers.HuggingFace.Tests/
   └── ErgoX.VecraX.ML.NLP.Tokenizers.Testing/   # Shared testing infrastructure

Supported Platforms

Platform	Architecture	Status
Linux	x64	✅ Tested
Windows	x64	✅ Tested
macOS	x64	✅ Tested
macOS	ARM64	✅ Built
iOS	ARM64	✅ Built
Android	ARM64	✅ Built
WebAssembly	wasm32	✅ Built

Note: "Tested" platforms run full .NET test suite in CI. "Built" platforms compile successfully but tests are not run in CI.

Contributing

We welcome contributions. This project maintains strict 1:1 parity with the upstream HuggingFace Tokenizers Rust implementation. Any change that affects tokenization semantics must preserve byte-for-byte parity with the Python/HuggingFace reference unless explicitly documented and reviewed.

Primary contribution patterns

Routine tokenizer updates (most common): When the upstream Rust tokenizer is updated but the C FFI surface is unchanged, update the Rust crate version and rebuild the native artifacts:
1. Fork and create a feature branch.
2. Bump the tokenizer crate/Cargo dependency and update Cargo.lock/Cargo.toml as needed.
3. Rebuild the native bridge (.ext/hf_bridge) and copy the outputs to the appropriate runtimes/ folder.
4. Run the full parity suite and unit tests (see Testing section).
5. Commit changes and open a PR that references the upstream tokenizer release and lists test results.
FFI or ABI changes (infrequent, higher cost): If upstream changes require modifications to the C FFI or introduce new capabilities, you must:
1. Open an issue describing the required change and proposed approach before implementation.
2. Fork and create a feature branch.
3. Update Rust FFI bindings and the managed interop layer. Mirror native structs exactly and add unit tests asserting sizeof/layout parity where possible.
4. Ensure every GCHandle, pinned buffer, and disposable resource has a verified disposal path.
5. Add or update end-to-end parity tests, interop smoke tests, and any size/layout assertions.
6. Update documentation, CHANGELOG, and bump package/native artifact versions.
7. Submit a PR that includes a clear migration note and the downstream test artifacts demonstrating parity.

Required checks for every PR

Run the full test suite locally:
- Rust tests: cargo test --release
- .NET tests: dotnet test --configuration Release
- Parity/fixtures refresh (when applicable): run Python generators only if you are updating parity fixtures and include generated artifacts in the test run.
Verify native outputs are placed in runtimes/*/native and referenced artifacts are updated.
Ensure Sonar/Analyzer findings are addressed: 0 new security/bug findings, and no build warnings (follow project's coding standards).
Add unit tests with >= 80% coverage for new features or changed code paths.
Update documentation and any example projects affected.
Provide a short PR checklist in the PR description listing the items above and linking to relevant upstream release notes.

Guidelines and best practices

Prefer minimal, backward-compatible changes. Maintain byte-for-byte parity unless a breaking change is approved and documented.
For large or invasive changes (FFI, memory model, threading), coordinate with maintainers via an issue and include performance/interop tests.
Never commit secrets or credentials. Use configuration providers for secrets.
Keep changes small and well-tested; include reproducible steps to validate local parity.

Getting help

Open an issue for design discussions or if you are unsure whether a change requires FFI edits.
Reference the coding standards and acceptance criteria in your PR to speed review.

Please ensure documentation and README sections are updated for any user-visible or behavioral changes introduced by your PR.

License

This project follows the license terms of the HuggingFace Tokenizers library.

Acknowledgments

Built on top of HuggingFace Tokenizers - an incredible fast and versatile tokenization library.

Maintained by: ErgoX TokenX Team
Last Updated: October 17, 2025
Version: 0.22.1

Product	Compatible and additional computed target framework versions.
.NET	net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

Product

.NET

Compatible target framework(s)

Included target framework(s) (in package)

Learn more about Target Frameworks and .NET Standard.

net8.0
- No dependencies.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
0.22.1	344	11/1/2025