Microsoft.ML.Tokenizers
0.22.0-preview.24179.1
dotnet add package Microsoft.ML.Tokenizers --version 0.22.0-preview.24179.1
NuGet\Install-Package Microsoft.ML.Tokenizers -Version 0.22.0-preview.24179.1
<PackageReference Include="Microsoft.ML.Tokenizers" Version="0.22.0-preview.24179.1" />
paket add Microsoft.ML.Tokenizers --version 0.22.0-preview.24179.1
#r "nuget: Microsoft.ML.Tokenizers, 0.22.0-preview.24179.1"
// Install Microsoft.ML.Tokenizers as a Cake Addin
#addin nuget:?package=Microsoft.ML.Tokenizers&version=0.22.0-preview.24179.1&prerelease
// Install Microsoft.ML.Tokenizers as a Cake Tool
#tool nuget:?package=Microsoft.ML.Tokenizers&version=0.22.0-preview.24179.1&prerelease
About
Microsoft.ML.Tokenizers provides implementations of several tokenization algorithms used in NLP transforms.
Key Features
- Extensible tokenizer architecture that allows for specialization of Normalizer, PreTokenizer, Model/Encoder, Decoder
- BPE - Byte pair encoding model
- English Roberta model
- Tiktoken model
- Llama model
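The stages named in the first bullet can be illustrated with a small, library-independent sketch. The code below does not use Microsoft.ML.Tokenizers; every function name in it (`Normalize`, `PreTokenize`, `BpeEncodeWord`, `Encode`, `Decode`), the `</w>` end-of-word marker, and the three merge rules are hypothetical choices made for illustration. It assumes a lowercase normalizer, a whitespace pre-tokenizer, and a toy byte pair encoding model that applies merge rules in priority order; a trained BPE model would learn its merge rules from a corpus and would usually treat the word-boundary marker as part of the symbols being merged.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical merge rules, highest priority first (not taken from any real model).
(string Left, string Right)[] mergeRules = { ("l", "o"), ("lo", "w"), ("e", "r") };

List<string> encoded = Encode("Low lower", mergeRules);
Console.WriteLine(string.Join(" | ", encoded));
// prints: low</w> | low | er</w>

string roundTrip = Decode(encoded);
Console.WriteLine(roundTrip);
// prints: low lower

// Stage 1, normalizer: lowercase the input.
string Normalize(string text) => text.ToLowerInvariant();

// Stage 2, pre-tokenizer: split the normalized text into words on spaces.
string[] PreTokenize(string text) =>
    text.Split(' ', StringSplitOptions.RemoveEmptyEntries);

// Stage 3, model: start from single characters and apply each merge rule,
// in priority order, wherever its two symbols are adjacent.
List<string> BpeEncodeWord(string word, IReadOnlyList<(string Left, string Right)> merges)
{
    List<string> symbols = word.Select(c => c.ToString()).ToList();
    foreach (var (left, right) in merges)
    {
        for (int i = 0; i < symbols.Count - 1; i++)
        {
            if (symbols[i] == left && symbols[i + 1] == right)
            {
                symbols[i] = left + right;
                symbols.RemoveAt(i + 1);
            }
        }
    }
    return symbols;
}

// Runs stages 1 to 3 and marks the last piece of each word so that
// decoding can restore the spaces removed by the pre-tokenizer.
List<string> Encode(string text, IReadOnlyList<(string Left, string Right)> merges)
{
    var tokens = new List<string>();
    foreach (string word in PreTokenize(Normalize(text)))
    {
        List<string> pieces = BpeEncodeWord(word, merges);
        pieces[pieces.Count - 1] += "</w>";
        tokens.AddRange(pieces);
    }
    return tokens;
}

// Stage 4, decoder: concatenate the pieces and turn word markers back into spaces.
string Decode(IEnumerable<string> tokens) =>
    string.Concat(tokens).Replace("</w>", " ").TrimEnd();
```

Because the normalizer lowercases the text, the round trip returns `low lower` rather than the original `Low lower`; swapping in a different normalizer or pre-tokenizer changes only that stage, which is the kind of specialization the architecture bullet describes.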
How to Use
using Microsoft.ML.Tokenizers;
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
//
// Using Tiktoken Tokenizer
//
// initialize the tokenizer for `gpt-4` model
Tokenizer tokenizer = Tokenizer.CreateTiktokenForModel("gpt-4");
string source = "Text tokenization is the process of splitting a string into a list of tokens.";
Console.WriteLine($"Tokens: {tokenizer.CountTokens(source)}");
// prints: Tokens: 16
var trimIndex = tokenizer.LastIndexOfTokenCount(source, 5, out string processedText, out _);
Console.WriteLine($"5 tokens from end: {processedText.Substring(trimIndex)}");
// prints: 5 tokens from end: a list of tokens.
trimIndex = tokenizer.IndexOfTokenCount(source, 5, out processedText, out _);
Console.WriteLine($"5 tokens from start: {processedText.Substring(0, trimIndex)}");
// prints: 5 tokens from start: Text tokenization is the
IReadOnlyList<int> ids = tokenizer.EncodeToIds(source);
Console.WriteLine(string.Join(", ", ids));
// prints: 1199, 4037, 2065, 374, 279, 1920, 315, 45473, 264, 925, 1139, 264, 1160, 315, 11460, 13
//
// Using Llama Tokenizer
//
// Open stream of remote Llama tokenizer model data file
using HttpClient httpClient = new();
const string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model";
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);
// Create the Llama tokenizer using the remote stream
Tokenizer llamaTokenizer = Tokenizer.CreateLlama(remoteStream);
string input = "Hello, world!";
ids = llamaTokenizer.EncodeToIds(input);
Console.WriteLine(string.Join(", ", ids));
// prints: 1, 15043, 29892, 3186, 29991
Console.WriteLine($"Tokens: {llamaTokenizer.CountTokens(input)}");
// prints: Tokens: 5
Main Types
The main types provided by this library are:
- Microsoft.ML.Tokenizers.Tokenizer
- Microsoft.ML.Tokenizers.Bpe
- Microsoft.ML.Tokenizers.EnglishRoberta
- Microsoft.ML.Tokenizers.Tiktoken
- Microsoft.ML.Tokenizers.TokenizerDecoder
- Microsoft.ML.Tokenizers.Normalizer
- Microsoft.ML.Tokenizers.PreTokenizer
Feedback & Contributing
Microsoft.ML.Tokenizers is released as open source under the MIT license. Bug reports and contributions are welcome at the GitHub repository.
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
.NET Core | netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
.NET Standard | netstandard2.0 is compatible. netstandard2.1 was computed. |
.NET Framework | net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed. |
MonoAndroid | monoandroid was computed. |
MonoMac | monomac was computed. |
MonoTouch | monotouch was computed. |
Tizen | tizen40 was computed. tizen60 was computed. |
Xamarin.iOS | xamarinios was computed. |
Xamarin.Mac | xamarinmac was computed. |
Xamarin.TVOS | xamarintvos was computed. |
Xamarin.WatchOS | xamarinwatchos was computed. |
- .NETStandard 2.0
  - Google.Protobuf (>= 3.19.6)
  - System.Text.Json (>= 6.0.1)
- net8.0
  - Google.Protobuf (>= 3.19.6)
  - System.Text.Json (>= 6.0.1)
NuGet packages (3)
Showing the top 3 NuGet packages that depend on Microsoft.ML.Tokenizers:
Package | Description |
---|---|
Microsoft.ML.TorchSharp | Microsoft.ML.TorchSharp contains ML.NET integration of TorchSharp. |
Microsoft.Teams.AI | SDK focused on building AI based applications for Microsoft Teams. |
Microsoft.KernelMemory.AI.TikToken | Provide TikToken tokenizers in Kernel Memory |
GitHub repositories (2)
Showing the top 2 popular GitHub repositories that depend on Microsoft.ML.Tokenizers:
Repository | Description |
---|---|
microsoft/semantic-kernel | Integrate cutting-edge LLM technology quickly and easily into your apps |
lindexi/lindexi_gd | Code used in the blog |
Version | Downloads | Last updated |
---|---|---|
0.22.0-preview.24179.1 | 5,309 | 4/2/2024 |
0.22.0-preview.24162.2 | 13,384 | 3/13/2024 |
0.21.1 | 39,507 | 1/18/2024 |
0.21.0 | 43,469 | 11/27/2023 |
0.21.0-preview.23511.1 | 49,898 | 10/13/2023 |
0.21.0-preview.23266.6 | 48,142 | 5/17/2023 |
0.21.0-preview.22621.2 | 1,929 | 12/22/2022 |
0.20.1 | 66,603 | 2/1/2023 |
0.20.1-preview.22573.9 | 1,988 | 11/24/2022 |
0.20.0 | 25,513 | 11/8/2022 |
0.20.0-preview.22551.1 | 200 | 11/1/2022 |