Chonk 0.0.4
Chonk
Chonk is a .NET library that makes it easy to split large texts into chunks that try to maintain semantic meaning. This functionality is often used as a preprocessing step when generating vector embeddings of text documents.
Features
Chonk supports:
- splitting text using a custom list of delimiters
- static extension methods for splitting English and Markdown text with sane delimiters
- splitting text using a user-provided custom length function (e.g. a function which tokenizes a string and returns the count of the tokens)
- associating chunks with the index of the section of the text they came from
Chonk does not yet support:
- splitting tokenized documents instead of strings
- creating chunks that overlap each other
- splitting streams
Chonk is inspired by LangChain's RecursiveCharacterTextSplitter and Microsoft Semantic Kernel's TextChunker.
Algorithm
Chonk uses a recursive splitting function to split text documents by an ordered list of delimiters.
- (Base case) If the length of the text (as measured with string.Length or a user-provided custom function) is less than or equal to the maxChunkSize, the text is returned.
- If there are no delimiters left in the list, it naively splits the text in half, calls itself on each half and returns the concatenated results.
- If the first delimiter is not present in the text, it calls itself on the text with the rest of the list of delimiters.
- If the first delimiter is present in the text, it splits the text into two sub-texts, calls itself on each of them and returns the concatenated results.
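The four cases above can be sketched in a few lines of C#. This is an illustrative sketch, not Chonk's actual implementation: `SplitRecursive` is a hypothetical helper, the delimiter list is made up for the example, and the choice to split at the delimiter occurrence closest to the middle of the text is an assumption (the description above does not say which occurrence is used).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating the recursion; delimiters are assumed to be non-empty strings.
static List<(string Text, int StartingPos)> SplitRecursive(
    string text, int offset, string[] delimiters, int maxChunkSize, Func<string, int> lengthFunc)
{
    // Base case: the text fits. A single character cannot be split any further.
    if (lengthFunc(text) <= maxChunkSize || text.Length <= 1)
        return new List<(string, int)> { (text, offset) };

    int mid = text.Length / 2;
    int splitAt = mid; // used as-is when no delimiters are left (naive split in half)

    if (delimiters.Length > 0)
    {
        string delimiter = delimiters[0];
        splitAt = -1;
        // Assumption: prefer the occurrence whose split point is closest to the middle.
        for (int i = text.IndexOf(delimiter, StringComparison.Ordinal);
             i >= 0;
             i = text.IndexOf(delimiter, i + 1, StringComparison.Ordinal))
        {
            int candidate = i + delimiter.Length; // the delimiter stays with the left part
            bool leavesTwoParts = candidate < text.Length;
            if (leavesTwoParts && (splitAt < 0 || Math.Abs(candidate - mid) < Math.Abs(splitAt - mid)))
                splitAt = candidate;
        }

        // No usable occurrence: retry with the remaining delimiters.
        if (splitAt < 0)
            return SplitRecursive(text, offset, delimiters[1..], maxChunkSize, lengthFunc);
    }

    // Split into two sub-texts, recurse on each and concatenate the results.
    var result = SplitRecursive(text[..splitAt], offset, delimiters, maxChunkSize, lengthFunc);
    result.AddRange(SplitRecursive(text[splitAt..], offset + splitAt, delimiters, maxChunkSize, lengthFunc));
    return result;
}

var document =
    "This is the string that we want to split. We want to try to split it into sentences if possible, but this sentence is long.";
var exampleDelimiters = new[] { ". ", ", ", " " }; // illustrative, not Chonk's built-in list
var chunks = SplitRecursive(document, 0, exampleDelimiters, 50, s => s.Length);

foreach (var (chunkText, startingPos) in chunks)
    Console.WriteLine($"[{startingPos}] {chunkText}");
```

Because every recursive call either shortens the text or shortens the delimiter list, the sketch always terminates, and the chunks concatenate back to the original text.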
Guarantees
- The length of each chunk will be less than or equal to the maxChunkSize
- When a custom length function is used, the length will be measured using the custom function
- When no custom lengthFunc is used, the length is measured by:
Func<string, int> lengthFunc = (text) => text.Length;
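As a rough illustration of what a custom length function might look like, the sketch below compares the default character count with a word count. The `wordCount` function is a hypothetical stand-in for a real tokenizer, not part of Chonk.

```csharp
using System;

// Default measure: number of UTF-16 code units in the string.
Func<string, int> defaultLength = text => text.Length;

// Hypothetical stand-in for a tokenizer: counts whitespace-separated words.
Func<string, int> wordCount = text =>
    text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries).Length;

var sample = "We want to try to split";
Console.WriteLine(defaultLength(sample)); // 23
Console.WriteLine(wordCount(sample));     // 6
```

A function like `wordCount` could then be passed as the `lengthFunc` argument of `Chonk.Chunk`, in which case `maxChunkSize` is interpreted in words rather than characters.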
Getting started
Chonk is available for download as a NuGet package.
dotnet add package Chonk
Documentation
The Chonk class contains utility methods for chunking text into IEnumerable<Chunk>s.
var document =
    "This is the string that we want to split. We want to try to split it into sentences if possible, but this sentence is long.";
var chunks = Chonk.Chunk(document, maxChunkSize: 50).ToList();
foreach (var chunk in chunks)
{
    Console.WriteLine($"{chunk.text} (starts at index {chunk.startingPos})");
}
// This is the string that we want to split. (starts at index 0)
// We want to try to split (starts at index 42)
// it into sentences if possible, (starts at index 66)
// but this sentence is long. (starts at index 97)
You can pass a custom Func<string, int> function for measuring the length of a string:
var document =
    "This is the string that we want to split. We want to try to split it into sentences if possible, but this sentence is long.";
// Measure length in rough "tokens" of four characters each instead of characters.
var chunksCustomLength = Chonk.Chunk(document, maxChunkSize: 30, lengthFunc: str => str.Length / 4).ToList();
foreach (var chunk in chunksCustomLength)
{
    Console.WriteLine($"{chunk.text} (starts at index {chunk.startingPos})");
}
// This is the string that we want to split. (starts at index 0)
// We want to try to split it into sentences if possible, but this sentence is long. (starts at index 42)
All string comparisons (such as when finding delimiters in the text) are done using StringComparison.Ordinal.
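In practice, ordinal comparison means that delimiters are matched exactly by their UTF-16 code units, so matching is case-sensitive and independent of the current culture. The short sketch below uses only the standard `string.IndexOf` overloads to show the difference. The sample text and delimiters are made up for illustration.

```csharp
using System;

var sampleText = "First sentence. second sentence. Third";

// Ordinal: ". S" does not match ". s", so nothing is found.
Console.WriteLine(sampleText.IndexOf(". S", StringComparison.Ordinal));           // -1

// For contrast, a case-insensitive ordinal search does find the first ". s".
Console.WriteLine(sampleText.IndexOf(". S", StringComparison.OrdinalIgnoreCase)); // 14

// An exact match is found under ordinal comparison.
Console.WriteLine(sampleText.IndexOf(". T", StringComparison.Ordinal));           // 31
```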
Benefits
TODO
Benchmarks
Chonk includes code for benchmarking using BenchmarkDotNet.
dotnet run --project Chonk.Benchmark -c Release
| Product | Versions (compatible and additional computed target frameworks) |
|---|---|
| .NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
| .NET Core | netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
| .NET Standard | netstandard2.1 is compatible. |
| MonoAndroid | monoandroid was computed. |
| MonoMac | monomac was computed. |
| MonoTouch | monotouch was computed. |
| Tizen | tizen60 was computed. |
| Xamarin.iOS | xamarinios was computed. |
| Xamarin.Mac | xamarinmac was computed. |
| Xamarin.TVOS | xamarintvos was computed. |
| Xamarin.WatchOS | xamarinwatchos was computed. |
Dependencies for .NETStandard 2.1:
- System.Memory (>= 4.5.5)