Chonk 0.0.4

Chonk

Chonk is a .NET library that makes it easy to split large texts into chunks that try to maintain semantic meaning. This functionality is often used as a preprocessing step when generating vector embeddings of text documents.

Features

Chonk supports:

  • splitting text using a custom list of delimiters
  • static extension methods for splitting English and Markdown text with sane delimiters
  • splitting text using a user-provided custom length function (e.g. a function which tokenizes a string and returns the count of the tokens)
  • associating chunks with the index of the section of the text they came from

Chonk does not yet support:

  • splitting tokenized documents instead of strings
  • creating chunks that overlap each other
  • splitting streams

Chonk is inspired by LangChain's RecursiveCharacterTextSplitter and Microsoft Semantic Kernel's TextChunker.

Algorithm

Chonk uses a recursive splitting function to split text documents by an ordered list of delimiters.

  • (Base case) If the length of the text (as measured with string.Length or a user-provided custom function) is less than or equal to the maxChunkSize, the text is returned.
  • If there are no delimiters left in the list, it naively splits the text in half, calls itself on each half, and returns the concatenated results.
  • If the first delimiter is not present in the text, it calls itself on the text with the rest of the list of delimiters.
  • If the first delimiter is present in the text, it splits the text into two sub-texts, calls itself on each of them and returns the concatenated results.
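
The four cases above can be sketched as a small recursive function. This is a minimal illustration, not Chonk's actual implementation: the class name RecursiveSplitSketch and its Split method are hypothetical, and it assumes that the delimiter stays attached to the end of the left part and that the split happens at the occurrence closest to the middle of the text. The README does not specify either detail, so the chunks it produces may differ from Chonk's.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the recursive algorithm; not part of Chonk's API.
public static class RecursiveSplitSketch
{
    public static IEnumerable<string> Split(
        string text,
        int maxChunkSize,
        IReadOnlyList<string> delimiters,
        Func<string, int>? lengthFunc = null)
    {
        lengthFunc ??= s => s.Length;

        // Base case: the text fits (a single character cannot be split further).
        if (lengthFunc(text) <= maxChunkSize || text.Length <= 1)
            return new[] { text };

        int mid = text.Length / 2;

        // No delimiters left: split naively in half.
        if (delimiters.Count == 0)
            return Split(text[..mid], maxChunkSize, delimiters, lengthFunc)
                .Concat(Split(text[mid..], maxChunkSize, delimiters, lengthFunc));

        // Assumption: cut right after the delimiter occurrence closest to the middle.
        string delimiter = delimiters[0];
        int bestCut = -1;
        if (delimiter.Length > 0)
        {
            for (int i = text.IndexOf(delimiter, StringComparison.Ordinal);
                 i >= 0;
                 i = text.IndexOf(delimiter, i + delimiter.Length, StringComparison.Ordinal))
            {
                int cut = i + delimiter.Length;
                // Ignore a cut at the very end: it would leave an empty right part.
                if (cut < text.Length &&
                    (bestCut < 0 || Math.Abs(cut - mid) < Math.Abs(bestCut - mid)))
                    bestCut = cut;
            }
        }

        // Delimiter not usable in this text: recurse with the rest of the list.
        if (bestCut < 0)
            return Split(text, maxChunkSize, delimiters.Skip(1).ToList(), lengthFunc);

        // Delimiter found: split into two sub-texts and recurse on each.
        return Split(text[..bestCut], maxChunkSize, delimiters, lengthFunc)
            .Concat(Split(text[bestCut..], maxChunkSize, delimiters, lengthFunc));
    }
}
```

For example, RecursiveSplitSketch.Split("aaa. bbb. ccc", 5, new[] { ". " }) yields "aaa. ", "bbb. " and "ccc". Because both sub-texts are always non-empty and strictly shorter than the input, the recursion terminates.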

Guarantees

  • The length of each chunk will be less than or equal to the maxChunkSize
    • When a custom length function is used, the length will be measured using the custom function
    • When no custom lengthFunc is used, the length is measured by Func<string, int> lengthFunc = (text) => text.Length
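
One way to check this guarantee in a test is to measure every chunk with the same length function that was used for splitting. The sketch below is an illustration only: ChunkGuarantee, WordCount and AllWithinLimit are hypothetical names, and WordCount is a crude stand-in for a real tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helpers for checking the size guarantee; not part of Chonk.
public static class ChunkGuarantee
{
    // Crude stand-in for a tokenizer: counts whitespace-separated words.
    public static int WordCount(string text) =>
        text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries).Length;

    // True when every chunk measures at most maxChunkSize under lengthFunc
    // (string.Length when no custom function is supplied).
    public static bool AllWithinLimit(
        IEnumerable<string> chunkTexts,
        int maxChunkSize,
        Func<string, int>? lengthFunc = null)
    {
        lengthFunc ??= text => text.Length;
        return chunkTexts.All(text => lengthFunc(text) <= maxChunkSize);
    }
}
```

Passing ChunkGuarantee.WordCount as the lengthFunc would bound chunks by word count rather than character count; the check has to use the same function, since a chunk that holds 9 words can be far longer than 9 characters.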

Getting started

Chonk is available for download as a NuGet package.

dotnet add package Chonk

Documentation

The Chonk class contains utility methods for chunking text into IEnumerable<Chunk>s.

var document =
    "This is the string that we want to split. We want to try to split it into sentences if possible, but this sentence is long.";

var chunks = Chonk.Chunk(document, maxChunkSize: 50).ToList();
foreach (var chunk in chunks)
{
    Console.WriteLine($"{chunk.text} (starts at index {chunk.startingPos})");
}

// This is the string that we want to split. (starts at index 0)
// We want to try to split  (starts at index 42)
// it into sentences if possible, (starts at index 66)
// but this sentence is long. (starts at index 97)

You can pass a custom Func<string, int> function for measuring the length of a string:

var document =
    "This is the string that we want to split. We want to try to split it into sentences if possible, but this sentence is long.";

var chunksCustomLength = Chonk.Chunk(document, maxChunkSize: 30, lengthFunc: str => str.Length / 4).ToList();

foreach (var chunk in chunksCustomLength)
{
    Console.WriteLine($"{chunk.text} (starts at index {chunk.startingPos})");
}

// This is the string that we want to split. (starts at index 0)
// We want to try to split it into sentences if possible, but this sentence is long. (starts at index 42)

All string comparisons (such as when finding delimiters in the text) are done using StringComparison.Ordinal.
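
As a small illustration of what ordinal comparison means for delimiter lookup, the sketch below uses the standard string.IndexOf overload that takes a StringComparison. The OrdinalDemo class and its Find method are hypothetical and only show the behaviour; they are not taken from Chonk's source.

```csharp
using System;

// Hypothetical demo of ordinal delimiter lookup; not part of Chonk.
public static class OrdinalDemo
{
    // Ordinal search compares raw UTF-16 code units: it is case-sensitive
    // and does not depend on the current culture.
    public static int Find(string text, string delimiter) =>
        text.IndexOf(delimiter, StringComparison.Ordinal);
}
```

With this helper, OrdinalDemo.Find("Intro.\n\nBody", "\n\n") returns 6, while OrdinalDemo.Find("a AND b", " and ") returns -1 because the case differs. In practice this means custom delimiters need to match the text exactly, including case.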

Benefits

TODO

Benchmarks

Chonk includes code for benchmarking using BenchmarkDotNet.

dotnet run --project Chonk.Benchmark -c Release

Product: compatible and additional computed target framework versions

  • .NET: net5.0, net5.0-windows, net6.0, net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos and net10.0-windows were computed.
  • .NET Core: netcoreapp3.0 and netcoreapp3.1 were computed.
  • .NET Standard: netstandard2.1 is compatible.
  • MonoAndroid: monoandroid was computed.
  • MonoMac: monomac was computed.
  • MonoTouch: monotouch was computed.
  • Tizen: tizen60 was computed.
  • Xamarin.iOS: xamarinios was computed.
  • Xamarin.Mac: xamarinmac was computed.
  • Xamarin.TVOS: xamarintvos was computed.
  • Xamarin.WatchOS: xamarinwatchos was computed.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version  Downloads  Last Updated
0.0.4    280        8/29/2023
0.0.2    253        8/27/2023
0.0.1    277        8/23/2023