RecursiveTextSplitter 1.0.2
RecursiveTextSplitter User Guide
Overview
The RecursiveTextSplitter is a C# library that provides intelligent text splitting functionality with semantic awareness. Unlike simple character-based splitting, this library attempts to preserve meaningful boundaries by using a hierarchical approach to text segmentation, from paragraph breaks down to character-level splitting as a last resort.
Key Features
- Semantic Awareness: Maintains natural text boundaries (paragraphs, sentences, words)
- Configurable Overlap: Supports overlapping chunks for better context preservation
- Flexible Separators: Allows custom separator hierarchies or uses intelligent defaults
- Detailed Metadata: Provides comprehensive information about each chunk including position data and line/column tracking
- Word-Safe Overlap: Ensures overlap occurs at natural word boundaries
- Position Tracking: Tracks both character positions and line/column coordinates in the original text
Installation
Via NuGet Package Manager
Install the RecursiveTextSplitter package from NuGet:
dotnet add package RecursiveTextSplitter
Or via Package Manager Console in Visual Studio:
Install-Package RecursiveTextSplitter
Or search for "RecursiveTextSplitter" in the Visual Studio NuGet Package Manager UI.
NuGet Package: https://www.nuget.org/packages/RecursiveTextSplitter/
Usage
Add the namespace to your C# project:
using RecursiveTextSplitting;
Basic Usage
Simple Text Splitting
The most straightforward way to split text is to use the RecursiveSplit extension method:
string document = "Artificial intelligence is transforming every industry.\nFrom healthcare to finance, automation is becoming smarter and more adaptive.\n\nHowever, challenges like bias, interpretability, and safety remain important areas of research.";
var chunks = document.RecursiveSplit(chunkSize: 80, chunkOverlap: 0);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk: {chunk}");
    Console.WriteLine("---");
}
Advanced Splitting with Metadata
For more detailed information about each chunk, including line and column positions, use the AdvancedRecursiveSplit method:
string document = "Artificial intelligence is transforming every industry.\nFrom healthcare to finance, automation is becoming smarter and more adaptive.\n\nHowever, challenges like bias, interpretability, and safety remain important areas of research.";
var chunks = document.AdvancedRecursiveSplit(chunkSize: 80, chunkOverlap: 0);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}: {chunk.Text}");
    Console.WriteLine($"Start Position: {chunk.StartPosition} (Line {chunk.StartLine}, Column {chunk.StartColumn})");
    Console.WriteLine($"End Position: {chunk.EndPosition} (Line {chunk.EndLine}, Column {chunk.EndColumn})");
    Console.WriteLine($"Separator Used: {chunk.SeparatorUsed}");
    Console.WriteLine("---");
}
Working with Overlap
Overlap allows consecutive chunks to share some content, which is particularly useful for maintaining context in applications like search indexing or machine learning.
Basic Overlap Example
string document = "Artificial intelligence is transforming every industry.\nFrom healthcare to finance, automation is becoming smarter and more adaptive.\n\nHowever, challenges like bias, interpretability, and safety remain important areas of research.";
// Split with 25 characters of overlap
var chunks = document.RecursiveSplit(chunkSize: 80, chunkOverlap: 25);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk: {chunk}");
    Console.WriteLine("---");
}
Advanced Overlap with Metadata
string document = "Artificial intelligence is transforming every industry.\nFrom healthcare to finance, automation is becoming smarter and more adaptive.\n\nHowever, challenges like bias, interpretability, and safety remain important areas of research.";
var chunks = document.AdvancedRecursiveSplit(chunkSize: 80, chunkOverlap: 25);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}:");
    Console.WriteLine($" Full Text: {chunk.Text}");
    Console.WriteLine($" Overlap: '{chunk.OverlapText}'");
    Console.WriteLine($" Original Content: '{chunk.ChunkText}'");
    Console.WriteLine($" Position: {chunk.StartPosition}-{chunk.EndPosition}");
    Console.WriteLine($" Location: Lines {chunk.StartLine}-{chunk.EndLine}");
    Console.WriteLine("---");
}
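The guide says overlap is taken at natural word boundaries but does not show how. The following is a minimal sketch of one way this could work, not the library's actual implementation: the helper name WordSafeTail is hypothetical, and it assumes the overlap is the tail of the previous chunk, shortened so that it never starts in the middle of a word.

```csharp
using System;

// Hypothetical helper, not part of RecursiveTextSplitter: returns at most
// maxOverlap trailing characters of `previous`, trimmed so that the result
// does not begin in the middle of a word.
static string WordSafeTail(string previous, int maxOverlap)
{
    if (maxOverlap <= 0 || previous.Length == 0) return string.Empty;
    if (previous.Length <= maxOverlap) return previous;

    int start = previous.Length - maxOverlap;

    // If the cut lands inside a word, move forward to the end of that word.
    if (!char.IsWhiteSpace(previous[start - 1]))
    {
        while (start < previous.Length && !char.IsWhiteSpace(previous[start])) start++;
    }

    // Skip the whitespace so the overlap starts on a word.
    while (start < previous.Length && char.IsWhiteSpace(previous[start])) start++;

    return previous.Substring(start);
}

string previousChunk = "Artificial intelligence is transforming every industry.";
string overlap = WordSafeTail(previousChunk, 25);
Console.WriteLine($"Overlap: '{overlap}'"); // prints "Overlap: 'every industry.'"
```

A raw 25-character tail would begin inside "transforming"; the sketch drops the partial word and keeps only the whole words that fit, so the overlap can be shorter than the requested size.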
Understanding the TextChunk Class
The TextChunk class provides comprehensive metadata about each split segment:
public class TextChunk
{
    public string Text { get; set; }          // Complete text including overlap
    public string OverlapText { get; set; }   // Only the overlap portion
    public string ChunkText { get; set; }     // Original chunk without overlap
    public int StartPosition { get; set; }    // 1-based start position in original text
    public int EndPosition { get; set; }      // 1-based end position in original text
    public string SeparatorUsed { get; set; } // Separator that created this chunk
    public int ChunkIndex { get; set; }       // Sequential chunk number (1-based)
    public int StartColumn { get; set; }      // 1-based column where chunk starts
    public int StartLine { get; set; }        // 1-based line where chunk starts
    public int EndColumn { get; set; }        // 1-based column where chunk ends
    public int EndLine { get; set; }          // 1-based line where chunk ends
}
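Because the positions are 1-based, mapping them back onto a C# string (which is 0-based) needs an offset. The sketch below shows that arithmetic with a hypothetical helper, SliceByPosition, which is not part of the library. It assumes the end position is inclusive, which the guide does not state explicitly, so verify this against real output before relying on it.

```csharp
using System;

// Hypothetical helper, not part of RecursiveTextSplitter.
// Assumption: both positions are 1-based and the end position is inclusive.
static string SliceByPosition(string source, int startPosition, int endPosition)
{
    if (startPosition < 1 || endPosition > source.Length || endPosition < startPosition)
        throw new ArgumentOutOfRangeException(nameof(startPosition));

    // Convert the 1-based start to a 0-based index; the length counts both ends.
    return source.Substring(startPosition - 1, endPosition - startPosition + 1);
}

string original = "Hello brave new world";
Console.WriteLine(SliceByPosition(original, 7, 11)); // prints "brave"
```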
Position Tracking Features
The library now provides detailed position tracking with both character-level and line/column coordinates:
- Character Positions: StartPosition and EndPosition provide 1-based character indices in the original text
- Line/Column Tracking: StartLine, StartColumn, EndLine, and EndColumn provide 1-based line and column coordinates
- Comprehensive Coverage: All positions are tracked accurately even when overlap is applied
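To illustrate how a 1-based character position relates to 1-based line and column coordinates, here is a small sketch. ToLineColumn is a hypothetical helper, not a library method, and it assumes that only "\n" starts a new line (a preceding "\r" is counted as an ordinary column); the library's handling of "\r\n" may differ.

```csharp
using System;

// Hypothetical helper, not part of RecursiveTextSplitter: converts a 1-based
// character position into 1-based (line, column) coordinates.
// Assumption: '\n' ends a line; '\r' is treated as an ordinary character.
static (int Line, int Column) ToLineColumn(string text, int position)
{
    if (position < 1 || position > text.Length)
        throw new ArgumentOutOfRangeException(nameof(position));

    int line = 1, column = 1;

    // Walk every character that comes before the target position.
    for (int i = 0; i < position - 1; i++)
    {
        if (text[i] == '\n') { line++; column = 1; }
        else { column++; }
    }
    return (line, column);
}

string sample = "first line\nsecond line";
(int row, int col) = ToLineColumn(sample, 12); // position 12 is the 's' of "second"
Console.WriteLine($"Line {row}, Column {col}"); // prints "Line 2, Column 1"
```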
Custom Separators
You can provide your own separator hierarchy for specialized splitting needs:
string document = "Section 1|Subsection A;Item 1,Item 2|Section 2;Item 3";
// Custom separators prioritizing sections, then subsections, then items
string[] customSeparators = { "|", ";", "," };
var chunks = document.AdvancedRecursiveSplit(
    chunkSize: 20,
    chunkOverlap: 0,
    separators: customSeparators
);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk: {chunk.Text}");
    Console.WriteLine($"Split using: {chunk.SeparatorUsed}");
    Console.WriteLine($"At line {chunk.StartLine}, column {chunk.StartColumn}");
    Console.WriteLine("---");
}
Separator Hierarchy
The library uses a hierarchical approach to splitting, trying larger semantic units first:
- Paragraph breaks ("\r\n\r\n", "\n\n") - Largest semantic units
- Sentence endings with line breaks (".\r\n", "!\r\n", "?\r\n", ":\r\n", ";\r\n")
- Single line breaks ("\r\n")
- Sentence endings with newlines (".\n", "!\n", "?\n", ":\n", ";\n")
- Single newlines ("\n")
- Sentence endings with spaces (". ", "! ", "? ")
- Punctuation with spaces ("; ", ", ")
- Word boundaries (" ")
- Character-by-character ("") - Last resort
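The hierarchy above can be sketched as a short recursive function. This is a simplified illustration, not the library's implementation: SplitSketch is a hypothetical name, overlap and position metadata are left out, and each separator is assumed to stay attached to the end of the piece it terminates.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Simplified sketch of hierarchical splitting; NOT the library's implementation.
// Tries separators in order and falls back to fixed-length cuts as a last resort.
static List<string> SplitSketch(string text, int chunkSize, string[] separators, int level = 0)
{
    var chunks = new List<string>();
    if (text.Length <= chunkSize)
    {
        if (text.Length > 0) chunks.Add(text);
        return chunks;
    }

    // Last resort: no separators left (or an empty separator), so cut by length.
    if (level >= separators.Length || separators[level].Length == 0)
    {
        for (int i = 0; i < text.Length; i += chunkSize)
            chunks.Add(text.Substring(i, Math.Min(chunkSize, text.Length - i)));
        return chunks;
    }

    string sep = separators[level];
    var buffer = new StringBuilder();
    int start = 0;
    while (start < text.Length)
    {
        // Take the next piece, keeping the separator attached to its end.
        int hit = text.IndexOf(sep, start, StringComparison.Ordinal);
        int end = hit < 0 ? text.Length : hit + sep.Length;
        string piece = text.Substring(start, end - start);
        start = end;

        // Merge consecutive pieces while they still fit in one chunk.
        if (buffer.Length + piece.Length <= chunkSize)
        {
            buffer.Append(piece);
            continue;
        }

        if (buffer.Length > 0) { chunks.Add(buffer.ToString()); buffer.Clear(); }

        // A piece that is still too large is retried with the next separator.
        if (piece.Length <= chunkSize) buffer.Append(piece);
        else chunks.AddRange(SplitSketch(piece, chunkSize, separators, level + 1));
    }
    if (buffer.Length > 0) chunks.Add(buffer.ToString());
    return chunks;
}

string outline = "Section 1|Subsection A;Item 1,Item 2|Section 2;Item 3";
foreach (string part in SplitSketch(outline, 20, new[] { "|", ";", "," }))
    Console.WriteLine(part);
```

In this sketch every chunk stays within chunkSize, and concatenating the chunks reproduces the input. The real library may merge, trim, or report separators differently.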
Contributing
We welcome contributions to make RecursiveTextSplitter even better! Here are some ways you can help:
🌟 Star this repository if you find it useful!
Your star helps others discover this library and motivates continued development.
🔧 Pull Requests Welcome
We're open to pull requests! Whether you want to:
- Fix bugs or improve existing functionality
- Add new features or splitting strategies
- Improve documentation or examples
- Optimize performance
- ...
Please feel free to fork the repository and submit a pull request. For larger changes, consider opening an issue first to discuss your approach.
📝 Reporting Issues
Found a bug or have a suggestion? Please open an issue with:
- A clear description of the problem or enhancement
- Steps to reproduce (for bugs)
- Sample code demonstrating the issue
- Expected vs actual behavior
Product | Versions Compatible and additional computed target framework versions. |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
.NET Core | netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
.NET Standard | netstandard2.0 is compatible. netstandard2.1 was computed. |
.NET Framework | net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed. |
MonoAndroid | monoandroid was computed. |
MonoMac | monomac was computed. |
MonoTouch | monotouch was computed. |
Tizen | tizen40 was computed. tizen60 was computed. |
Xamarin.iOS | xamarinios was computed. |
Xamarin.Mac | xamarinmac was computed. |
Xamarin.TVOS | xamarintvos was computed. |
Xamarin.WatchOS | xamarinwatchos was computed. |
- .NETStandard 2.0: No dependencies.
- net8.0: No dependencies.
- net9.0: No dependencies.