MarkdownStructureChunker 1.0.3
Install with the tool of your choice:
- .NET CLI: dotnet add package MarkdownStructureChunker --version 1.0.3
- Package Manager: NuGet\Install-Package MarkdownStructureChunker -Version 1.0.3
- PackageReference: <PackageReference Include="MarkdownStructureChunker" Version="1.0.3" />
- Central package management: <PackageVersion Include="MarkdownStructureChunker" Version="1.0.3" /> together with <PackageReference Include="MarkdownStructureChunker" />
- Paket CLI: paket add MarkdownStructureChunker --version 1.0.3
- Script & Interactive: #r "nuget: MarkdownStructureChunker, 1.0.3"
- File-based apps: #:package MarkdownStructureChunker@1.0.3
- Cake addin: #addin nuget:?package=MarkdownStructureChunker&version=1.0.3
- Cake tool: #tool nuget:?package=MarkdownStructureChunker&version=1.0.3
MarkdownStructureChunker
A powerful .NET library for intelligent document structure analysis and chunking, designed to extract hierarchical content from various document formats with advanced keyword extraction and vectorization capabilities.
Features
- Pattern-Based Structure Recognition: Automatically identifies and parses various document patterns including Markdown headings, numeric outlines, legal sections, and appendices
- Hierarchical Content Organization: Maintains parent-child relationships between document sections for contextual understanding
- Advanced Keyword Extraction: Supports both simple frequency-based and ML.NET-powered keyword extraction
- ONNX Vectorization: Integration with the intfloat/multilingual-e5-large model for semantic embeddings
- Extensible Architecture: Plugin-based design allows for custom chunking strategies and extractors
- Comprehensive Testing: 66+ unit and integration tests ensuring reliability
Quick Start
Installation
Via NuGet (Recommended)
dotnet add package MarkdownStructureChunker
Via Source Code
# Clone the repository
git clone https://github.com/DevelApp-ai/MarkdownStructureChunker.git
cd MarkdownStructureChunker
# Build the solution
dotnet build
# Run tests
dotnet test
Basic Usage
using MarkdownStructureChunker.Core;
using MarkdownStructureChunker.Core.Extractors;
using MarkdownStructureChunker.Core.Strategies;
// Create chunking strategy and keyword extractor
var strategy = new PatternBasedStrategy(PatternBasedStrategy.CreateDefaultRules());
var extractor = new SimpleKeywordExtractor();
// Initialize the chunker
var chunker = new StructureChunker(strategy, extractor);
// Process a document
var document = @"
# Introduction
This document introduces machine learning concepts.
## Background
Machine learning is a subset of artificial intelligence.
### Applications
ML has numerous applications in various industries.
";
var result = await chunker.ProcessAsync(document, "ml-guide");
// Access the structured chunks
foreach (var chunk in result.Chunks)
{
Console.WriteLine($"Level {chunk.Level}: {chunk.CleanTitle}");
Console.WriteLine($"Keywords: {string.Join(", ", chunk.Keywords)}");
Console.WriteLine($"Content: {chunk.Content.Substring(0, Math.Min(100, chunk.Content.Length))}...");
Console.WriteLine();
}
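For the sample document above, the output looks roughly like this (illustrative only; the exact keywords depend on which extractor you plug in):
Level 1: Introduction
Keywords: document, machine, learning, concepts
Content: This document introduces machine learning concepts....

Level 2: Background
Keywords: machine, learning, artificial, intelligence
Content: Machine learning is a subset of artificial intelligence....

Level 3: Applications
Keywords: applications, industries
Content: ML has numerous applications in various industries....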
Supported Document Patterns
Markdown Headings
# Level 1 Heading
## Level 2 Heading
### Level 3 Heading
#### Level 4 Heading
##### Level 5 Heading
###### Level 6 Heading
Numeric Outlines
1. First Level
1.1 Second Level
1.1.1 Third Level
1.2 Another Second Level
2. Another First Level
Legal Sections
§ 42 Compliance Requirements
§ 43 Data Protection Standards
Appendices
Appendix A: Technical Specifications
Appendix B: Reference Materials
Letter Outlines
A. First Section
B. Second Section
C. Third Section
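These patterns can appear together in a single document. A short sketch, using only the API shown in the Quick Start above, of how the default rules handle a mixed document:
var mixedChunker = new StructureChunker(
    new PatternBasedStrategy(PatternBasedStrategy.CreateDefaultRules()),
    new SimpleKeywordExtractor());
var mixedDocument = @"
# Overview
General description of the system.
1. Requirements
The system shall parse Markdown input.
§ 42 Compliance Requirements
Records must be retained for five years.
Appendix A: Glossary
Definitions of the terms used above.
";
var mixedResult = await mixedChunker.ProcessAsync(mixedDocument, "mixed-sample");
foreach (var chunk in mixedResult.Chunks)
{
    // Each recognized heading, outline entry, legal section, or appendix becomes its own chunk
    Console.WriteLine($"Level {chunk.Level}: {chunk.CleanTitle}");
}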
Architecture
The library follows a modular architecture with clear separation of concerns:
MarkdownStructureChunker.Core/
├── Models/
│ ├── ChunkNode.cs # Individual chunk data structure
│ ├── DocumentGraph.cs # Complete document structure
│ └── ChunkingRule.cs # Pattern matching rules
├── Interfaces/
│ ├── IChunkingStrategy.cs # Strategy pattern interface
│ ├── IKeywordExtractor.cs # Keyword extraction interface
│ └── ILocalVectorizer.cs # Vectorization interface
├── Strategies/
│ └── PatternBasedStrategy.cs # Default pattern-based implementation
├── Extractors/
│ ├── SimpleKeywordExtractor.cs # Frequency-based extraction
│ └── MLNetKeywordExtractor.cs # ML.NET-powered extraction
├── Vectorizers/
│ └── OnnxVectorizer.cs # ONNX model integration
└── StructureChunker.cs # Main orchestrator class
Advanced Usage
Custom Chunking Rules
// Create custom rules for specific document patterns
var customRules = new List<ChunkingRule>
{
new ChunkingRule("CustomHeader", @"^SECTION\s+(\d+):\s+(.*)", level: 1, priority: 0),
new ChunkingRule("Subsection", @"^(\d+\.\d+)\s+(.*)", priority: 10),
// Add more custom patterns as needed
};
var strategy = new PatternBasedStrategy(customRules);
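Custom rules can also be merged with the defaults, so in-house conventions and standard Markdown are recognized in the same pass (a sketch; matching precedence follows the priority values described under Configuration below):
// Combine custom patterns with the built-in rule set
var combinedRules = new List<ChunkingRule>(customRules);
combinedRules.AddRange(PatternBasedStrategy.CreateDefaultRules());

var combinedChunker = new StructureChunker(
    new PatternBasedStrategy(combinedRules),
    new SimpleKeywordExtractor());
var combinedResult = await combinedChunker.ProcessAsync(
    "SECTION 1: Scope\nThis section defines the scope.", "custom-doc");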
ML.NET Keyword Extraction
// Use ML.NET for more sophisticated keyword extraction
using var mlExtractor = new MLNetKeywordExtractor();
var chunker = new StructureChunker(strategy, mlExtractor);
var result = await chunker.ProcessAsync(document, "doc-id");
ONNX Vectorization
// Initialize with ONNX model for semantic embeddings
using var vectorizer = OnnxVectorizerFactory.CreateDefault();
// Vectorize chunk content with context
var enrichedContent = OnnxVectorizer.EnrichContentWithContext(
chunk.Content,
GetAncestralTitles(chunk)
);
var embedding = await vectorizer.VectorizeAsync(enrichedContent, isQuery: false);
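GetAncestralTitles above stands in for your own helper that collects the titles of a chunk's parent sections. Once chunks are embedded, a typical next step is ranking them against a query embedding with cosine similarity; a minimal sketch, assuming VectorizeAsync returns a plain float array (check the actual return type of ILocalVectorizer in your version):
// Embed a search query (note isQuery: true) and score it against the chunk embedding
var queryEmbedding = await vectorizer.VectorizeAsync("How is the data protected?", isQuery: true);

static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0f, normA = 0f, normB = 0f;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

var similarity = CosineSimilarity(queryEmbedding, embedding);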
Configuration
Default Chunking Rules
The library comes with pre-configured rules that handle common document patterns:
- Markdown Headings (Priority 0-6):
# ## ### #### ##### ######
- Numeric Outlines (Priority 10):
1. 1.1 1.1.1 2.3.4.5
- Legal Sections (Priority 20):
§ 42 Section Title
- Appendices (Priority 30):
Appendix A: Title
- Letter Outlines (Priority 40):
A. B. C.
Keyword Extraction Options
// Simple extractor with custom parameters
var simpleExtractor = new SimpleKeywordExtractor();
var keywords = await simpleExtractor.ExtractKeywordsAsync(text, maxKeywords: 10);
// ML.NET extractor with advanced processing
using var mlExtractor = new MLNetKeywordExtractor();
var advancedKeywords = await mlExtractor.ExtractKeywordsAsync(text, maxKeywords: 15);
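Both extractors sit behind the IKeywordExtractor interface, so swapping one for the other does not affect the rest of the pipeline. A quick way to compare them on your own text (a sketch; keyword quality and ordering will differ between the two):
var sampleText = "Machine learning models require large, well-labelled datasets for training.";
var simpleKeywords = await simpleExtractor.ExtractKeywordsAsync(sampleText, maxKeywords: 5);
var mlKeywords = await mlExtractor.ExtractKeywordsAsync(sampleText, maxKeywords: 5);

Console.WriteLine($"Simple: {string.Join(", ", simpleKeywords)}");
Console.WriteLine($"ML.NET: {string.Join(", ", mlKeywords)}");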
Performance Considerations
- Memory Usage: The library processes documents in memory. For very large documents (>10MB), consider splitting the input first (see the sketch after this list)
- ML.NET Performance: First-time initialization of ML.NET components may take 1-2 seconds
- ONNX Model Loading: Loading the multilingual-e5-large model requires roughly 500 MB of RAM and 2-3 seconds of initialization time
- Concurrent Processing: All components are thread-safe and support concurrent document processing
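For the large-document case noted above, one pragmatic option is to split the raw text on top-level headings and process each part separately. A rough sketch; the splitting heuristic is an assumption, not part of the library:
using System.Linq;
using System.Text.RegularExpressions;

public static async Task ProcessLargeDocumentAsync(StructureChunker chunker, string text, string documentId)
{
    // Naive split on top-level Markdown headings; adapt to your document's conventions
    var parts = Regex.Split(text, @"(?=^# )", RegexOptions.Multiline)
        .Where(p => !string.IsNullOrWhiteSpace(p))
        .ToList();

    for (var i = 0; i < parts.Count; i++)
    {
        var partResult = await chunker.ProcessAsync(parts[i], $"{documentId}-part-{i}");
        // Merge or persist the per-part results as needed
    }
}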
Integration Examples
ASP.NET Core Web API
[ApiController]
[Route("api/[controller]")]
public class DocumentController : ControllerBase
{
private readonly StructureChunker _chunker;
public DocumentController(StructureChunker chunker)
{
_chunker = chunker;
}
[HttpPost("analyze")]
public async Task<IActionResult> AnalyzeDocument([FromBody] DocumentRequest request)
{
try
{
var result = await _chunker.ProcessAsync(request.Content, request.DocumentId);
return Ok(result);
}
catch (Exception ex)
{
return BadRequest($"Error processing document: {ex.Message}");
}
}
}
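DocumentRequest is not a library type; it is whatever DTO your API binds from the request body. A minimal shape matching the controller above:
public record DocumentRequest(string DocumentId, string Content);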
Dependency Injection Setup
// Program.cs or Startup.cs
services.AddSingleton<IChunkingStrategy>(provider =>
new PatternBasedStrategy(PatternBasedStrategy.CreateDefaultRules()));
services.AddSingleton<IKeywordExtractor, MLNetKeywordExtractor>();
services.AddSingleton<StructureChunker>();
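Because the container owns these singletons, it also disposes MLNetKeywordExtractor on shutdown (note the using statements in the standalone examples above). If the ML.NET startup cost is a concern, the frequency-based extractor can be swapped in with a one-line change:
// Lighter alternative with no ML.NET initialization cost
services.AddSingleton<IKeywordExtractor, SimpleKeywordExtractor>();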
Batch Processing
public async Task ProcessDocumentBatch(IEnumerable<string> documents)
{
var tasks = documents.Select(async (doc, index) =>
{
var result = await chunker.ProcessAsync(doc, $"doc-{index}");
return result;
});
var results = await Task.WhenAll(tasks);
// Process results...
}
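For large batches, an unbounded Task.WhenAll starts every document at once, which can spike memory since documents are processed in memory. A sketch that caps concurrency with SemaphoreSlim:
public async Task ProcessDocumentBatchThrottled(IEnumerable<string> documents, int maxConcurrency = 4)
{
    using var gate = new SemaphoreSlim(maxConcurrency);
    var tasks = documents.Select(async (doc, index) =>
    {
        await gate.WaitAsync();
        try
        {
            return await chunker.ProcessAsync(doc, $"doc-{index}");
        }
        finally
        {
            gate.Release();
        }
    });

    var results = await Task.WhenAll(tasks);
    // Process results...
}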
Error Handling
The library provides comprehensive error handling:
try
{
var result = await chunker.ProcessAsync(document, documentId);
}
catch (ArgumentException ex)
{
// Handle invalid input parameters
Console.WriteLine($"Invalid input: {ex.Message}");
}
catch (InvalidOperationException ex)
{
// Handle processing errors
Console.WriteLine($"Processing error: {ex.Message}");
}
catch (Exception ex)
{
// Handle unexpected errors
Console.WriteLine($"Unexpected error: {ex.Message}");
}
Testing
The library includes comprehensive test coverage:
# Run all tests
dotnet test
# Run with coverage
dotnet test --collect:"XPlat Code Coverage"
# Run specific test category
dotnet test --filter Category=Integration
Test categories:
- Unit Tests: Individual component testing
- Integration Tests: End-to-end workflow testing
- Performance Tests: Benchmarking and load testing
Contributing
- Fork the repository
- Create a feature branch:
git checkout -b feature/your-feature
- Make your changes and add tests
- Ensure all tests pass:
dotnet test
- Commit your changes:
git commit -m "Add your feature"
- Push to the branch:
git push origin feature/your-feature
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Roadmap
- Support for custom ONNX models
- Performance optimizations for large documents
- Additional language support for keyword extraction
Support
For questions, issues, or contributions, please:
- Open an issue on GitHub
- Check the documentation
- Review the examples
MarkdownStructureChunker - Intelligent document structure analysis for modern applications.
Compatibility
- .NET: net8.0 is compatible; net9.0, net10.0, and the platform-specific TFMs (android, browser, ios, maccatalyst, macos, tvos, windows) for net8.0 through net10.0 are computed as compatible.
Dependencies (net8.0)
- Microsoft.ML (>= 4.0.2)
- Microsoft.ML.OnnxRuntime (>= 1.22.1)
- Microsoft.ML.Tokenizers (>= 1.0.2)
- System.Numerics.Tensors (>= 9.0.7)
Intelligent document structure analysis and chunking library for .NET
- Pattern-based document structure recognition
- Hierarchical chunk organization with parent-child relationships
- Multiple keyword extraction strategies (Simple and ML.NET)
- ONNX vectorization framework for semantic embeddings
- Support for Markdown, numeric, legal, and appendix patterns
- Comprehensive test suite with 66 test cases
- Production-ready with proper error handling and resource management