MarkdownStructureChunker 1.0.3
Install with the tool of your choice:
- .NET CLI: dotnet add package MarkdownStructureChunker --version 1.0.3
- Package Manager: NuGet\Install-Package MarkdownStructureChunker -Version 1.0.3
- PackageReference: <PackageReference Include="MarkdownStructureChunker" Version="1.0.3" />
- Central package management: <PackageVersion Include="MarkdownStructureChunker" Version="1.0.3" /> together with <PackageReference Include="MarkdownStructureChunker" />
- Paket CLI: paket add MarkdownStructureChunker --version 1.0.3
- Script & Interactive: #r "nuget: MarkdownStructureChunker, 1.0.3"
- File-based apps: #:package MarkdownStructureChunker@1.0.3
- Cake addin: #addin nuget:?package=MarkdownStructureChunker&version=1.0.3
- Cake tool: #tool nuget:?package=MarkdownStructureChunker&version=1.0.3
MarkdownStructureChunker
A powerful .NET library for intelligent document structure analysis and chunking, designed to extract hierarchical content from various document formats with advanced keyword extraction and vectorization capabilities.
Features
- Pattern-Based Structure Recognition: Automatically identifies and parses various document patterns including Markdown headings, numeric outlines, legal sections, and appendices
- Hierarchical Content Organization: Maintains parent-child relationships between document sections for contextual understanding
- Advanced Keyword Extraction: Supports both simple frequency-based and ML.NET-powered keyword extraction
- ONNX Vectorization: Integration with the intfloat/multilingual-e5-large model for semantic embeddings
- Extensible Architecture: Plugin-based design allows for custom chunking strategies and extractors
- Comprehensive Testing: 66+ unit and integration tests ensuring reliability
Quick Start
Installation
Via NuGet (Recommended)
dotnet add package MarkdownStructureChunker
Via Source Code
# Clone the repository
git clone https://github.com/DevelApp-ai/MarkdownStructureChunker.git
cd MarkdownStructureChunker
# Build the solution
dotnet build
# Run tests
dotnet test
Basic Usage
using MarkdownStructureChunker.Core;
using MarkdownStructureChunker.Core.Extractors;
using MarkdownStructureChunker.Core.Strategies;
// Create chunking strategy and keyword extractor
var strategy = new PatternBasedStrategy(PatternBasedStrategy.CreateDefaultRules());
var extractor = new SimpleKeywordExtractor();
// Initialize the chunker
var chunker = new StructureChunker(strategy, extractor);
// Process a document
var document = @"
# Introduction
This document introduces machine learning concepts.
## Background
Machine learning is a subset of artificial intelligence.
### Applications
ML has numerous applications in various industries.
";
var result = await chunker.ProcessAsync(document, "ml-guide");
// Access the structured chunks
foreach (var chunk in result.Chunks)
{
Console.WriteLine($"Level {chunk.Level}: {chunk.CleanTitle}");
Console.WriteLine($"Keywords: {string.Join(", ", chunk.Keywords)}");
Console.WriteLine($"Content: {chunk.Content.Substring(0, Math.Min(100, chunk.Content.Length))}...");
Console.WriteLine();
}
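For the sample document above, the output looks roughly like this (illustrative only; the exact keywords depend on which extractor you plug in):
Level 1: Introduction
Keywords: document, machine, learning, concepts
Content: This document introduces machine learning concepts....

Level 2: Background
Keywords: machine, learning, artificial, intelligence
Content: Machine learning is a subset of artificial intelligence....

Level 3: Applications
Keywords: applications, industries
Content: ML has numerous applications in various industries....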
Supported Document Patterns
Markdown Headings
# Level 1 Heading
## Level 2 Heading
### Level 3 Heading
#### Level 4 Heading
##### Level 5 Heading
###### Level 6 Heading
Numeric Outlines
1. First Level
1.1 Second Level
1.1.1 Third Level
1.2 Another Second Level
2. Another First Level
Legal Sections
§ 42 Compliance Requirements
§ 43 Data Protection Standards
Appendices
Appendix A: Technical Specifications
Appendix B: Reference Materials
Letter Outlines
A. First Section
B. Second Section
C. Third Section
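These patterns can appear together in a single document. A short sketch, using only the API shown in the Quick Start above, of how the default rules handle a mixed document:
var mixedChunker = new StructureChunker(
    new PatternBasedStrategy(PatternBasedStrategy.CreateDefaultRules()),
    new SimpleKeywordExtractor());
var mixedDocument = @"
# Overview
General description of the system.
1. Requirements
The system shall parse Markdown input.
§ 42 Compliance Requirements
Records must be retained for five years.
Appendix A: Glossary
Definitions of the terms used above.
";
var mixedResult = await mixedChunker.ProcessAsync(mixedDocument, "mixed-sample");
foreach (var chunk in mixedResult.Chunks)
{
    // Each recognized heading, outline entry, legal section, or appendix becomes its own chunk
    Console.WriteLine($"Level {chunk.Level}: {chunk.CleanTitle}");
}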
Architecture
The library follows a modular architecture with clear separation of concerns:
MarkdownStructureChunker.Core/
├── Models/
│ ├── ChunkNode.cs # Individual chunk data structure
│ ├── DocumentGraph.cs # Complete document structure
│ └── ChunkingRule.cs # Pattern matching rules
├── Interfaces/
│ ├── IChunkingStrategy.cs # Strategy pattern interface
│ ├── IKeywordExtractor.cs # Keyword extraction interface
│ └── ILocalVectorizer.cs # Vectorization interface
├── Strategies/
│ └── PatternBasedStrategy.cs # Default pattern-based implementation
├── Extractors/
│ ├── SimpleKeywordExtractor.cs # Frequency-based extraction
│ └── MLNetKeywordExtractor.cs # ML.NET-powered extraction
├── Vectorizers/
│ └── OnnxVectorizer.cs # ONNX model integration
└── StructureChunker.cs # Main orchestrator class
Advanced Usage
Custom Chunking Rules
// Create custom rules for specific document patterns
var customRules = new List<ChunkingRule>
{
new ChunkingRule("CustomHeader", @"^SECTION\s+(\d+):\s+(.*)", level: 1, priority: 0),
new ChunkingRule("Subsection", @"^(\d+\.\d+)\s+(.*)", priority: 10),
// Add more custom patterns as needed
};
var strategy = new PatternBasedStrategy(customRules);
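Custom rules can also be merged with the defaults, so in-house conventions and standard Markdown are recognized in the same pass (a sketch; matching precedence follows the priority values described under Configuration below):
// Combine custom patterns with the built-in rule set
var combinedRules = new List<ChunkingRule>(customRules);
combinedRules.AddRange(PatternBasedStrategy.CreateDefaultRules());

var combinedChunker = new StructureChunker(
    new PatternBasedStrategy(combinedRules),
    new SimpleKeywordExtractor());
var combinedResult = await combinedChunker.ProcessAsync(
    "SECTION 1: Scope\nThis section defines the scope.", "custom-doc");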
ML.NET Keyword Extraction
// Use ML.NET for more sophisticated keyword extraction
using var mlExtractor = new MLNetKeywordExtractor();
var chunker = new StructureChunker(strategy, mlExtractor);
var result = await chunker.ProcessAsync(document, "doc-id");
ONNX Vectorization
// Initialize with ONNX model for semantic embeddings
using var vectorizer = OnnxVectorizerFactory.CreateDefault();
// Vectorize chunk content with context
var enrichedContent = OnnxVectorizer.EnrichContentWithContext(
chunk.Content,
GetAncestralTitles(chunk)
);
var embedding = await vectorizer.VectorizeAsync(enrichedContent, isQuery: false);
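GetAncestralTitles above stands in for your own helper that collects the titles of a chunk's parent sections. Once chunks are embedded, a typical next step is ranking them against a query embedding with cosine similarity; a minimal sketch, assuming VectorizeAsync returns a plain float array (check the actual return type of ILocalVectorizer in your version):
// Embed a search query (note isQuery: true) and score it against the chunk embedding
var queryEmbedding = await vectorizer.VectorizeAsync("How is the data protected?", isQuery: true);

static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0f, normA = 0f, normB = 0f;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

var similarity = CosineSimilarity(queryEmbedding, embedding);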
Configuration
Default Chunking Rules
The library comes with pre-configured rules that handle common document patterns:
- Markdown Headings (Priority 0-6):
# ## ### #### ##### ######
- Numeric Outlines (Priority 10):
1. 1.1 1.1.1 2.3.4.5
- Legal Sections (Priority 20):
§ 42 Section Title
- Appendices (Priority 30):
Appendix A: Title
- Letter Outlines (Priority 40):
A. B. C.
Keyword Extraction Options
// Simple extractor with custom parameters
var simpleExtractor = new SimpleKeywordExtractor();
var keywords = await simpleExtractor.ExtractKeywordsAsync(text, maxKeywords: 10);
// ML.NET extractor with advanced processing
using var mlExtractor = new MLNetKeywordExtractor();
var advancedKeywords = await mlExtractor.ExtractKeywordsAsync(text, maxKeywords: 15);
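Both extractors sit behind the IKeywordExtractor interface, so swapping one for the other does not affect the rest of the pipeline. A quick way to compare them on your own text (a sketch; keyword quality and ordering will differ between the two):
var sampleText = "Machine learning models require large, well-labelled datasets for training.";
var simpleKeywords = await simpleExtractor.ExtractKeywordsAsync(sampleText, maxKeywords: 5);
var mlKeywords = await mlExtractor.ExtractKeywordsAsync(sampleText, maxKeywords: 5);

Console.WriteLine($"Simple: {string.Join(", ", simpleKeywords)}");
Console.WriteLine($"ML.NET: {string.Join(", ", mlKeywords)}");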
Performance Considerations
- Memory Usage: The library processes documents in memory. For very large documents (>10MB), consider splitting the input first (see the sketch after this list)
- ML.NET Performance: First-time initialization of ML.NET components may take 1-2 seconds
- ONNX Model Loading: Loading the multilingual-e5-large model requires roughly 500 MB of RAM and 2-3 seconds of initialization time
- Concurrent Processing: All components are thread-safe and support concurrent document processing
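For the large-document case noted above, one pragmatic option is to split the raw text on top-level headings and process each part separately. A rough sketch; the splitting heuristic is an assumption, not part of the library:
using System.Linq;
using System.Text.RegularExpressions;

public static async Task ProcessLargeDocumentAsync(StructureChunker chunker, string text, string documentId)
{
    // Naive split on top-level Markdown headings; adapt to your document's conventions
    var parts = Regex.Split(text, @"(?=^# )", RegexOptions.Multiline)
        .Where(p => !string.IsNullOrWhiteSpace(p))
        .ToList();

    for (var i = 0; i < parts.Count; i++)
    {
        var partResult = await chunker.ProcessAsync(parts[i], $"{documentId}-part-{i}");
        // Merge or persist the per-part results as needed
    }
}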
Integration Examples
ASP.NET Core Web API
[ApiController]
[Route("api/[controller]")]
public class DocumentController : ControllerBase
{
private readonly StructureChunker _chunker;
public DocumentController(StructureChunker chunker)
{
_chunker = chunker;
}
[HttpPost("analyze")]
public async Task<IActionResult> AnalyzeDocument([FromBody] DocumentRequest request)
{
try
{
var result = await _chunker.ProcessAsync(request.Content, request.DocumentId);
return Ok(result);
}
catch (Exception ex)
{
return BadRequest($"Error processing document: {ex.Message}");
}
}
}
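DocumentRequest is not a library type; it is whatever DTO your API binds from the request body. A minimal shape matching the controller above:
public record DocumentRequest(string DocumentId, string Content);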
Dependency Injection Setup
// Program.cs or Startup.cs
services.AddSingleton<IChunkingStrategy>(provider =>
new PatternBasedStrategy(PatternBasedStrategy.CreateDefaultRules()));
services.AddSingleton<IKeywordExtractor, MLNetKeywordExtractor>();
services.AddSingleton<StructureChunker>();
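Because the container owns these singletons, it also disposes MLNetKeywordExtractor on shutdown (note the using statements in the standalone examples above). If the ML.NET startup cost is a concern, the frequency-based extractor can be swapped in with a one-line change:
// Lighter alternative with no ML.NET initialization cost
services.AddSingleton<IKeywordExtractor, SimpleKeywordExtractor>();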
Batch Processing
public async Task ProcessDocumentBatch(IEnumerable<string> documents)
{
var tasks = documents.Select(async (doc, index) =>
{
var result = await chunker.ProcessAsync(doc, $"doc-{index}");
return result;
});
var results = await Task.WhenAll(tasks);
// Process results...
}
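For large batches, an unbounded Task.WhenAll starts every document at once, which can spike memory since documents are processed in memory. A sketch that caps concurrency with SemaphoreSlim:
public async Task ProcessDocumentBatchThrottled(IEnumerable<string> documents, int maxConcurrency = 4)
{
    using var gate = new SemaphoreSlim(maxConcurrency);
    var tasks = documents.Select(async (doc, index) =>
    {
        await gate.WaitAsync();
        try
        {
            return await chunker.ProcessAsync(doc, $"doc-{index}");
        }
        finally
        {
            gate.Release();
        }
    });

    var results = await Task.WhenAll(tasks);
    // Process results...
}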
Error Handling
The library provides comprehensive error handling:
try
{
var result = await chunker.ProcessAsync(document, documentId);
}
catch (ArgumentException ex)
{
// Handle invalid input parameters
Console.WriteLine($"Invalid input: {ex.Message}");
}
catch (InvalidOperationException ex)
{
// Handle processing errors
Console.WriteLine($"Processing error: {ex.Message}");
}
catch (Exception ex)
{
// Handle unexpected errors
Console.WriteLine($"Unexpected error: {ex.Message}");
}
Testing
The library includes comprehensive test coverage:
# Run all tests
dotnet test
# Run with coverage
dotnet test --collect:"XPlat Code Coverage"
# Run specific test category
dotnet test --filter Category=Integration
Test categories:
- Unit Tests: Individual component testing
- Integration Tests: End-to-end workflow testing
- Performance Tests: Benchmarking and load testing
Contributing
- Fork the repository
- Create a feature branch:
git checkout -b feature/your-feature
- Make your changes and add tests
- Ensure all tests pass:
dotnet test
- Commit your changes:
git commit -m "Add your feature"
- Push to the branch:
git push origin feature/your-feature
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Roadmap
- Support for custom ONNX models
- Performance optimizations for large documents
- Additional language support for keyword extraction
Support
For questions, issues, or contributions, please:
- Open an issue on GitHub
- Check the documentation
- Review the examples
MarkdownStructureChunker - Intelligent document structure analysis for modern applications.
Compatibility
- .NET: net8.0 is compatible; net9.0, net10.0, and the platform-specific TFMs (android, browser, ios, maccatalyst, macos, tvos, windows) for net8.0 through net10.0 are computed as compatible.
Dependencies (net8.0)
- Microsoft.ML (>= 4.0.2)
- Microsoft.ML.OnnxRuntime (>= 1.22.1)
- Microsoft.ML.Tokenizers (>= 1.0.2)
- System.Numerics.Tensors (>= 9.0.7)
Intelligent document structure analysis and chunking library for .NET
- Pattern-based document structure recognition
- Hierarchical chunk organization with parent-child relationships
- Multiple keyword extraction strategies (Simple and ML.NET)
- ONNX vectorization framework for semantic embeddings
- Support for Markdown, numeric, legal, and appendix patterns
- Comprehensive test suite with 66 test cases
- Production-ready with proper error handling and resource management