DocIngest.Core
0.0.6
dotnet add package DocIngest.Core --version 0.0.6
NuGet\Install-Package DocIngest.Core -Version 0.0.6
<PackageReference Include="DocIngest.Core" Version="0.0.6" />
<PackageVersion Include="DocIngest.Core" Version="0.0.6" />
<PackageReference Include="DocIngest.Core" />
paket add DocIngest.Core --version 0.0.6
#r "nuget: DocIngest.Core, 0.0.6"
#:package DocIngest.Core@0.0.6
#addin nuget:?package=DocIngest.Core&version=0.0.6
#tool nuget:?package=DocIngest.Core&version=0.0.6
DocIngest
DocIngest is a flexible, modular .NET library designed for automated document ingestion, processing, and delivery. It transforms unstructured documents from various sources—such as drop locations, scanned images, PDFs, and other files—into structured, metadata-enriched outputs. Leveraging AI for intelligent categorization, it enables seamless deployment to systems like CRM (Customer Relationship Management) or DMS (Document Management Systems), relating documents to the appropriate entities based on extracted insights.
Core Idea
The fundamental concept behind DocIngest is to streamline document workflows in enterprise environments. Imagine a scenario where documents arrive in a designated "drop location" (e.g., a network folder with subfolders representing individual documents). These could include:
- Scanned images (multi-page invoices or contracts)
- PDFs, Word documents, or text files
- Mixed file types within a single logical document
DocIngest processes these inputs by:
- Traversal: Scanning the drop location and grouping files into logical documents.
- Processing: Extracting text via OCR for images, combining multi-page documents, and generating unified outputs (e.g., Word or PDF files).
- AI Categorization: Using AI to analyze text and assign categories, tags, and insights (e.g., "Invoice", "Vendor: ABC Corp", "Amount: $500").
- Delivery: Organizing and delivering processed documents to destinations like local folders, CRM systems, or DMS platforms, ensuring they relate to the correct entities (e.g., customer records or project folders).
This pipeline reduces manual effort, minimizes errors, and enables intelligent routing—perfect for automating document-heavy processes in finance, legal, or administrative domains.
Getting Started
Prerequisites
- .NET 10.0 SDK
- Tesseract OCR (download from GitHub)
- Optional: OpenAI API key for AI categorization
Installation
Install the NuGet package: Install-Package DocIngest.Core
For samples, clone the repository and navigate to the sample folder.
Quick Start
Run the InvoiceOrganizer sample: Place invoice files in input/invoices/, run dotnet run in samples/InvoiceOrganizer/, view organized output in output/.
Sample Application
The InvoiceOrganizer sample demonstrates organizing invoices by date. It processes subfolders in input/invoices/ as logical documents, extracts text via OCR, combines multi-page files, parses dates, and delivers organized outputs to output/ (e.g., 2023/01/). See samples/InvoiceOrganizer/Program.cs and samples/InvoiceOrganizer/appsettings.json for implementation.
Flexibility and Extensibility
DocIngest's architecture emphasizes flexibility, allowing you to adapt it to diverse workflows without rewriting core logic. Built on a pipeline pattern with dependency injection (DI), it supports:
Service Replacement: Core services are defined by interfaces, enabling easy swaps. For example:
- Replace the OCR service (default: Tesseract) with Azure Computer Vision or Google Cloud Vision.
- Swap the document generator (default: DocX/PDFsharp) for custom formats or integrations.
- Change the delivery service to push to CRM APIs, DMS systems, or cloud storage (e.g., Azure Blob Storage).
Custom Middleware: The pipeline is composed of middlewares implementing
IPipelineMiddleware. Add, remove, or reorder steps for custom logic:- Insert validation middleware to check document integrity.
- Add encryption or watermarking for security.
- Implement custom AI prompts for domain-specific categorization (e.g., legal clauses).
Pipeline Customization: Use
PipelineBuilderto completely redefine the pipeline. For instance, skip AI categorization for simple workflows or add parallel processing for high-volume scenarios.Configuration and Context: Runtime settings (e.g., output formats, organization criteria) are configurable via
PipelineContext. This shared dictionary allows middlewares to communicate and customize behavior dynamically.
This modularity ensures DocIngest scales from simple file organization to complex, AI-driven enterprise integrations.
NuGet Package Support
DocIngest is distributed as a NuGet package (DocIngest.Core, version 1.0.0), making it easy to integrate into .NET projects. Key features:
- Dependencies: Includes essential libraries like Microsoft.Extensions.AI for AI integration, Tesseract for OCR, DocX/itext7/PDFsharp for document generation, and SixLabors.ImageSharp for image handling.
- Target Framework: Built for .NET 10.0, leveraging modern C# features (nullable references, implicit usings).
- Symbols and Debugging: Includes symbol packages (.snupkg) for enhanced debugging in IDEs like Visual Studio.
- Installation: Add via NuGet Package Manager:
Install-Package DocIngest.Core.
The package supports both development and production use, with tests (xUnit, Moq) ensuring reliability. Samples are included in the repository for hands-on learning.
Usage Examples
Basic Pipeline Setup
using DocIngest.Core;
// Configure services
var ocrService = new TesseractOcrService();
var documentGenerator = new DefaultDocumentGenerator();
var deliveryService = new FolderDeliveryService();
// Build pipeline (without AI for simplicity)
var pipeline = new PipelineBuilder()
.UseMiddleware(new DocumentTraversalMiddleware())
.UseMiddleware(new DocumentProcessingMiddleware(ocrService, documentGenerator))
.UseMiddleware(new DateParsingMiddleware())
.UseMiddleware(new DeliveryMiddleware(deliveryService))
.Build();
// Execute
var context = new PipelineContext
{
["InputPath"] = "input/documents",
["OutputPath"] = "output",
["OrganizationPathFunc"] = (Func<OutputDocument, string>)(doc =>
doc.Tags.FirstOrDefault(t => t.StartsWith("date/"))?.Replace("date/", "") ?? "unknown")
};
await pipeline(context);
Custom Middleware Example
Implement a custom middleware for document validation:
public class ValidationMiddleware : IPipelineMiddleware
{
public async Task InvokeAsync(PipelineContext context, PipelineDelegate next)
{
var documents = context.Get<List<Document>>("Documents");
foreach (var doc in documents)
{
if (doc.Files.Count == 0)
throw new InvalidOperationException("Document has no files.");
}
await next(context);
}
}
// Add to pipeline
var pipeline = new PipelineBuilder()
.UseMiddleware(new ValidationMiddleware())
// ... other middlewares
.Build();
Replacing a Service
Swap the OCR service for a custom implementation:
public class CustomOcrService : IOcrService
{
public async Task<string> ExtractTextAsync(byte[] imageBytes)
{
// Custom OCR logic, e.g., call Azure AI
return "Extracted text...";
}
}
// Inject custom service
var ocrService = new CustomOcrService();
// Use in DocumentProcessingMiddleware
Running the Sample Pipeline
Based on InvoiceOrganizer, configure services and run:
// Configure services (simplified from sample's DI)
var ocrService = new TesseractOcrService();
var documentGenerator = new DefaultDocumentGenerator();
var deliveryService = new FolderDeliveryService();
// Build pipeline
var pipeline = new PipelineBuilder()
.UseMiddleware(new DocumentTraversalMiddleware())
.UseMiddleware(new DocumentProcessingMiddleware(ocrService, documentGenerator))
.UseMiddleware(new DateParsingMiddleware())
.UseMiddleware(new DeliveryMiddleware(deliveryService))
.Build();
// Execute (similar to sample)
var context = new PipelineContext
{
["InputPath"] = "input/invoices",
["OutputPath"] = "output",
["OrganizationPathFunc"] = (Func<OutputDocument, string>)(doc =>
Path.Combine(doc.Tags.FirstOrDefault(t => t.StartsWith("date/"))?.Replace("date/", "") ?? "unknown", doc.Category))
};
await pipeline(context);
Samples
The samples/ folder provides example applications demonstrating DocIngest use cases. Clone the repository to explore and modify them.
Invoice Organization
The InvoiceOrganizer sample processes invoices from a drop folder, using OCR to extract text, parsing dates for organization, and delivering to structured folders. It shows extensibility by allowing service swaps (e.g., replace Tesseract with Azure OCR) and middleware additions (e.g., custom date parsing). Run with dotnet run in samples/InvoiceOrganizer/.
Future Samples
Planned samples include contract management (demonstrating AI categorization for legal clauses) and receipt aggregation (showing multi-page handling and custom organization by vendor/amount).
Getting Started with Samples
After installing DocIngest.Core via NuGet, explore samples to understand integrations. Each sample includes setup instructions in its folder.
Configuration
Configure DocIngest via appsettings.json or environment variables.
- appsettings.json: Set
ImageProcessing:OutputFormat(e.g., "Word", "PDF"). - Environment Variables:
OPENAI_API_KEYfor AI features. - PipelineContext: Use keys like
OrganizationPathFuncfor custom organization logic.
Architecture Overview
- Data Models:
Document(logical document),OutputDocument(processed file with metadata),FileMetadata(file details). - Pipeline: Delegate-based chain of middlewares, executed in reverse order.
PipelineContextis a shared dictionary for passing data (e.g., documents list, output paths) between middlewares. - Services: Interfaces (
IOcrService,IDocumentGenerator,IDeliveryService) for pluggability. - AI Integration: Uses
IChatClientfor categorization prompts, returning JSON with category, tags, and insights.
Building and Testing
- Build: Run
dotnet buildin the root or specific project folders. - Test: Run
dotnet testin DocIngest.Tests/ to execute unit tests. - Package: Run
dotnet packto create NuGet packages.
Contributing
Contributions are welcome! Extend middlewares, add services, or improve AI prompts. Add new samples for additional document types to expand use cases. Ensure tests pass and follow the modular design.
Troubleshooting
- Tesseract Not Found: Ensure Tesseract is installed and in PATH.
- AI API Errors: Check
OPENAI_API_KEYor AI client configuration. - Font Issues in Tests: Install required fonts for PDF generation.
- Enable AI in samples by uncommenting
AiCategorizationMiddlewareand setting the API key.
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
-
net10.0
- Azure.Storage.Blobs (>= 12.27.0)
- DocX (>= 5.0.0)
- Microsoft.Extensions.AI (>= 10.2.0)
- Microsoft.Extensions.Configuration.Abstractions (>= 10.0.2)
- Microsoft.Extensions.Configuration.Binder (>= 10.0.2)
- Microsoft.Extensions.Logging (>= 10.0.2)
- PdfPig (>= 0.1.7)
- PDFsharp (>= 6.2.4)
- SixLabors.ImageSharp (>= 3.1.12)
- Tesseract (>= 5.2.0)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.