Docuoria.Cli 1.0.15

.NET CLI (global tool):
dotnet tool install --global Docuoria.Cli --version 1.0.15

.NET CLI (local tool; run dotnet new tool-manifest first if you are setting up this repo):
dotnet new tool-manifest
dotnet tool install --local Docuoria.Cli --version 1.0.15

Cake:
#tool dotnet:?package=Docuoria.Cli&version=1.0.15

NUKE:
nuke :add-package Docuoria.Cli --version 1.0.15

This package contains a .NET tool you can call from the shell/command line.

Docuoria

Template-driven structured data extraction for PDFs. You bring the PDF and a template; the engine matches, extracts, transforms, and renders structured output.


What this library is

Docuoria is a stateless, in-process .NET 10 library built around a single mental model:

  • Three nouns — Match Rules, Templates, Output Generators
  • Two verbs — Evaluate Match Rule and Execute Template

Match rules decide whether a template applies to a given PDF. A template binds a class of PDFs to a structured output: a root match rule, a data schema, and an ordered pipeline (extraction → optional transforms/retrievals → publish). Output generators render the sealed result into a concrete format.

The engine is deterministic by design: same PDF + same template + same generator options ⇒ same bytes out, every time. No mutable per-call state, no hidden coordination, no surprises in concurrent callers.

The product is layered. This repository delivers Layer 1 — the engine itself:

Layer 3 — Transports & Clients    REST API · .NET SDK · Node.js SDK (future)
Layer 2 — Service                 Accounts · Template Store · Submission lifecycle (future)
Layer 1 — Engine ◄── this library Stateless primitives · Match Rules · Steps · Output Generators

Quickstart

Install the package:

dotnet add package Docuoria

Wire the engine, build a one-field template, and extract — copy-paste runnable:

using Microsoft.Extensions.DependencyInjection;
using Docuoria.Configuration;
using Docuoria.Contracts;
using Docuoria.MatchRules;
using Docuoria.Models;
using Docuoria.Output.Csv;
using Docuoria.Pipeline.Extraction;
using Docuoria.Pipeline.Publish;
using Docuoria.Registration;
using Docuoria.Results;

var services = new ServiceCollection();
services.AddDocuoriaEngine(b => b.AddBuiltInMatchRules().AddCsvOutputGenerator());
var engine = services.BuildServiceProvider().GetRequiredService<IDocuoriaEngine>();

var schema = new RecordDefinition("Invoice", new FieldDefinition[]
{
    new PrimitiveFieldDefinition("vendor", FieldType.String, isRequired: true),
});
var template = TemplateBuilder.Create("quickstart", new DataModel(schema))
    .WithMatchRule<FileNameMatchRule, FileNameMatchRuleConfiguration>(
        new FileNameMatchRuleConfiguration { Pattern = "**/*.pdf", Threshold = 1m })
    .ExtractWith<ExtractionStep, ExtractionStepConfiguration>(new ExtractionStepConfiguration(new IFieldMapping[]
    {
        new FieldMapping("vendor", FieldType.String, MetadataFieldExtractionSource.Standard(MetadataField.Title)),
    }))
    .PublishWith<PublishStep, PublishStepConfiguration>(new PublishStepConfiguration())
    .Build();

await using var pdf = File.OpenRead("invoice.pdf");
var result = await engine.ExecuteTemplateAsync<CsvOutputGenerator, CsvGeneratorOptions>(pdf, template, new CsvGeneratorOptions());
if (result is SucceededResult ok)
    Console.WriteLine(System.Text.Encoding.UTF8.GetString(ok.Output.Payload.Span));

The walkthrough above is verified by ReadmeWalkthroughTests in the SDK test suite, so it cannot silently drift from the live engine surface.


Inspecting a PDF

Template authors often need to see what the engine sees before committing to a match rule or template. The v1.4 InspectAsync API returns a read-only projection of a PDF — page count, info-dictionary metadata, per-page flattened text, raw text blocks with bounds, and capped table previews — without evaluating any rules or templates.

using Microsoft.Extensions.DependencyInjection;
using Docuoria.Configuration;
using Docuoria.Contracts;
using Docuoria.Models;
using Docuoria.Registration;

var services = new ServiceCollection();
services.AddDocuoriaEngine(builder => builder.AddBuiltInMatchRules());
var provider = services.BuildServiceProvider();
var engine = provider.GetRequiredService<IDocuoriaEngine>();

using var pdf = File.OpenRead("invoice.pdf");

// Inspect only page 1; cap each table preview at 5 body rows.
var inspection = await engine.InspectAsync(
    pdf,
    pageFilter: PageFilter.SinglePage(1),
    options: new InspectOptions { MaxTablePreviewRows = 5 });

Console.WriteLine($"Pages: {inspection.PageCount}, Author: {inspection.Metadata.Author}");
foreach (var page in inspection.Pages)
{
    Console.WriteLine($"--- Page {page.PageNumber} ---");
    Console.WriteLine(page.FlattenedText);
    foreach (var table in page.Tables)
        Console.WriteLine($"Table: {table.TotalRowCount} rows total, header = [{string.Join(", ", table.HeaderPreview)}]");
}

InspectAsync is read-only and never throws on adversarial PDF input — unparseable files return an empty result (PageCount = 0).

The companion TestPatternAsync and TestGroupsAsync APIs run a regex against the same flattened haystack and return structured match / gap / per-group diagnostics — useful for iterating on extraction patterns before wiring them into a template.


VS Code Copilot Skill

An LLM agent working in this repo can load the docuoria skill to learn the canonical extraction workflow, the extraction-source decision tree, an illustrative regex library, the failure-mode decision tree, and the local-processing privacy guarantee.

The skill follows the agentskills.io standard and is built as a portable AI plugin package: SKILL.md + references/ + scripts/ + assets/lib/ (bundled SDK DLL) + assets/schemas/ + examples/ + MANIFEST.json (per-file SHA-256). The same package shape is consumed in-repo by Copilot and is the unit we distribute to downstream agents.

  • Source: skills/docuoria/ (flat markdown), scripts/*.csx, src/libs/Docuoria/.
  • Build the package: ./skills/build.ps1 builds the SDK in Release, assembles dist/docuoria/ per the agentskills.io layout, rewrites in-package path references (_common.csx #r, example cross-doc links), generates MANIFEST.json, and mirrors the result to .github/skills/docuoria/ for in-repo Copilot.
  • Drift check: ./skills/build.ps1 -Check re-hashes dist/docuoria/ against its MANIFEST.json and exits non-zero on drift. CI gate.

The skill is documentation + thin CLI — it does not change engine or host behaviour.

Installing the skill into your agent

The Docuoria AI skill ships as both an npm global package and a .NET global tool. Both produce an identical payload — choose whichever runtime is already in your environment.

npm (Node.js ≥ 20):

npm install -g @sidub/docuoria
docuoria init

.NET global tool:

dotnet tool install -g Docuoria.Cli
docuoria init

Both commands launch an interactive tool-picker showing all supported AI tools with their detection status. Select the tools you want, press Enter, and the skill files are scaffolded into the correct directories.

Non-interactive / scripted use:

# Install for specific tools
docuoria init --tools claude,cursor

# Install for all tools
docuoria init --tools all

# Re-apply after a CLI update
docuoria update

# See supported tools and their status
docuoria list-tools

# Check for drift
docuoria doctor

The installer is idempotent — files whose SHA-256 already matches are skipped, and locally modified files are not overwritten unless you pass --force.
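
The skip/keep/write decision described above can be sketched with plain BCL calls. This is an illustration, not the CLI's code: Sha256Hex and Apply are hypothetical helpers, and the idea that a previously recorded hash is available (for example from a manifest) is an assumption.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: hex-encoded SHA-256 of a byte payload.
static string Sha256Hex(byte[] bytes) => Convert.ToHexString(SHA256.HashData(bytes));

// recordedHash is assumed to be the hash stored at the previous install.
static string Apply(string path, byte[] payload, string? recordedHash, bool force)
{
    if (File.Exists(path))
    {
        string current = Sha256Hex(File.ReadAllBytes(path));
        if (current == Sha256Hex(payload)) return "skipped";   // hash already matches
        bool locallyModified = recordedHash is not null && current != recordedHash;
        if (locallyModified && !force) return "kept";          // protect local edits
    }
    File.WriteAllBytes(path, payload);
    return "written";
}

string target = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
byte[] v1 = Encoding.UTF8.GetBytes("skill v1");

string first  = Apply(target, v1, recordedHash: null, force: false); // file absent
string second = Apply(target, v1, Sha256Hex(v1), force: false);      // identical hash
File.WriteAllText(target, "local edit");
string third  = Apply(target, v1, Sha256Hex(v1), force: false);      // edited locally
string fourth = Apply(target, v1, Sha256Hex(v1), force: true);       // forced overwrite
File.Delete(target);

Console.WriteLine($"{first}, {second}, {third}, {fourth}"); // written, skipped, kept, written
```

Running the same install twice is a no-op on the second pass, which is what makes re-running docuoria init or docuoria update safe in scripts.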


Getting started

Prerequisites

  • .NET 10 SDK or later
  • Python 3.12 (only for the planned Python step; see Pipeline steps)

Build & test

dotnet build src/libs/Docuoria/Docuoria.csproj
dotnet test  tests/Docuoria.Tests/Docuoria.Tests.csproj

Register the engine

The engine is configured once at composition time. The default convenience helpers register all seven built‑in match rules and the output generators in a few lines:

using Docuoria.Registration;
using Docuoria.Contracts;

services.AddDocuoriaEngine(builder =>
{
    builder
        .AddBuiltInMatchRules()      // FileName, Metadata, TextPattern, TextAnchor,
                                     // PageGeometry, Table, Composite
        .AddCsvOutputGenerator()     // CsvOutputGenerator + CsvGeneratorOptions
        .AddJsonOutputGenerator();   // JsonOutputGenerator + JsonGeneratorOptions
});

If you only need a subset, drop down to the typed primitives:

services.AddDocuoriaEngine(builder =>
{
    builder.AddMatchRule<TextPatternMatchRule, TextPatternMatchRuleConfiguration>();
    builder.AddMatchRule<CompositeMatchRule,   CompositeMatchRuleConfiguration>();
    builder.AddOutputGenerator<CsvOutputGenerator, CsvGeneratorOptions>();
    builder.AddRetrievalProvider<HttpRetrievalProvider, HttpRetrievalProviderConfiguration>();
});

IDocuoriaEngine is registered as a singleton and is safe to resolve and invoke from any number of concurrent callers.


A complete, end‑to‑end example

The goal: take a PDF invoice, decide whether it looks like an invoice we recognize, extract three fields, validate them against a schema, and render the result as CSV.

1. Describe the output shape

A DataModel is the schema vocabulary the publish step enforces — every output instance must conform to it.

using Docuoria.Models;

var schema = new RecordDefinition("Invoice", new FieldDefinition[]
{
    new PrimitiveFieldDefinition("vendor", FieldType.String, isRequired: true),
    new PrimitiveFieldDefinition("total",  FieldType.String, isRequired: true),
    new PrimitiveFieldDefinition("date",   FieldType.Date,   isRequired: false),
});

var dataModel = new DataModel(schema);

2. Build the template

Everything below — match rule, extraction, optional transforms, publish — is assembled by TemplateBuilder. Each verb constrains its TStep/TConfig pair at compile time, and Build() rejects missing pieces before execution time:

using Docuoria.Configuration;
using Docuoria.MatchRules;
using Docuoria.Models;
using Docuoria.Pipeline.Extraction;
using Docuoria.Pipeline.Publish;
using Docuoria.Pipeline.Transformation;

var extractionConfig = new ExtractionStepConfiguration(new IFieldMapping[]
{
    new FieldMapping("vendor", FieldType.String,
        TextAnchorExtractionSource.Token(
            region: new PdfBounds(x: 0,   y: 0,  width: 300, height: 100),
            token:  "Vendor:")),

    new FieldMapping("total",  FieldType.String,
        TextPatternExtractionSource.Pattern(@"\$[\d,]+\.\d{2}")),

    new FieldMapping("date",   FieldType.Date,
        MetadataFieldExtractionSource.Standard(MetadataField.CreationDate),
        parseFormat: "MM/dd/yyyy"),  // locale-aware coercion (optional)
});

var transformConfig = new TransformationStepConfiguration(new FieldTransform[]
{
    new TrimTransform("vendor"),
    new CastTransform("total",  FieldType.Number),
    new FormatTransform("date", "yyyy-MM-dd"),
});

var matchRuleConfig = new TextPatternMatchRuleConfiguration
{
    Tokens    = new[] { "INVOICE", "Amount Due" },
    Mode      = TextMatchMode.AllTokens,
    Threshold = 0.8m,
};

var template = TemplateBuilder.Create("invoice-v1", dataModel)
    .WithMatchRule<TextPatternMatchRule, TextPatternMatchRuleConfiguration>(matchRuleConfig)
    .ExtractWith<ExtractionStep, ExtractionStepConfiguration>(extractionConfig)
    .ThenTransform<TransformationStep, TransformationStepConfiguration>(transformConfig)
    .PublishWith<PublishStep, PublishStepConfiguration>(new PublishStepConfiguration())
    .Build();

Template's constructor is internal; the builder is the only way in. That is by design — it lets the type system catch the "wrong config for this step" mistake at the call site, not at runtime.

3. Decide whether the template applies

EvaluateMatchRuleAsync is the cheap probe. It returns a confidence in [0, 1] and a final decision (confidence ≥ threshold):

using Docuoria.Contracts;

var engine = serviceProvider.GetRequiredService<IDocuoriaEngine>();

await using var probeStream = File.OpenRead("invoice.pdf");
var match = await engine.EvaluateMatchRuleAsync(
    probeStream,
    template.RootMatchRule,
    fileName: "invoice.pdf");

if (!match.IsMatch)
{
    Console.WriteLine($"Skipping — confidence {match.Confidence:P0} below threshold.");
    return;
}

4. Execute the template

ExecuteTemplateAsync<TGenerator, TOptions> runs the full pipeline and renders the output. The generator is chosen at compile time via the generic type parameter; the constraint where TGenerator : IOutputGenerator<TOptions> guarantees the options object you pass is exactly the shape the generator expects:

using Docuoria.Output.Csv;
using Docuoria.Results;
using System.Text;

await using var pdfStream = File.OpenRead("invoice.pdf");

ProcessingResult result = await engine.ExecuteTemplateAsync<CsvOutputGenerator, CsvGeneratorOptions>(
    pdfStream,
    template,
    new CsvGeneratorOptions { Delimiter = ',' });

switch (result)
{
    case SucceededResult success:
        var csv = Encoding.UTF8.GetString(success.Output.Payload.Span);
        Console.WriteLine(csv);                        // rendered output
        Console.WriteLine(success.Output.ContentType); // e.g. "text/csv"
        break;

    case FailedResult failure:
        // A step threw mid-pipeline.
        Console.Error.WriteLine($"Step '{failure.StepIdentifier}' failed: {failure.ErrorMessage}");
        break;

    case RejectedResult rejection:
        // The engine refused to run the request. Reason values:
        //   InvalidPdf · MalformedTemplate · UnknownOutputGenerator · GeneratorRejected
        Console.Error.WriteLine($"Rejected: {rejection.Reason} — {rejection.Detail}");
        break;
}

That is the whole loop: register, build a template, evaluate, execute, switch on the discriminated result.


Extracting repeating data

Many PDFs contain line items — table rows or repeating regex patterns — that don't fit a fixed‑field schema. Collection extraction handles this: declare a collection field in the data model, point a collection source at the repeating structure, and map sub‑fields within each element.

1. Describe the output shape with a collection field

var lineItemSchema = new RecordDefinition("LineItem", new FieldDefinition[]
{
    new PrimitiveFieldDefinition("product",   FieldType.String,  isRequired: true),
    new PrimitiveFieldDefinition("unitPrice", FieldType.Number,  isRequired: false),
    new PrimitiveFieldDefinition("quantity",  FieldType.Integer, isRequired: false),
});

var schema = new RecordDefinition("Invoice", new FieldDefinition[]
{
    new PrimitiveFieldDefinition("invoiceNumber", FieldType.String, isRequired: true),
    new RecordFieldDefinition("lineItems", lineItemSchema, isCollection: true),
});

var dataModel = new DataModel(schema);

2. Build the template with repeating field mappings

Scalar fields use regular FieldMappings. The collection field gets a RepeatingFieldMapping that pairs a collection source with sub‑field mappings:

var extractionConfig = new ExtractionStepConfiguration(new IFieldMapping[]
{
    new FieldMapping("invoiceNumber", FieldType.String,
        TextPatternExtractionSource.Pattern(@"Invoice No[\s\S]*?(\d{10})")),
});

var repeating = new RepeatingFieldMapping(
    collectionFieldName: "lineItems",
    elementDefinition:   lineItemSchema,
    subFields: new SubFieldMapping[]
    {
        new HeaderSubFieldMapping("product",   FieldType.String,  "Product"),
        new HeaderSubFieldMapping("unitPrice", FieldType.Number,  "Unit Price"),
        new HeaderSubFieldMapping("quantity",  FieldType.Integer, "Qty"),
    },
    source: TableRowsExtractionSource.ByHeader());

var template = TemplateBuilder.Create("invoice-with-lines", dataModel)
    .WithMatchRule<TextPatternMatchRule, TextPatternMatchRuleConfiguration>(matchRuleConfig)
    .ExtractWith<ExtractionStep, ExtractionStepConfiguration>(extractionConfig)
    .WithRepeatingMapping(repeating)
    .PublishWith<PublishStep, PublishStepConfiguration>(new PublishStepConfiguration())
    .Build();

Build() validates each repeating mapping against the DataModel at template construction time: the target field must exist, must be a collection, and its element record shape must match the mapping's elementDefinition — including nested records compared by name and field structure.

3. Execute — CSV flattens one row per element

The rest of the loop is unchanged — EvaluateMatchRuleAsync, then ExecuteTemplateAsync. The CSV generator flattens the result: one output row per line item, with the scalar invoiceNumber repeated on every row:

invoiceNumber,lineItems.product,lineItems.unitPrice,lineItems.quantity
6297020453,ThinkPad X1,1299.00,2
6297020453,USB-C Dock,189.00,1
6297020453,Warranty 3Y,149.00,2

Empty collections produce a header row with zero data rows.
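
To make the flattening rule concrete, here is a small sketch that reproduces the row shape shown above, including the header-only result for an empty collection. Flatten is a hypothetical helper, not part of the library, and it deliberately skips CSV quoting and escaping.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper: one output row per line item, scalar repeated on each row.
static string Flatten(string invoiceNumber, IReadOnlyList<(string Product, decimal UnitPrice, int Quantity)> lineItems)
{
    var rows = new List<string> { "invoiceNumber,lineItems.product,lineItems.unitPrice,lineItems.quantity" };
    rows.AddRange(lineItems.Select(item => string.Join(",",
        invoiceNumber,
        item.Product,
        item.UnitPrice.ToString("0.00", CultureInfo.InvariantCulture),
        item.Quantity.ToString(CultureInfo.InvariantCulture))));
    return string.Join("\n", rows);
}

var items = new (string, decimal, int)[] { ("ThinkPad X1", 1299.00m, 2), ("USB-C Dock", 189.00m, 1) };
string filled = Flatten("6297020453", items);                                     // header + 2 rows
string headerOnly = Flatten("6297020453", Array.Empty<(string, decimal, int)>()); // header only

Console.WriteLine(filled);
Console.WriteLine("---");
Console.WriteLine(headerOnly);
```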

The same pattern works with regex‑based collection sources. To extract repeating text patterns instead of table rows, swap the source and sub‑field type:

var repeating = new RepeatingFieldMapping(
    collectionFieldName: "matches",
    elementDefinition:   matchSchema,
    subFields: new SubFieldMapping[]
    {
        new NamedGroupSubFieldMapping("code", FieldType.String, groupName: "code"),
        new NamedGroupSubFieldMapping("amount", FieldType.Number, groupName: "amt"),
    },
    source: TextPatternExtractionSource.AllMatches(
        @"(?<code>\w{10})\s+\$(?<amt>\d+\.\d{2})",
        startAnchor: "LINE ITEMS",
        endAnchor: "SUBTOTAL"));

NamedGroupSubFieldMapping addresses capture groups by name — more readable and resilient to pattern edits than ordinal indices. AllMatches accepts optional startAnchor/endAnchor sentinels to restrict matching to a region of the document (the text between the first occurrence of each).
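
The anchor behaviour can be approximated with System.Text.RegularExpressions alone. The sketch below is not the engine's implementation: Region is a hypothetical helper, and two details are assumptions, namely that the end anchor is searched for after the start anchor and that a missing start anchor yields an empty region.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: text between the first startAnchor and the next endAnchor.
static string Region(string haystack, string startAnchor, string endAnchor)
{
    int start = haystack.IndexOf(startAnchor, StringComparison.Ordinal);
    if (start < 0) return string.Empty;               // assumption: no anchor, no region
    start += startAnchor.Length;
    int end = haystack.IndexOf(endAnchor, start, StringComparison.Ordinal);
    return end < 0 ? haystack[start..] : haystack[start..end];
}

string text =
    "REF ABCDEFGHIJ $1.00\nLINE ITEMS\nAAAAAAAAAA $10.00\nBBBBBBBBBB $2.50\nSUBTOTAL $12.50\nZZZZZZZZZZ $9.99";
var pattern = new Regex(@"(?<code>\w{10})\s+\$(?<amt>\d+\.\d{2})");

// Only the two rows between the anchors match; the REF and ZZZ lines are outside the region.
MatchCollection matches = pattern.Matches(Region(text, "LINE ITEMS", "SUBTOTAL"));
foreach (Match m in matches)
    Console.WriteLine($"{m.Groups["code"].Value} -> {m.Groups["amt"].Value}");
```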

When a primary extraction source might not match every PDF variant, wrap it with a fallback chain:

new FieldMapping("total", FieldType.Number,
    new FallbackExtractionSource(
        primary:  TextPatternExtractionSource.Pattern(@"Grand Total:\s*\$(\d+\.\d{2})"),
        fallback: TextPatternExtractionSource.Pattern(@"Amount Due:\s*\$(\d+\.\d{2})")));

FallbackExtractionSource tries the primary source first; if it returns null, the fallback is attempted. Fallbacks compose to arbitrary depth.
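
The try-primary-then-fallback semantics, including nesting, can be modelled with plain delegates. This is a sketch under assumptions: Pattern and Fallback are hypothetical stand-ins for the real source types, the value returned is taken to be the first capture group, and the "Balance" pattern is invented for the third level of the chain.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for a pattern source: first capture group, or null when nothing matches.
static Func<string, string?> Pattern(string regex) => text =>
{
    Match m = Regex.Match(text, regex);
    return m.Success ? m.Groups[1].Value : null;
};

// The fallback is consulted only when the primary yields null.
static Func<string, string?> Fallback(Func<string, string?> primary, Func<string, string?> fallback) =>
    text => primary(text) ?? fallback(text);

// Fallbacks nest, so a chain of any depth is just a fallback whose fallback is another fallback.
var total = Fallback(
    Pattern(@"Grand Total:\s*\$(\d+\.\d{2})"),
    Fallback(
        Pattern(@"Amount Due:\s*\$(\d+\.\d{2})"),
        Pattern(@"Balance:\s*\$(\d+\.\d{2})")));

Console.WriteLine(total("Amount Due: $42.50"));          // 42.50
Console.WriteLine(total("Balance: $7.00"));              // 7.00
Console.WriteLine(total("no totals here") ?? "(null)");  // (null)
```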


Composing match rules

Real‑world templates rarely match on a single signal. The builder exposes a nested composite sub‑builder so you can compose AND/OR/NOT trees without leaving the fluent surface:

using Docuoria.MatchRules;

var template = TemplateBuilder.Create("invoice-v2", dataModel)
    .WithCompositeMatchRule(CompositeOperator.And, root => root
        .Add<FileNameMatchRule, FileNameMatchRuleConfiguration>(
            new FileNameMatchRuleConfiguration
            {
                Pattern   = "**/invoices/**/*.pdf",
                Mode      = PatternMode.Glob,
                Threshold = 1m,
            })
        .AddComposite(CompositeOperator.Or, anyOf => anyOf
            .Add<TextPatternMatchRule, TextPatternMatchRuleConfiguration>(
                new TextPatternMatchRuleConfiguration
                {
                    Tokens = new[] { "INVOICE" }, Mode = TextMatchMode.AnyToken, Threshold = 1m,
                })
            .Add<TextPatternMatchRule, TextPatternMatchRuleConfiguration>(
                new TextPatternMatchRuleConfiguration
                {
                    Tokens = new[] { "Amount Due" }, Mode = TextMatchMode.AnyToken, Threshold = 1m,
                })))
    .ExtractWith<ExtractionStep, ExtractionStepConfiguration>(extractionConfig)
    .PublishWith<PublishStep, PublishStepConfiguration>(new PublishStepConfiguration())
    .Build();

Add<TRule, TConfig>(config, weight) adds a leaf rule with an optional aggregation weight; AddComposite(op, configure, weight) adds a nested grouping. The composite rule validates that Not has exactly one child at build time.

If you need to construct a reference outside the builder (for example to probe an ad‑hoc rule with EvaluateMatchRuleAsync), use the matching public factory:

using Docuoria.MatchRules;

IMatchRuleReference rule = MatchRuleReference.Create<TextPatternMatchRule, TextPatternMatchRuleConfiguration>(
    new TextPatternMatchRuleConfiguration { Tokens = new[] { "INVOICE" }, Threshold = 0.9m });

The same pattern exists for steps and retrieval providers via StepReference.Extraction<>(), StepReference.Transformation<>(), StepReference.Retrieval<>(), StepReference.Publish<>(), and StepReference.RetrievalProvider<>(). The builder uses these factories internally — you only reach for them directly when you need a reference outside the template assembly path.


Reference

The engine surface

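The core public contract is two methods on IDocuoriaEngine; the authoring aids covered elsewhere in this document (InspectAsync, TestPatternAsync, TestGroupsAsync, DryRunAsync) are not repeated here: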

ValueTask<MatchResult> EvaluateMatchRuleAsync(
    Stream pdfStream,
    IMatchRuleReference ruleReference,
    string? fileName = null,
    CancellationToken cancellationToken = default);

ValueTask<ProcessingResult> ExecuteTemplateAsync<TGenerator, TOptions>(
    Stream pdfStream,
    Template template,
    TOptions options,
    bool diagnostics = false,
    CancellationToken cancellationToken = default)
    where TGenerator : IOutputGenerator<TOptions>
    where TOptions   : IGeneratorOptions;

Notes on the contract:

  • The engine opens and disposes the stream you pass; you retain ownership of the underlying file or buffer.
  • Execute returns one of SucceededResult, FailedResult, or RejectedResult for domain outcomes. Null/empty arguments throw ArgumentNullException/ArgumentException — those are programmer errors, not results.
  • RejectedResult carries an optional Detail string with a human‑readable explanation when the rejection originates from an output generator.
  • Cancellation propagates as OperationCanceledException; it is never folded into a rejected result.
  • RejectionReason.UnknownOutputGenerator is returned only when TGenerator was never registered. The options‑mismatch case is impossible by construction — the generic constraints forbid it.
  • Pass diagnostics: true to attach an ExtractionDiagnostics snapshot to the result (see Extraction diagnostics below). Zero overhead when disabled.

Built‑in match rules

| Rule | What it scores | Typical use |
| --- | --- | --- |
| FileNameMatchRule | Glob or substring match on the supplied file name | Folder‑driven routing |
| MetadataMatchRule | Standard metadata fields (author, title, subject, keywords) | Vendor‑authored PDFs |
| TextPatternMatchRule | Token / regex hits across all pages | Document‑class detection |
| TextAnchorMatchRule | Text presence and location relative to an anchor | Form‑style layouts |
| PageGeometryMatchRule | Page count, dimensions, orientation, aspect ratio | Letter vs A4, statements vs reports |
| TableMatchRule | Table structure (rows, columns, cell content) | Tabular reports |
| CompositeMatchRule | Aggregates child rules under AND / OR / NOT | Layered matching |

Every leaf rule produces a confidence in [0, 1] and accepts an author‑supplied threshold; the rule fires when confidence ≥ threshold. Composites combine child confidences (weighted average for And, weighted max for Or, 1 − child for Not).
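
As a worked illustration of the aggregation rules, the sketch below computes composite confidences by hand. The helpers are hypothetical and are not the engine's code. One reading is an assumption: "weighted max" is interpreted here as the largest weight × confidence product, capped at 1.

```csharp
using System;
using System.Linq;

// And: weighted average of child confidences.
static decimal And(params (decimal Confidence, decimal Weight)[] children) =>
    children.Sum(c => c.Confidence * c.Weight) / children.Sum(c => c.Weight);

// Or: assumption, largest weight * confidence, capped at 1.
static decimal Or(params (decimal Confidence, decimal Weight)[] children) =>
    Math.Min(1m, children.Max(c => c.Confidence * c.Weight));

// Not: complement of the single child.
static decimal Not(decimal child) => 1m - child;

var fileName = (Confidence: 1.0m, Weight: 1m);
var text = (Confidence: 0.5m, Weight: 3m);

decimal andScore = And(fileName, text);            // (1.0*1 + 0.5*3) / (1 + 3) = 0.625
decimal orScore = Or((1.0m, 1m), (0.5m, 1m));      // max(1.0, 0.5) = 1.0
decimal notScore = Not(0.25m);                     // 1 - 0.25 = 0.75

// A composite fires the same way a leaf does: confidence >= threshold.
bool fires = andScore >= 0.6m;
Console.WriteLine($"And={andScore} Or={orScore} Not={notScore} fires={fires}");
```

Note how the weight of 3 on the text rule pulls the And score toward 0.5 even though the file-name rule is a perfect match.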

Pipeline steps

Every template has exactly one extraction step at the start and one publish step at the end. Between them, you can chain zero or more intermediate steps in declared order.

| Step | Position | Role |
| --- | --- | --- |
| ExtractionStep | Anchor (start) | Seeds the initial DataRecord from the PDF via extraction sources |
| TransformationStep | Intermediate | Applies a declared sequence of field transforms |
| RetrievalStep | Intermediate | Calls a registered IRetrievalProvider<TConfig> to enrich the record |
| PublishStep | Anchor (end) | Validates against the DataModel and seals the output |

A Python step is planned alongside the existing intermediate kinds; the pythonnet dependency is already in place.

Extraction sources

Extraction sources are the where — they tell the extraction step where in the PDF each field comes from. Sources come in two flavors: scalar sources that produce a single value per field, and collection sources that produce one record per row or match.

Scalar sources

Each scalar source exposes typed static factories so the call site reads like prose:

| Source | Factories |
| --- | --- |
| TextPatternExtractionSource | .Token(token, pageNumber?, caseSensitive?, blockSeparator?), .Pattern(regex, pageNumber?, caseSensitive?, matchTimeout?, blockSeparator?) |
| TextAnchorExtractionSource | .Token(region, token, pageNumber?, caseSensitive?), .Pattern(region, regex, pageNumber?, caseSensitive?) |
| MetadataFieldExtractionSource | .Standard(MetadataField), .Raw(rawKey) |
| TableCellExtractionSource | .Ordinal(rowIndex, columnIndex, tableIndex?, pageNumber?), .ByHeader(rowIndex, headerToken, …) |
| FallbackExtractionSource | new FallbackExtractionSource(primary, fallback); composable try/else chain |

Anchor regions use PdfBounds(x, y, width, height) in PDF points (top‑left origin, 1/72 inch units, rotation‑normalized).
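
When measuring a region off a page, the unit and origin conversions are simple arithmetic. The helpers below are hypothetical and independent of the library. They assume the measurement you start from uses the bottom-left origin that PDF user space conventionally has, which then needs flipping to the top-left origin described above.

```csharp
using System;

// 1 inch = 72 PDF points.
static double InchesToPoints(double inches) => inches * 72.0;

// Flip a box's y coordinate from a bottom-left origin to a top-left origin.
static double ToTopLeftY(double pageHeight, double bottomLeftY, double height) =>
    pageHeight - (bottomLeftY + height);

double letterWidth = InchesToPoints(8.5);   // 612 points
double letterHeight = InchesToPoints(11);   // 792 points

// A 100-point-tall box whose bottom edge sits 692 points above the page bottom
// touches the top edge, so its top-left y is 0.
double topLeftY = ToTopLeftY(letterHeight, bottomLeftY: 692, height: 100);

Console.WriteLine($"Letter: {letterWidth} x {letterHeight} pt, box y = {topLeftY}");
```

On a US Letter page, a region of x: 0, y: 0, width: 300, height: 100 therefore covers roughly the top-left 4.2 × 1.4 inches.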

matchTimeout guards against ReDoS on untrusted regex patterns (defaults to infinite — v1.1 behavior preserved). blockSeparator controls how text blocks are joined into the search haystack (defaults to "\n").
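
For context, this is how a match timeout behaves in the underlying .NET regex API. TryMatch is a hypothetical wrapper, not a library call. Whether the second pattern actually hits the timeout depends on the runtime's regex optimizations, so the sketch prints the outcome rather than assuming it.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper: report a timeout as data instead of letting the exception escape.
static (bool Matched, bool TimedOut) TryMatch(string pattern, string input, TimeSpan timeout)
{
    try
    {
        return (Regex.IsMatch(input, pattern, RegexOptions.None, timeout), false);
    }
    catch (RegexMatchTimeoutException)
    {
        return (false, true);
    }
}

// A well-behaved pattern completes far inside the budget.
var benign = TryMatch(@"\$[\d,]+\.\d{2}", "Total $1,299.00", TimeSpan.FromSeconds(1));

// Nested quantifiers on a non-matching input are the classic backtracking trap;
// with a finite timeout the call returns either way instead of hanging.
var risky = TryMatch(@"^(a+)+$", new string('a', 40) + "!", TimeSpan.FromMilliseconds(200));

Console.WriteLine($"benign: matched={benign.Matched}, timedOut={benign.TimedOut}");
Console.WriteLine($"risky:  matched={risky.Matched}, timedOut={risky.TimedOut}");
```

Because the default is infinite, templates that accept patterns from untrusted authors are the case where passing an explicit matchTimeout matters most.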

Collection sources

Collection sources iterate a repeating structure in the PDF and yield one record per element. They are paired with a RepeatingFieldMapping that describes the sub‑field addressing within each element:

| Source | Factories | Sub‑field type |
| --- | --- | --- |
| TableRowsExtractionSource | .ByHeader(tableIndex?, pageNumber?, headerRowIndex?, caseSensitiveHeader?), .Ordinal(tableIndex?, pageNumber?, skipRows?) | HeaderSubFieldMapping or OrdinalSubFieldMapping |
| TextPatternExtractionSource | .AllMatches(regex, pageNumber?, caseSensitive?, matchTimeout?, blockSeparator?, startAnchor?, endAnchor?) | RegexGroupSubFieldMapping or NamedGroupSubFieldMapping |

TableRowsExtractionSource.ByHeader resolves sub‑field names against a header row; Ordinal addresses columns by zero‑based index. AllMatches produces one record per non‑overlapping regex match, with capture groups projected through sub‑field mappings. Optional startAnchor/endAnchor sentinels restrict matching to the text region between the first occurrence of each.

Sub‑field mappings describe how to locate each field within a single element:

| Sub‑field mapping | Addressing |
| --- | --- |
| HeaderSubFieldMapping(fieldName, fieldType, headerToken, caseSensitive?) | Table column by header text |
| OrdinalSubFieldMapping(fieldName, fieldType, columnIndex) | Table column by zero‑based index |
| RegexGroupSubFieldMapping(fieldName, fieldType, groupIndex) | Regex capture group by index (≥ 1) |
| NamedGroupSubFieldMapping(fieldName, fieldType, groupName) | Regex capture group by name |

Transforms

Transforms are the how — each one rewrites one or more fields in declared order. They run inside a TransformationStep configured with an ordered FieldTransform[]:

| Transform | Example |
| --- | --- |
| TrimTransform | new TrimTransform("vendor") |
| CastTransform | new CastTransform("total", FieldType.Number) |
| FormatTransform | new FormatTransform("date", "yyyy-MM-dd") |
| RenameTransform | new RenameTransform("amt", "amount") |
| ComputeTransform | new ComputeTransform("tax", ComputeOperator.Multiply, new[] { "subtotal", "rate" }) |
| CollectionElementTransform | new CollectionElementTransform("lineItems", new FieldTransform[] { new TrimTransform("product") }) |

CollectionElementTransform applies a sequence of inner transforms to each element's record within a named collection field — useful for trimming, casting, or renaming fields inside repeating data without writing per‑element boilerplate.
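
Conceptually this is a nested loop: for every element, run every inner transform in declared order. The sketch below models that with dictionaries and delegates. All names in it are hypothetical stand-ins, not the library's record or transform types.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical transforms: each rewrites one field of one element in place.
static Action<Dictionary<string, object?>> Trim(string field) =>
    element => { if (element[field] is string s) element[field] = s.Trim(); };

static Action<Dictionary<string, object?>> CastToNumber(string field) =>
    element => { if (element[field] is string s) element[field] = decimal.Parse(s, CultureInfo.InvariantCulture); };

// Apply the inner transforms, in order, to every element of the collection.
static void ForEachElement(
    IEnumerable<Dictionary<string, object?>> collection,
    params Action<Dictionary<string, object?>>[] inner)
{
    foreach (var element in collection)
        foreach (var transform in inner)
            transform(element);
}

var lineItems = new List<Dictionary<string, object?>>
{
    new() { ["product"] = "  ThinkPad X1 ", ["unitPrice"] = "1299.00" },
    new() { ["product"] = "USB-C Dock  ",   ["unitPrice"] = "189.00" },
};

ForEachElement(lineItems, Trim("product"), CastToNumber("unitPrice"));

foreach (var item in lineItems)
    Console.WriteLine($"{item["product"]} | {item["unitPrice"]}");
```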

Data model

The schema vocabulary used by the publish step:

  • Primitive types: String, Number, Integer, Boolean, Date, Timestamp
  • Records: composite named‑field values via RecordDefinition; nestable through RecordFieldDefinition
  • Collections: ordered, repeated values (isCollection: true on any field definition)
  • Optionality: each field declares isRequired; enforced by PublishStep

Output generators

An output generator renders a sealed DataModelInstance into a concrete format:

| Generator | Options | Content‑type | Collection handling |
| --- | --- | --- | --- |
| CsvOutputGenerator | CsvGeneratorOptions { Delimiter } | text/csv | Flattened: one row per element, scalars repeated. Multiple independent collections rejected. |
| JsonOutputGenerator | JsonGeneratorOptions { Indented, OmitNulls } | application/json | Natural: arrays at any depth, recursive record nesting. No single‑collection restriction. |

Both plug in through AddOutputGenerator<TGen, TOptions> and consume the same generic ExecuteTemplateAsync<,> overload. XML is on the roadmap.


Extraction diagnostics

Template authoring is iterative: you need to see what the engine "sees" before you can write correct field mappings. Pass diagnostics: true to get a zero‑allocation‑when‑disabled snapshot of the extraction internals:

using Docuoria.Diagnostics;

var result = await engine.ExecuteTemplateAsync<JsonOutputGenerator, JsonGeneratorOptions>(
    pdfStream, template, new JsonGeneratorOptions(),
    diagnostics: true);

if (result is SucceededResult success && success.Diagnostics is { } diag)
{
    // The flattened text haystack the engine matched against:
    Console.WriteLine(diag.Haystack);

    // Per-mapping trace — did each field match? Where?
    foreach (var trace in diag.MappingTraces)
    {
        Console.WriteLine($"{trace.FieldName}: matched={trace.Matched}, text={trace.MatchedText}");
        if (trace.MatchIndex is not null)
            Console.WriteLine($"  offset={trace.MatchIndex} len={trace.MatchLength}");
        if (trace.NamedGroups is { Count: > 0 } groups)
            Console.WriteLine($"  groups={string.Join(", ", groups.Select(g => $"{g.Key}={g.Value}"))}");
    }

    // Raw block inventory with bounding boxes (PDF points):
    foreach (var block in diag.Blocks)
        Console.WriteLine($"  p{block.PageNumber}: [{block.X},{block.Y} {block.Width}×{block.Height}] {block.Content}");
}

You can also inspect the engine's text haystack directly without executing a template:

var haystack = TextSearch.ExtractText(pdfDocument);

TextSearch lives in Docuoria.Diagnostics and accepts optional pageNumber and blockSeparator parameters.


Dry-run for debugging

DryRunAsync executes a template's extraction + intermediate stages against a PDF and returns the projected record without running the publish step. Use it for template authoring, integration smoke tests, and field-level failure diagnosis — no output sink is required.

using var pdf = File.OpenRead("invoice.pdf");
var result = await engine.DryRunAsync(pdf, template, new DryRunOptions
{
    Diagnostics = true,           // collect MappingTrace per field (default true)
    IncludeRawHaystack = false,   // include extracted PDF text (opt-in; can be large)
    PageFilter = null,            // optional page subset
});

switch (result)
{
    case DryRunSucceeded ok:
        // ok.JsonProjection: IReadOnlyDictionary<string, object?>
        // ok.Diagnostics:   IReadOnlyList<MappingTrace>? (null when Diagnostics=false)
        // ok.RawHaystack:   string? (null unless IncludeRawHaystack=true)
        break;
    case DryRunFailed fail:
        // fail.Step (Retrieval/Extraction/Transformation/Publish/Unknown)
        // fail.FieldPath, fail.SourceText (≤256 chars, …-truncated),
        // fail.TargetTypeName, fail.InnerDetail
        break;
    case DryRunRejected rej:
        // rej.Reason: InvalidPdf | MalformedTemplate | ...
        break;
}

The same enrichment fields are now also present on FailedResult returned by ExecuteTemplateAsync — when a coercion fails, Step, FieldPath, SourceText, TargetTypeName, and InnerDetail are populated so callers can pinpoint the offending field without parsing exception messages.


Template storage

ITemplateStoreProvider (under Docuoria.Storage) abstracts how templates persist — SaveAsync / LoadAsync / ListAsync / DeleteAsync. The bundled LocalFileTemplateStoreProvider writes each template as {identifier}.json under a root directory using atomic temp-file + rename. ApiTemplateStoreProvider is the HTTP transport (see Hosted Template Store API below) wired through the same DI surface.

// Local filesystem provider
services.AddDocuoriaEngine(builder => builder.AddLocalTemplateStore("./templates"));

// HTTP provider (talks to Docuoria.Api)
services.AddDocuoriaEngine(builder =>
    builder.AddApiTemplateStore(
        new Uri("https://api.example.com/"),
        new ApiTemplateStoreCredentials { FunctionKey = "..." }));

Only one ITemplateStoreProvider is registered at a time. If AddLocalTemplateStore and AddApiTemplateStore are both called on the same builder, the later call replaces the earlier registration (last call wins).

Identifier safety. Identifiers must match [A-Za-z0-9_-]+ and be ≤ 200 characters. Path-traversal attempts (.., /, \, :) are rejected with InvalidTemplateIdentifierException before any path math runs.
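The identifier rule can be pre-checked on the caller's side before a template name ever reaches a store. The following is a minimal sketch of the documented rule, not the library's own validator: the class name TemplateIdentifierRules is hypothetical, and it returns a bool where the library throws InvalidTemplateIdentifierException.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

// Hypothetical helper mirroring the documented rule; not a Docuoria type.
public static class TemplateIdentifierRules
{
    public const int MaxLength = 200;

    // Anchored form of [A-Za-z0-9_-]+ ; \z rather than $ so a trailing
    // newline cannot slip through.
    private static readonly Regex Allowed =
        new Regex(@"\A[A-Za-z0-9_-]+\z", RegexOptions.CultureInvariant);

    public static bool IsValid(string? identifier) =>
        identifier is not null
        && identifier.Length <= MaxLength
        && Allowed.IsMatch(identifier);
}
```

Because the allowed character set excludes ".", "/", "\" and ":", every path-traversal form listed above fails the pattern without any path handling.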

Round-trip contract. Save → Load → ToJson is byte-identical to the original ToJson (UTF-8, no BOM). Missing identifiers surface as null from LoadAsync and false from DeleteAsync — there is intentionally no TemplateNotFoundException.
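To make the storage behaviour concrete, here is a minimal sketch of a file store that writes through a temp file and rename, emits UTF-8 without a BOM, and reports missing identifiers as null or false. It is an illustration under assumptions, not LocalFileTemplateStoreProvider itself: the class SketchFileStore is hypothetical, it is synchronous for brevity, and it omits identifier validation.

```csharp
#nullable enable
using System;
using System.IO;
using System.Text;

// Hypothetical illustration; not the bundled provider.
public sealed class SketchFileStore
{
    // UTF-8 encoder that never writes a byte-order mark.
    private static readonly UTF8Encoding Utf8NoBom = new UTF8Encoding(false);
    private readonly string _root;

    public SketchFileStore(string root)
    {
        _root = root;
        Directory.CreateDirectory(root);
    }

    private string PathFor(string identifier) =>
        Path.Combine(_root, identifier + ".json");

    public void Save(string identifier, string json)
    {
        var target = PathFor(identifier);
        // Temp file sits in the same directory so the move stays on one volume.
        var temp = target + "." + Guid.NewGuid().ToString("N") + ".tmp";
        File.WriteAllText(temp, json, Utf8NoBom);
        File.Move(temp, target, overwrite: true);
    }

    // Missing identifiers yield null rather than an exception.
    public string? Load(string identifier)
    {
        var path = PathFor(identifier);
        return File.Exists(path) ? File.ReadAllText(path, Utf8NoBom) : null;
    }

    // Missing identifiers yield false rather than an exception.
    public bool Delete(string identifier)
    {
        var path = PathFor(identifier);
        if (!File.Exists(path)) return false;
        File.Delete(path);
        return true;
    }
}
```

Writing to a sibling temp file first means a reader never observes a half-written {identifier}.json: it sees either the previous content or the new content.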


Hosted Template Store API

Docuoria.Api (under src/hosts/) is an Azure Functions isolated-worker host that exposes the same ITemplateStoreProvider surface over HTTP. It is the production transport for ApiTemplateStoreProvider; teams that need a shared template catalog point their SDK at the host instead of the local file provider.

Run locally

cd src/hosts/Docuoria.Api
Copy-Item local.settings.json.template local.settings.json
# Edit local.settings.json: set TemplateStore__RootPath to an absolute path.
func host start

The host listens on http://localhost:7071/ by default. On Azure, set TemplateStore__RootPath = D:\home\site\templates (mounted persistent storage).

Endpoints

Method  Route                Auth      Success         Problem types (RFC 7807)
GET     /api/health          anon      200 ok
POST    /api/templates       function  201 + Location  400 template-validation-failed, 409 template-already-exists, 415, 500
GET     /api/templates       function  200 {items:[]}  500 internal-error
GET     /api/templates/{id}  function  200 JSON        400 invalid-identifier, 404 template-not-found
PUT     /api/templates/{id}  function  200 / 201       400 invalid-identifier, 415, 500
DELETE  /api/templates/{id}  function  204             400 invalid-identifier, 404 template-not-found

All success responses send Cache-Control: no-store; all problem responses are application/problem+json with the same Cache-Control: no-store.
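A client that calls these endpoints directly can read the standard RFC 7807 members out of a problem body with System.Text.Json. The sketch below is an assumption-laden illustration: ProblemInfo and ProblemParser are hypothetical names, and since the exact form of the host's "type" value (bare slug or full URI) is not shown here, the value is returned as-is.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical holder for the standard RFC 7807 members.
public sealed record ProblemInfo(string? Type, string? Title, int? Status, string? Detail);

public static class ProblemParser
{
    // Parses an application/problem+json body; absent members come back as null.
    public static ProblemInfo Parse(string body)
    {
        using var doc = JsonDocument.Parse(body);
        var root = doc.RootElement;
        if (root.ValueKind != JsonValueKind.Object)
            return new ProblemInfo(null, null, null, null);

        int? status = root.TryGetProperty("status", out var s)
                      && s.ValueKind == JsonValueKind.Number
            ? (int?)s.GetInt32()
            : null;

        return new ProblemInfo(
            GetString(root, "type"),
            GetString(root, "title"),
            status,
            GetString(root, "detail"));
    }

    private static string? GetString(JsonElement obj, string name) =>
        obj.TryGetProperty(name, out var v) && v.ValueKind == JsonValueKind.String
            ? v.GetString()
            : null;
}
```

A caller would typically branch on Status first (404 versus 409, for example) and use Type only to tell apart problems that share a status code.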

curl examples

KEY="<function-key>"
BASE="http://localhost:7071"

curl -sS "$BASE/api/health"
curl -sS -H "x-functions-key: $KEY" -H "Content-Type: application/json" \
     -d @template.json -X POST "$BASE/api/templates"
curl -sS -H "x-functions-key: $KEY" "$BASE/api/templates"
curl -sS -H "x-functions-key: $KEY" "$BASE/api/templates/my-template"
curl -sS -H "x-functions-key: $KEY" -H "Content-Type: application/json" \
     -d @template.json -X PUT "$BASE/api/templates/my-template"
curl -sS -H "x-functions-key: $KEY" -X DELETE "$BASE/api/templates/my-template"

Point the SDK at the API

services.AddDocuoriaEngine(b => b.AddApiTemplateStore(
    new Uri("http://localhost:7071/"),
    new ApiTemplateStoreCredentials { FunctionKey = "<KEY>" }));

Credential precedence — exactly one header is sent per request:

  1. FunctionKey → x-functions-key
  2. ApiKey → X-Api-Key
  3. BearerToken → Authorization: Bearer <value>
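The precedence order can be read as a first-match selection. The sketch below is one way to express it, not the SDK's implementation: SketchCredentials is a hypothetical stand-in that borrows the three property names from the list above, and treating an empty string as "not set" is an assumption.

```csharp
#nullable enable
using System;

// Hypothetical stand-in; the real type is ApiTemplateStoreCredentials.
public sealed record SketchCredentials(
    string? FunctionKey = null,
    string? ApiKey = null,
    string? BearerToken = null);

public static class CredentialHeader
{
    // Returns the single header to send, or null when no credential is set.
    public static (string Name, string Value)? Select(SketchCredentials c)
    {
        if (!string.IsNullOrEmpty(c.FunctionKey))
            return ("x-functions-key", c.FunctionKey);
        if (!string.IsNullOrEmpty(c.ApiKey))
            return ("X-Api-Key", c.ApiKey);
        if (!string.IsNullOrEmpty(c.BearerToken))
            return ("Authorization", "Bearer " + c.BearerToken);
        return null;
    }
}
```

With all three values populated, only x-functions-key is produced; the other two are ignored rather than sent alongside it.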

Privacy

Only template JSON crosses the wire. The host never accepts, stores, or processes PDF bytes. Any request body other than application/json is rejected with 415 template-validation-failed. This invariant is asserted by PrivacyInvariantTests, which reflects over every [OpenApiOperation]-decorated method and fails the build if a future endpoint declares an application/pdf, application/octet-stream, or multipart/form-data body.
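The body-type rule amounts to an allowlist of one media type. As a rough sketch of such a gate (the host's actual check is not shown here, and JsonBodyGate is a hypothetical name), the Content-Type header can be parsed with MediaTypeHeaderValue so that parameters such as charset do not affect the decision:

```csharp
#nullable enable
using System;
using System.Net.Http.Headers;

// Hypothetical gate; illustrates the rule, not the host's code.
public static class JsonBodyGate
{
    // True only when the media type is application/json (parameters ignored).
    public static bool IsJson(string? contentType) =>
        MediaTypeHeaderValue.TryParse(contentType, out var parsed)
        && string.Equals(parsed.MediaType, "application/json",
                         StringComparison.OrdinalIgnoreCase);
}
```

A request whose header fails this check would be the case that maps to the 415 response described above.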


Design principles

  • Deterministic — same inputs produce the same outputs, every time.
  • Stateless — the engine holds no per‑invocation state; resolve as a singleton, invoke concurrently.
  • Immutable pipeline state — each step returns a new record; nothing in the pipeline is mutated in place.
  • Typed at the call site — step ↔ config and generator ↔ options pairings are enforced by generic constraints. Mismatches don't compile.
  • No PDF library leakage — PdfPig types never appear in public contracts.
  • Fixed by name, open by contract — component types are enumerated by name, but the contracts admit new rules, steps, and generators without engine changes.
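The "immutable pipeline state" principle maps naturally onto C# records and immutable collections. The sketch below shows the general pattern only; PipelineState and SketchSteps are hypothetical names and do not reflect the engine's internal types.

```csharp
#nullable enable
using System;
using System.Collections.Immutable;

// Hypothetical state record; the engine's own state type is not shown here.
public sealed record PipelineState(ImmutableDictionary<string, object?> Fields);

public static class SketchSteps
{
    // Returns a new state with the field set; the input state is left untouched.
    public static PipelineState SetField(PipelineState state, string name, object? value) =>
        state with { Fields = state.Fields.SetItem(name, value) };
}
```

Because every step hands back a fresh record, two concurrent callers that start from the same state cannot observe each other's writes.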

Repository layout

src/libs/Docuoria/         Engine library (this README's subject)
src/libs/Docuoria.dotnet/  Layer-3 .NET client SDK (placeholder)
src/libs/Docuoria.nodejs/  Layer-3 Node.js client SDK (placeholder)
src/hosts/Docuoria.Api/    Layer-3 REST host (Hosted Template Store API)
src/hosts/Docuoria.Portal/ Layer-3 portal (placeholder)
tests/Docuoria.Tests/      Unit + integration tests for the engine
specs/                     Product spec and deferred technical details
.planning/                 Phase plans, decisions, verification records

Roadmap

v1.3 (current) — Usability enhancements: JSON output generator, extraction diagnostics (opt‑in haystack + per‑mapping traces + block inventory), FallbackExtractionSource, NamedGroupSubFieldMapping, CollectionElementTransform, anchor‑scoped AllMatches (start/end sentinels), configurable blockSeparator, matchTimeout for ReDoS safety, locale‑aware coercion (parseFormat / cultureName on FieldMapping), rejection detail on RejectedResult.

v1.2 — Collection extraction: TableRowsExtractionSource and TextPatternExtractionSource.AllMatches for repeating data, RepeatingFieldMapping with typed sub‑field addressing (HeaderSubFieldMapping, OrdinalSubFieldMapping, RegexGroupSubFieldMapping), TemplateBuilder.WithRepeatingMapping() with build‑time schema validation, and CSV collection flattening (one row per element, scalar fields repeated).

v1.1 — Layer 1 engine with compile‑time typed step / rule references, fluent TemplateBuilder, generic ExecuteTemplateAsync<TGenerator, TOptions> overload, and convenience registration helpers. All seven built‑in match rules, all four step kinds, CSV output, HTTP retrieval provider.

Future milestones:

  • Layer 2 — Service layer (accounts, template stores, submission lifecycle, template resolution).
  • Layer 3 — REST API, .NET client SDK, Node.js client SDK, web portal.
  • Additional match rules (Font Fingerprint, Structural, Image Profile, Embedded Content, LLM Rule).
  • Additional output generators (XML).
  • LLM‑powered template generation.

Agent Scripts

The scripts/ directory packages each engine verb (inspect, test-pattern, test-groups, validate-template, dry-run, execute, evaluate-match, classify, list-templates, load-template, save-template) as a dotnet-script CLI with a deterministic JSON contract — single-line JSON on stdout for success, structured { "error": { "code", "message", "detail" } } on stderr with non-zero exit. The suite is designed for LLM agents and CI pipelines that need a typed, stateless surface over the SDK.

dotnet tool install -g dotnet-script
dotnet script scripts/classify.csx -- --pdf path\to\file.pdf

Template-store backed scripts accept --store-path, --store-url, and --store-key flags to configure the template store backend. See scripts/README.md for per-script arguments, output schemas, exit codes, and worked examples.
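A caller that shells out to one of these scripts can interpret the result from the exit code and the two streams. The sketch below follows the contract as described above (JSON on stdout for success, an "error" object on stderr otherwise); ScriptOutcome and ScriptContract are hypothetical names, and treating "code" and "message" as JSON strings is an assumption, since per-script schemas live in scripts/README.md.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical result holder for one script invocation.
public sealed record ScriptOutcome(bool Ok, JsonElement? Result, string? ErrorCode, string? ErrorMessage);

public static class ScriptContract
{
    // Exit code 0: parse stdout as the result. Otherwise: parse stderr's "error" object.
    public static ScriptOutcome Interpret(int exitCode, string stdout, string stderr)
    {
        if (exitCode == 0)
        {
            using var doc = JsonDocument.Parse(stdout);
            // Clone so the element outlives the disposed document.
            return new ScriptOutcome(true, doc.RootElement.Clone(), null, null);
        }

        using var err = JsonDocument.Parse(stderr);
        var error = err.RootElement.GetProperty("error");
        return new ScriptOutcome(
            false,
            null,
            error.GetProperty("code").GetString(),
            error.GetProperty("message").GetString());
    }
}
```

The process launch itself is left out; any runner that captures the exit code, stdout, and stderr separately can feed this function.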


License

License to be determined.

Product  Compatible and additional computed target framework versions
.NET     net10.0 is compatible. net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, and net10.0-windows were computed.

This package has no dependencies.

Version  Downloads  Last Updated
1.0.15   44         6/3/2026
1.0.14   41         6/3/2026
1.0.13   41         6/3/2026