Docuoria
Template-driven structured data extraction for PDFs. You bring the PDF and a template; the engine matches, extracts, transforms, and renders structured output.
What this library is
Docuoria is a stateless, in-process .NET 10 library built around a single mental model:
- Three nouns — Match Rules, Templates, Output Generators
- Two verbs — Evaluate Match Rule and Execute Template
Match rules decide whether a template applies to a given PDF. A template binds a class of PDFs to a structured output: a root match rule, a data schema, and an ordered pipeline (extraction → optional transforms/retrievals → publish). Output generators render the sealed result into a concrete format.
The engine is deterministic by design: same PDF + same template + same generator options ⇒ same bytes out, every time. No mutable per-call state, no hidden coordination, no surprises in concurrent callers.
The product is layered. This repository delivers Layer 1 — the engine itself:
- Layer 3, Transports & Clients: REST API · .NET SDK · Node.js SDK (future)
- Layer 2, Service: Accounts · Template Store · Submission lifecycle (future)
- Layer 1, Engine (this library): Stateless primitives · Match Rules · Steps · Output Generators
Quickstart
Install the package:
dotnet add package Docuoria
Wire the engine, build a one-field template, and extract — copy-paste runnable:
using Microsoft.Extensions.DependencyInjection;
using Docuoria.Configuration;
using Docuoria.Contracts;
using Docuoria.MatchRules;
using Docuoria.Models;
using Docuoria.Output.Csv;
using Docuoria.Pipeline.Extraction;
using Docuoria.Pipeline.Publish;
using Docuoria.Registration;
using Docuoria.Results;
var services = new ServiceCollection();
services.AddDocuoriaEngine(b => b.AddBuiltInMatchRules().AddCsvOutputGenerator());
var engine = services.BuildServiceProvider().GetRequiredService<IDocuoriaEngine>();
var schema = new RecordDefinition("Invoice", new FieldDefinition[]
{
new PrimitiveFieldDefinition("vendor", FieldType.String, isRequired: true),
});
var template = TemplateBuilder.Create("quickstart", new DataModel(schema))
.WithMatchRule<FileNameMatchRule, FileNameMatchRuleConfiguration>(
new FileNameMatchRuleConfiguration { Pattern = "**/*.pdf", Threshold = 1m })
.ExtractWith<ExtractionStep, ExtractionStepConfiguration>(new ExtractionStepConfiguration(new IFieldMapping[]
{
new FieldMapping("vendor", FieldType.String, MetadataFieldExtractionSource.Standard(MetadataField.Title)),
}))
.PublishWith<PublishStep, PublishStepConfiguration>(new PublishStepConfiguration())
.Build();
await using var pdf = File.OpenRead("invoice.pdf");
var result = await engine.ExecuteTemplateAsync<CsvOutputGenerator, CsvGeneratorOptions>(pdf, template, new CsvGeneratorOptions());
if (result is SucceededResult ok)
Console.WriteLine(System.Text.Encoding.UTF8.GetString(ok.Output.Payload.Span));
The walkthrough above is verified by ReadmeWalkthroughTests in the SDK test suite, so it cannot silently drift from the live engine surface.
Inspecting a PDF
Template authors often need to see what the engine sees before committing to a match rule
or template. The v1.4 InspectAsync API returns a read-only projection of a PDF — page
count, info-dictionary metadata, per-page flattened text, raw text blocks with bounds, and
capped table previews — without evaluating any rules or templates.
using Microsoft.Extensions.DependencyInjection;
using Docuoria.Configuration;
using Docuoria.Contracts;
using Docuoria.Models;
using Docuoria.Registration;
var services = new ServiceCollection();
services.AddDocuoriaEngine(builder => builder.AddBuiltInMatchRules());
var provider = services.BuildServiceProvider();
var engine = provider.GetRequiredService<IDocuoriaEngine>();
using var pdf = File.OpenRead("invoice.pdf");
// Inspect only page 1; cap each table preview at 5 body rows.
var inspection = await engine.InspectAsync(
pdf,
pageFilter: PageFilter.SinglePage(1),
options: new InspectOptions { MaxTablePreviewRows = 5 });
Console.WriteLine($"Pages: {inspection.PageCount}, Author: {inspection.Metadata.Author}");
foreach (var page in inspection.Pages)
{
Console.WriteLine($"--- Page {page.PageNumber} ---");
Console.WriteLine(page.FlattenedText);
foreach (var table in page.Tables)
Console.WriteLine($"Table: {table.TotalRowCount} rows total, header = [{string.Join(", ", table.HeaderPreview)}]");
}
InspectAsync is read-only and never throws on adversarial PDF input — unparseable files
return an empty result (PageCount = 0).
The companion TestPatternAsync and TestGroupsAsync APIs run a regex against the same
flattened haystack and return structured match / gap / per-group diagnostics — useful for
iterating on extraction patterns before wiring them into a template.
VS Code Copilot Skill
An LLM agent working in this repo can load the docuoria skill to learn the canonical extraction workflow, the extraction-source decision tree, an illustrative regex library, the failure-mode decision tree, and the local-processing privacy guarantee.
The skill follows the agentskills.io standard and is built as a portable AI plugin package — SKILL.md + references/ + scripts/ + assets/lib/ (bundled SDK DLL) + assets/schemas/ + examples/ + MANIFEST.json (per-file SHA-256). The same package shape is consumed in-repo by Copilot and is the unit we distribute to downstream agents.
- Source: skills/docuoria/ (flat markdown), scripts/*.csx, src/libs/Docuoria/.
- Build the package: ./skills/build.ps1 builds the SDK in Release, assembles dist/docuoria/ per the agentskills.io layout, rewrites in-package path references (_common.csx #r, example cross-doc links), generates MANIFEST.json, and mirrors the result to .github/skills/docuoria/ for in-repo Copilot.
- Drift check: ./skills/build.ps1 -Check re-hashes dist/docuoria/ against its MANIFEST.json and exits non-zero on drift. CI gate.
The skill is documentation + thin CLI — it does not change engine or host behaviour.
Installing the skill into your agent
The Docuoria AI skill ships as both an npm global package and a .NET global tool. Both produce an identical payload — choose whichever runtime is already in your environment.
npm (Node.js ≥ 20):
npm install -g @sidub/docuoria
docuoria init
.NET global tool:
dotnet tool install -g Docuoria.Cli
docuoria init
Both commands launch an interactive tool-picker showing all supported AI tools with their detection status. Select the tools you want, press Enter, and the skill files are scaffolded into the correct directories.
Non-interactive / scripted use:
# Install for specific tools
docuoria init --tools claude,cursor
# Install for all tools
docuoria init --tools all
# Re-apply after a CLI update
docuoria update
# See supported tools and their status
docuoria list-tools
# Check for drift
docuoria doctor
The installer is idempotent — files whose SHA-256 already matches are skipped, and locally modified files are not overwritten unless you pass --force.
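The skip/keep/overwrite rule can be sketched as a small decision function. This is a minimal illustration, not the installer's actual code: Sha256Hex and Decide are hypothetical names, and the sketch assumes that any on-disk file whose hash differs from the packaged one counts as locally modified.

```csharp
#nullable enable
using System;
using System.Security.Cryptography;
using System.Text;

// Hex-encoded SHA-256 of a byte payload.
static string Sha256Hex(byte[] content) =>
    Convert.ToHexString(SHA256.HashData(content)).ToLowerInvariant();

// Hypothetical decision rule for a single file.
// onDisk is null when the target file does not exist yet.
static string Decide(byte[]? onDisk, byte[] packaged, bool force)
{
    if (onDisk is null) return "write";                            // first install
    if (Sha256Hex(onDisk) == Sha256Hex(packaged)) return "skip";   // already up to date
    return force ? "overwrite" : "keep-local";                     // differs from the package
}

var packaged = Encoding.UTF8.GetBytes("# docuoria skill\n");
var edited = Encoding.UTF8.GetBytes("# docuoria skill (edited)\n");

Console.WriteLine(Decide(null, packaged, force: false));     // write
Console.WriteLine(Decide(packaged, packaged, force: false)); // skip
Console.WriteLine(Decide(edited, packaged, force: false));   // keep-local
Console.WriteLine(Decide(edited, packaged, force: true));    // overwrite
```

Because the decision depends only on content hashes, running it twice in a row yields "skip" for every file the first run wrote.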
Getting started
Prerequisites
- .NET 10 SDK or later
- Python 3.12 (only if you intend to use the Python step)
Build & test
dotnet build src/libs/Docuoria/Docuoria.csproj
dotnet test tests/Docuoria.Tests/Docuoria.Tests.csproj
Register the engine
The engine is configured once at composition time. The default convenience helpers register all seven built‑in match rules and the output generators in a few lines:
using Docuoria.Registration;
using Docuoria.Contracts;
services.AddDocuoriaEngine(builder =>
{
builder
.AddBuiltInMatchRules() // FileName, Metadata, TextPattern, TextAnchor,
// PageGeometry, Table, Composite
.AddCsvOutputGenerator() // CsvOutputGenerator + CsvGeneratorOptions
.AddJsonOutputGenerator(); // JsonOutputGenerator + JsonGeneratorOptions
});
If you only need a subset, drop down to the typed primitives:
services.AddDocuoriaEngine(builder =>
{
builder.AddMatchRule<TextPatternMatchRule, TextPatternMatchRuleConfiguration>();
builder.AddMatchRule<CompositeMatchRule, CompositeMatchRuleConfiguration>();
builder.AddOutputGenerator<CsvOutputGenerator, CsvGeneratorOptions>();
builder.AddRetrievalProvider<HttpRetrievalProvider, HttpRetrievalProviderConfiguration>();
});
IDocuoriaEngine is registered as a singleton and is safe to resolve and invoke from any number of concurrent callers.
A complete, end‑to‑end example
The goal: take a PDF invoice, decide whether it looks like an invoice we recognize, extract three fields, validate them against a schema, and render the result as CSV.
1. Describe the output shape
A DataModel is the schema vocabulary the publish step enforces — every output instance must conform to it.
using Docuoria.Models;
var schema = new RecordDefinition("Invoice", new FieldDefinition[]
{
new PrimitiveFieldDefinition("vendor", FieldType.String, isRequired: true),
new PrimitiveFieldDefinition("total", FieldType.String, isRequired: true),
new PrimitiveFieldDefinition("date", FieldType.Date, isRequired: false),
});
var dataModel = new DataModel(schema);
2. Build the template
Everything below — match rule, extraction, optional transforms, publish — is assembled by TemplateBuilder. Each verb constrains its TStep/TConfig pair at compile time, and Build() rejects missing pieces before execution time:
using Docuoria.Configuration;
using Docuoria.MatchRules;
using Docuoria.Models;
using Docuoria.Pipeline.Extraction;
using Docuoria.Pipeline.Publish;
using Docuoria.Pipeline.Transformation;
var extractionConfig = new ExtractionStepConfiguration(new IFieldMapping[]
{
new FieldMapping("vendor", FieldType.String,
TextAnchorExtractionSource.Token(
region: new PdfBounds(x: 0, y: 0, width: 300, height: 100),
token: "Vendor:")),
new FieldMapping("total", FieldType.String,
TextPatternExtractionSource.Pattern(@"\$[\d,]+\.\d{2}")),
new FieldMapping("date", FieldType.Date,
MetadataFieldExtractionSource.Standard(MetadataField.CreationDate),
parseFormat: "MM/dd/yyyy"), // locale-aware coercion (optional)
});
var transformConfig = new TransformationStepConfiguration(new FieldTransform[]
{
new TrimTransform("vendor"),
new CastTransform("total", FieldType.Number),
new FormatTransform("date", "yyyy-MM-dd"),
});
var matchRuleConfig = new TextPatternMatchRuleConfiguration
{
Tokens = new[] { "INVOICE", "Amount Due" },
Mode = TextMatchMode.AllTokens,
Threshold = 0.8m,
};
var template = TemplateBuilder.Create("invoice-v1", dataModel)
.WithMatchRule<TextPatternMatchRule, TextPatternMatchRuleConfiguration>(matchRuleConfig)
.ExtractWith<ExtractionStep, ExtractionStepConfiguration>(extractionConfig)
.ThenTransform<TransformationStep, TransformationStepConfiguration>(transformConfig)
.PublishWith<PublishStep, PublishStepConfiguration>(new PublishStepConfiguration())
.Build();
Template's constructor is internal; the builder is the only way in. That is by design — it lets the type system catch the "wrong config for this step" mistake at the call site, not at runtime.
3. Decide whether the template applies
EvaluateMatchRuleAsync is the cheap probe. It returns a confidence in [0, 1] and a final decision (confidence ≥ threshold):
using Docuoria.Contracts;
var engine = serviceProvider.GetRequiredService<IDocuoriaEngine>();
await using var probeStream = File.OpenRead("invoice.pdf");
var match = await engine.EvaluateMatchRuleAsync(
probeStream,
template.RootMatchRule,
fileName: "invoice.pdf");
if (!match.IsMatch)
{
Console.WriteLine($"Skipping — confidence {match.Confidence:P0} below threshold.");
return;
}
4. Execute the template
ExecuteTemplateAsync<TGenerator, TOptions> runs the full pipeline and renders the output. The generator is chosen at compile time via the generic type parameter; the constraint where TGenerator : IOutputGenerator<TOptions> guarantees the options object you pass is exactly the shape the generator expects:
using Docuoria.Output.Csv;
using Docuoria.Results;
using System.Text;
await using var pdfStream = File.OpenRead("invoice.pdf");
ProcessingResult result = await engine.ExecuteTemplateAsync<CsvOutputGenerator, CsvGeneratorOptions>(
pdfStream,
template,
new CsvGeneratorOptions { Delimiter = ',' });
switch (result)
{
case SucceededResult success:
var csv = Encoding.UTF8.GetString(success.Output.Payload.Span);
Console.WriteLine(csv); // rendered output
Console.WriteLine(success.Output.ContentType); // e.g. "text/csv"
break;
case FailedResult failure:
// A step threw mid-pipeline.
Console.Error.WriteLine($"Step '{failure.StepIdentifier}' failed: {failure.ErrorMessage}");
break;
case RejectedResult rejection:
// The engine refused to run the request. Reason values:
// InvalidPdf · MalformedTemplate · UnknownOutputGenerator · GeneratorRejected
Console.Error.WriteLine($"Rejected: {rejection.Reason} — {rejection.Detail}");
break;
}
That is the whole loop: register, build a template, evaluate, execute, switch on the discriminated result.
Extracting repeating data
Many PDFs contain line items — table rows or repeating regex patterns — that don't fit a fixed‑field schema. Collection extraction handles this: declare a collection field in the data model, point a collection source at the repeating structure, and map sub‑fields within each element.
1. Describe the output shape with a collection field
var lineItemSchema = new RecordDefinition("LineItem", new FieldDefinition[]
{
new PrimitiveFieldDefinition("product", FieldType.String, isRequired: true),
new PrimitiveFieldDefinition("unitPrice", FieldType.Number, isRequired: false),
new PrimitiveFieldDefinition("quantity", FieldType.Integer, isRequired: false),
});
var schema = new RecordDefinition("Invoice", new FieldDefinition[]
{
new PrimitiveFieldDefinition("invoiceNumber", FieldType.String, isRequired: true),
new RecordFieldDefinition("lineItems", lineItemSchema, isCollection: true),
});
var dataModel = new DataModel(schema);
2. Build the template with repeating field mappings
Scalar fields use regular FieldMappings. The collection field gets a RepeatingFieldMapping that pairs a collection source with sub‑field mappings:
var extractionConfig = new ExtractionStepConfiguration(new IFieldMapping[]
{
new FieldMapping("invoiceNumber", FieldType.String,
TextPatternExtractionSource.Pattern(@"Invoice No[\s\S]*?(\d{10})")),
});
var repeating = new RepeatingFieldMapping(
collectionFieldName: "lineItems",
elementDefinition: lineItemSchema,
subFields: new SubFieldMapping[]
{
new HeaderSubFieldMapping("product", FieldType.String, "Product"),
new HeaderSubFieldMapping("unitPrice", FieldType.Number, "Unit Price"),
new HeaderSubFieldMapping("quantity", FieldType.Integer, "Qty"),
},
source: TableRowsExtractionSource.ByHeader());
var template = TemplateBuilder.Create("invoice-with-lines", dataModel)
.WithMatchRule<TextPatternMatchRule, TextPatternMatchRuleConfiguration>(matchRuleConfig)
.ExtractWith<ExtractionStep, ExtractionStepConfiguration>(extractionConfig)
.WithRepeatingMapping(repeating)
.PublishWith<PublishStep, PublishStepConfiguration>(new PublishStepConfiguration())
.Build();
Build() validates each repeating mapping against the DataModel at template construction time: the target field must exist, must be a collection, and its element record shape must match the mapping's elementDefinition — including nested records compared by name and field structure.
3. Execute — CSV flattens one row per element
The rest of the loop is unchanged — EvaluateMatchRuleAsync, then ExecuteTemplateAsync. The CSV generator flattens the result: one output row per line item, with the scalar invoiceNumber repeated on every row:
invoiceNumber,lineItems.product,lineItems.unitPrice,lineItems.quantity
6297020453,ThinkPad X1,1299.00,2
6297020453,USB-C Dock,189.00,1
6297020453,Warranty 3Y,149.00,2
Empty collections produce a header row with zero data rows.
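The flattening rule can be illustrated without the engine. The sketch below is an assumption-laden stand-in for the CSV generator, not its implementation: FlattenToCsv is a hypothetical helper, values are plain strings, and no CSV escaping is applied.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flattener: scalars are repeated on every row,
// collection columns are prefixed with the collection field name.
static List<string> FlattenToCsv(
    (string Name, string Value)[] scalars,
    string collectionName,
    string[] elementFields,
    string[][] elements)
{
    var header = scalars.Select(s => s.Name)
        .Concat(elementFields.Select(f => $"{collectionName}.{f}"));
    var lines = new List<string> { string.Join(",", header) };

    // One output row per element; an empty collection leaves only the header.
    foreach (var element in elements)
        lines.Add(string.Join(",", scalars.Select(s => s.Value).Concat(element)));
    return lines;
}

var scalars = new[] { (Name: "invoiceNumber", Value: "6297020453") };
var fields = new[] { "product", "unitPrice", "quantity" };
var items = new[]
{
    new[] { "ThinkPad X1", "1299.00", "2" },
    new[] { "USB-C Dock", "189.00", "1" },
};

foreach (var line in FlattenToCsv(scalars, "lineItems", fields, items))
    Console.WriteLine(line);
```

Passing an empty elements array returns a single line, the header, which matches the empty-collection behavior described above.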
The same pattern works with regex‑based collection sources. To extract repeating text patterns instead of table rows, swap the source and sub‑field type:
var repeating = new RepeatingFieldMapping(
collectionFieldName: "matches",
elementDefinition: matchSchema,
subFields: new SubFieldMapping[]
{
new NamedGroupSubFieldMapping("code", FieldType.String, groupName: "code"),
new NamedGroupSubFieldMapping("amount", FieldType.Number, groupName: "amt"),
},
source: TextPatternExtractionSource.AllMatches(
@"(?<code>\w{10})\s+\$(?<amt>\d+\.\d{2})",
startAnchor: "LINE ITEMS",
endAnchor: "SUBTOTAL"));
NamedGroupSubFieldMapping addresses capture groups by name — more readable and resilient to pattern edits than ordinal indices. AllMatches accepts optional startAnchor/endAnchor sentinels to restrict matching to a region of the document (the text between the first occurrence of each).
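To see what anchored, named-group matching does to a haystack, here is a rough stand-alone sketch using only System.Text.RegularExpressions. AllMatchesBetween is a hypothetical helper, not the engine's code; it assumes the end anchor is searched for after the start anchor and that a missing anchor leaves that side of the region unbounded.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical stand-in for anchored AllMatches with named groups "code" and "amt".
static List<(string Code, decimal Amount)> AllMatchesBetween(
    string haystack, string pattern, string? startAnchor = null, string? endAnchor = null)
{
    // Region starts right after the first occurrence of the start anchor.
    int start = 0;
    if (startAnchor is not null)
    {
        int i = haystack.IndexOf(startAnchor, StringComparison.Ordinal);
        if (i >= 0) start = i + startAnchor.Length;
    }

    // Region ends at the first end anchor found after the start (assumption).
    int end = haystack.Length;
    if (endAnchor is not null)
    {
        int j = haystack.IndexOf(endAnchor, start, StringComparison.Ordinal);
        if (j >= 0) end = j;
    }

    var items = new List<(string Code, decimal Amount)>();
    foreach (Match m in Regex.Matches(haystack.Substring(start, end - start), pattern))
        items.Add((m.Groups["code"].Value,
                   decimal.Parse(m.Groups["amt"].Value, CultureInfo.InvariantCulture)));
    return items;
}

string text = "HEADER ABCDEFGHIJ $1.00\nLINE ITEMS\nAAAAAAAAAA $12.50\n" +
              "BBBBBBBBBB $3.25\nSUBTOTAL\nCCCCCCCCCC $9.99";
var matches = AllMatchesBetween(
    text, @"(?<code>\w{10})\s+\$(?<amt>\d+\.\d{2})", "LINE ITEMS", "SUBTOTAL");

foreach (var (code, amount) in matches)
    Console.WriteLine($"{code}: {amount.ToString(CultureInfo.InvariantCulture)}");
```

Only the two rows between LINE ITEMS and SUBTOTAL are returned; the look-alike rows before and after the region are ignored.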
When a primary extraction source might not match every PDF variant, wrap it with a fallback chain:
new FieldMapping("total", FieldType.Number,
new FallbackExtractionSource(
primary: TextPatternExtractionSource.Pattern(@"Grand Total:\s*\$(\d+\.\d{2})"),
fallback: TextPatternExtractionSource.Pattern(@"Amount Due:\s*\$(\d+\.\d{2})")));
FallbackExtractionSource tries the primary source first; if it returns null, the fallback is attempted. Fallbacks compose to arbitrary depth.
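The try/else behavior is easy to picture as function composition. The sketch below models a source as a Func<string, string?> over the text haystack; Pattern and Fallback here are hypothetical local helpers that mirror the described semantics, not the library types.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

// A "source" is modeled as: haystack in, extracted value (or null) out.
static Func<string, string?> Pattern(string regex) => text =>
{
    var m = Regex.Match(text, regex);
    if (!m.Success) return null;
    // Prefer the first capture group when the pattern defines one.
    return m.Groups.Count > 1 ? m.Groups[1].Value : m.Value;
};

// Try the primary; only if it yields null, try the fallback.
static Func<string, string?> Fallback(Func<string, string?> primary, Func<string, string?> fallback) =>
    text => primary(text) ?? fallback(text);

var total = Fallback(
    Pattern(@"Grand Total:\s*\$(\d+\.\d{2})"),
    Fallback(
        Pattern(@"Amount Due:\s*\$(\d+\.\d{2})"),
        Pattern(@"Balance:\s*\$(\d+\.\d{2})")));   // nesting gives arbitrary depth

Console.WriteLine(total("Grand Total: $10.00 Amount Due: $99.00")); // 10.00
Console.WriteLine(total("Amount Due: $42.00"));                     // 42.00
Console.WriteLine(total("Balance: $7.50"));                         // 7.50
Console.WriteLine(total("no amounts here") is null);                // True
```

Because a fallback is itself a source, it can appear in either position of another fallback, which is what makes the chain composable.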
Composing match rules
Real‑world templates rarely match on a single signal. The builder exposes a nested composite sub‑builder so you can compose AND/OR/NOT trees without leaving the fluent surface:
using Docuoria.MatchRules;
var template = TemplateBuilder.Create("invoice-v2", dataModel)
.WithCompositeMatchRule(CompositeOperator.And, root => root
.Add<FileNameMatchRule, FileNameMatchRuleConfiguration>(
new FileNameMatchRuleConfiguration
{
Pattern = "**/invoices/**/*.pdf",
Mode = PatternMode.Glob,
Threshold = 1m,
})
.AddComposite(CompositeOperator.Or, anyOf => anyOf
.Add<TextPatternMatchRule, TextPatternMatchRuleConfiguration>(
new TextPatternMatchRuleConfiguration
{
Tokens = new[] { "INVOICE" }, Mode = TextMatchMode.AnyToken, Threshold = 1m,
})
.Add<TextPatternMatchRule, TextPatternMatchRuleConfiguration>(
new TextPatternMatchRuleConfiguration
{
Tokens = new[] { "Amount Due" }, Mode = TextMatchMode.AnyToken, Threshold = 1m,
})))
.ExtractWith<ExtractionStep, ExtractionStepConfiguration>(extractionConfig)
.PublishWith<PublishStep, PublishStepConfiguration>(new PublishStepConfiguration())
.Build();
Add<TRule, TConfig>(config, weight) adds a leaf rule with an optional aggregation weight; AddComposite(op, configure, weight) adds a nested grouping. The composite rule validates that Not has exactly one child at build time.
If you need to construct a reference outside the builder (for example to probe an ad‑hoc rule with EvaluateMatchRuleAsync), use the matching public factory:
using Docuoria.MatchRules;
IMatchRuleReference rule = MatchRuleReference.Create<TextPatternMatchRule, TextPatternMatchRuleConfiguration>(
new TextPatternMatchRuleConfiguration { Tokens = new[] { "INVOICE" }, Threshold = 0.9m });
The same pattern exists for steps and retrieval providers via StepReference.Extraction<>(), StepReference.Transformation<>(), StepReference.Retrieval<>(), StepReference.Publish<>(), and StepReference.RetrievalProvider<>(). The builder uses these factories internally — you only reach for them directly when you need a reference outside the template assembly path.
Reference
The engine surface
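The core contract is two methods on IDocuoriaEngine (the InspectAsync, TestPatternAsync, TestGroupsAsync, and DryRunAsync authoring helpers are covered in their own sections):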
ValueTask<MatchResult> EvaluateMatchRuleAsync(
Stream pdfStream,
IMatchRuleReference ruleReference,
string? fileName = null,
CancellationToken cancellationToken = default);
ValueTask<ProcessingResult> ExecuteTemplateAsync<TGenerator, TOptions>(
Stream pdfStream,
Template template,
TOptions options,
bool diagnostics = false,
CancellationToken cancellationToken = default)
where TGenerator : IOutputGenerator<TOptions>
where TOptions : IGeneratorOptions;
Notes on the contract:
- The engine opens and disposes the stream you pass; you retain ownership of the underlying file or buffer.
- Execute returns one of SucceededResult, FailedResult, or RejectedResult for domain outcomes. Null/empty arguments throw ArgumentNullException/ArgumentException; those are programmer errors, not results.
- RejectedResult carries an optional Detail string with a human‑readable explanation when the rejection originates from an output generator.
- Cancellation propagates as OperationCanceledException; it is never folded into a rejected result.
- RejectionReason.UnknownOutputGenerator is returned only when TGenerator was never registered. The options‑mismatch case is impossible by construction: the generic constraints forbid it.
- Pass diagnostics: true to attach an ExtractionDiagnostics snapshot to the result (see Extraction diagnostics below). Zero overhead when disabled.
Built‑in match rules
| Rule | What it scores | Typical use |
|---|---|---|
| FileNameMatchRule | Glob or substring match on the supplied file name | Folder‑driven routing |
| MetadataMatchRule | Standard metadata fields (author, title, subject, keywords) | Vendor‑authored PDFs |
| TextPatternMatchRule | Token / regex hits across all pages | Document‑class detection |
| TextAnchorMatchRule | Text presence and location relative to an anchor | Form‑style layouts |
| PageGeometryMatchRule | Page count, dimensions, orientation, aspect ratio | Letter vs A4, statements vs reports |
| TableMatchRule | Table structure (rows, columns, cell content) | Tabular reports |
| CompositeMatchRule | Aggregates child rules under AND / OR / NOT | Layered matching |
Every leaf rule produces a confidence in [0, 1] and accepts an author‑supplied threshold; the rule fires when confidence ≥ threshold. Composites combine child confidences (weighted average for And, weighted max for Or, 1 − child for Not).
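The aggregation rules can be worked through numerically. The sketch below is one possible reading, not the engine's implementation: And, Or, Not, and Fires are hypothetical helpers, and "weighted max" is assumed to mean the largest confidence after scaling each child by its weight relative to the heaviest child.

```csharp
using System;
using System.Linq;

// And: weighted average of child confidences (requires at least one child).
static decimal And(params (decimal Confidence, decimal Weight)[] children) =>
    children.Sum(c => c.Confidence * c.Weight) / children.Sum(c => c.Weight);

// Or: weighted max. ASSUMPTION: each confidence is scaled by weight / maxWeight
// so the result stays in [0, 1]; the engine may define this differently.
static decimal Or(params (decimal Confidence, decimal Weight)[] children)
{
    decimal maxWeight = children.Max(c => c.Weight);
    return children.Max(c => c.Confidence * c.Weight / maxWeight);
}

// Not: complement of its single child.
static decimal Not(decimal child) => 1m - child;

// A rule fires when confidence >= threshold.
static bool Fires(decimal confidence, decimal threshold) => confidence >= threshold;

// File name matched (1.0); of two text signals, only one matched.
decimal textSignals = Or((0.0m, 1m), (1.0m, 1m));
decimal root = And((1.0m, 1m), (textSignals, 1m));
Console.WriteLine(Fires(root, 1m));                 // True

// A heavier, weaker child drags the average down: (1*1 + 0.5*3) / 4 = 0.625.
decimal skewed = And((1.0m, 1m), (0.5m, 3m));
Console.WriteLine(skewed == 0.625m);                // True
Console.WriteLine(Fires(skewed, 0.8m));             // False
Console.WriteLine(Not(0.25m) == 0.75m);             // True
```

Weights only matter relative to their siblings, so leaving every weight at the same value reduces And to a plain average and Or to a plain maximum.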
Pipeline steps
Every template has exactly one extraction step at the start and one publish step at the end. Between them, you can chain zero or more intermediate steps in declared order.
| Step | Position | Role |
|---|---|---|
| ExtractionStep | Anchor (start) | Seeds the initial DataRecord from the PDF via extraction sources |
| TransformationStep | Intermediate | Applies a declared sequence of field transforms |
| RetrievalStep | Intermediate | Calls a registered IRetrievalProvider<TConfig> to enrich the record |
| PublishStep | Anchor (end) | Validates against the DataModel and seals the output |
A Python step is planned alongside the existing intermediate kinds; the pythonnet dependency is already in place.
Extraction sources
Extraction sources are the where — they tell the extraction step where in the PDF each field comes from. Sources come in two flavors: scalar sources that produce a single value per field, and collection sources that produce one record per row or match.
Scalar sources
Each scalar source exposes typed static factories so the call site reads like prose:
| Source | Factories |
|---|---|
| TextPatternExtractionSource | .Token(token, pageNumber?, caseSensitive?, blockSeparator?), .Pattern(regex, pageNumber?, caseSensitive?, matchTimeout?, blockSeparator?) |
| TextAnchorExtractionSource | .Token(region, token, pageNumber?, caseSensitive?), .Pattern(region, regex, pageNumber?, caseSensitive?) |
| MetadataFieldExtractionSource | .Standard(MetadataField), .Raw(rawKey) |
| TableCellExtractionSource | .Ordinal(rowIndex, columnIndex, tableIndex?, pageNumber?), .ByHeader(rowIndex, headerToken, …) |
| FallbackExtractionSource | new FallbackExtractionSource(primary, fallback), a composable try/else chain |
Anchor regions use PdfBounds(x, y, width, height) in PDF points (top‑left origin, 1/72 inch units, rotation‑normalized).
matchTimeout guards against ReDoS on untrusted regex patterns (defaults to infinite — v1.1 behavior preserved). blockSeparator controls how text blocks are joined into the search haystack (defaults to "\n").
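Both knobs map onto standard .NET behavior, which the following sketch illustrates outside the engine. FirstMatch is a hypothetical helper; treating a timeout as "no match" is an assumption of this sketch, not documented engine behavior.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

// Join text blocks into one haystack, then match with a bounded regex timeout.
static string? FirstMatch(
    string[] blocks, string pattern, TimeSpan matchTimeout, string blockSeparator = "\n")
{
    string haystack = string.Join(blockSeparator, blocks);
    try
    {
        // Regex.Match throws RegexMatchTimeoutException when the timeout elapses.
        var m = Regex.Match(haystack, pattern, RegexOptions.None, matchTimeout);
        return m.Success ? m.Value : null;
    }
    catch (RegexMatchTimeoutException)
    {
        return null; // assumption: a timed-out pattern is reported as no match
    }
}

var blocks = new[] { "Total", "$12.50" };
var timeout = TimeSpan.FromSeconds(1);
string acrossLines = @"Total\n\$[\d,]+\.\d{2}";

// With the default "\n" separator the pattern can span two blocks.
Console.WriteLine(FirstMatch(blocks, acrossLines, timeout) is not null);                      // True
// With a space separator the same pattern no longer matches.
Console.WriteLine(FirstMatch(blocks, acrossLines, timeout, blockSeparator: " ") is not null); // False
```

Passing Regex.InfiniteMatchTimeout reproduces the unbounded default described above; any finite TimeSpan caps how long a pathological pattern can run.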
Collection sources
Collection sources iterate a repeating structure in the PDF and yield one record per element. They are paired with a RepeatingFieldMapping that describes the sub‑field addressing within each element:
| Source | Factories | Sub‑field type |
|---|---|---|
| TableRowsExtractionSource | .ByHeader(tableIndex?, pageNumber?, headerRowIndex?, caseSensitiveHeader?), .Ordinal(tableIndex?, pageNumber?, skipRows?) | HeaderSubFieldMapping or OrdinalSubFieldMapping |
| TextPatternExtractionSource | .AllMatches(regex, pageNumber?, caseSensitive?, matchTimeout?, blockSeparator?, startAnchor?, endAnchor?) | RegexGroupSubFieldMapping or NamedGroupSubFieldMapping |
TableRowsExtractionSource.ByHeader resolves sub‑field names against a header row; Ordinal addresses columns by zero‑based index. AllMatches produces one record per non‑overlapping regex match, with capture groups projected through sub‑field mappings. Optional startAnchor/endAnchor sentinels restrict matching to the text region between the first occurrence of each.
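Header-based addressing amounts to a column lookup per sub‑field. The sketch below shows the idea on a plain string table; RowsByHeader is a hypothetical helper, and skipping sub‑fields whose header is not found is an assumption of this sketch rather than documented behavior.

```csharp
using System;
using System.Collections.Generic;

// Resolve each sub-field's header token to a column, then project every body row.
static List<Dictionary<string, string>> RowsByHeader(
    string[][] table,
    (string Field, string HeaderToken)[] subFields,
    int headerRowIndex = 0,
    bool caseSensitiveHeader = false)
{
    var comparison = caseSensitiveHeader
        ? StringComparison.Ordinal
        : StringComparison.OrdinalIgnoreCase;
    string[] header = table[headerRowIndex];
    var records = new List<Dictionary<string, string>>();

    for (int r = headerRowIndex + 1; r < table.Length; r++)
    {
        var record = new Dictionary<string, string>();
        foreach (var (field, token) in subFields)
        {
            int column = Array.FindIndex(header, h => string.Equals(h.Trim(), token, comparison));
            if (column >= 0 && column < table[r].Length)
                record[field] = table[r][column];   // unknown headers are skipped (assumption)
        }
        records.Add(record);
    }
    return records;
}

var table = new[]
{
    new[] { "Product", "Unit Price", "Qty" },
    new[] { "ThinkPad X1", "1299.00", "2" },
    new[] { "USB-C Dock", "189.00", "1" },
};
var records = RowsByHeader(table, new[] { ("product", "Product"), ("quantity", "QTY") });

foreach (var record in records)
    Console.WriteLine($"{record["product"]} x {record["quantity"]}");
```

Ordinal addressing would replace the header lookup with a fixed zero‑based column index per sub‑field.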
Sub‑field mappings describe how to locate each field within a single element:
| Sub‑field mapping | Addressing |
|---|---|
| HeaderSubFieldMapping(fieldName, fieldType, headerToken, caseSensitive?) | Table column by header text |
| OrdinalSubFieldMapping(fieldName, fieldType, columnIndex) | Table column by zero‑based index |
| RegexGroupSubFieldMapping(fieldName, fieldType, groupIndex) | Regex capture group by index (≥ 1) |
| NamedGroupSubFieldMapping(fieldName, fieldType, groupName) | Regex capture group by name |
Transforms
Transforms are the how — each one rewrites one or more fields in declared order. They run inside a TransformationStep configured with an ordered FieldTransform[]:
| Transform | Example |
|---|---|
| TrimTransform | new TrimTransform("vendor") |
| CastTransform | new CastTransform("total", FieldType.Number) |
| FormatTransform | new FormatTransform("date", "yyyy-MM-dd") |
| RenameTransform | new RenameTransform("amt", "amount") |
| ComputeTransform | new ComputeTransform("tax", ComputeOperator.Multiply, new[] { "subtotal", "rate" }) |
| CollectionElementTransform | new CollectionElementTransform("lineItems", new FieldTransform[] { new TrimTransform("product") }) |
CollectionElementTransform applies a sequence of inner transforms to each element's record within a named collection field — useful for trimming, casting, or renaming fields inside repeating data without writing per‑element boilerplate.
Data model
The schema vocabulary used by the publish step:
- Primitive types: String, Number, Integer, Boolean, Date, Timestamp
- Records: composite named‑field values via RecordDefinition; nestable through RecordFieldDefinition
- Collections: ordered, repeated values (isCollection: true on any field definition)
- Optionality: each field declares isRequired; enforced by PublishStep
Output generators
An output generator renders a sealed DataModelInstance into a concrete format:
| Generator | Options | Content‑type | Collection handling |
|---|---|---|---|
| CsvOutputGenerator | CsvGeneratorOptions { Delimiter } | text/csv | Flattened: one row per element, scalars repeated. Multiple independent collections rejected. |
| JsonOutputGenerator | JsonGeneratorOptions { Indented, OmitNulls } | application/json | Natural: arrays at any depth, recursive record nesting. No single‑collection restriction. |
Both plug in through AddOutputGenerator<TGen, TOptions> and consume the same generic ExecuteTemplateAsync<,> overload. XML is on the roadmap.
Extraction diagnostics
Template authoring is iterative: you need to see what the engine "sees" before you can write correct field mappings. Pass diagnostics: true to get a zero‑allocation‑when‑disabled snapshot of the extraction internals:
using Docuoria.Diagnostics;
var result = await engine.ExecuteTemplateAsync<JsonOutputGenerator, JsonGeneratorOptions>(
pdfStream, template, new JsonGeneratorOptions(),
diagnostics: true);
if (result is SucceededResult success && success.Diagnostics is { } diag)
{
// The flattened text haystack the engine matched against:
Console.WriteLine(diag.Haystack);
// Per-mapping trace — did each field match? Where?
foreach (var trace in diag.MappingTraces)
{
Console.WriteLine($"{trace.FieldName}: matched={trace.Matched}, text={trace.MatchedText}");
if (trace.MatchIndex is not null)
Console.WriteLine($" offset={trace.MatchIndex} len={trace.MatchLength}");
if (trace.NamedGroups is { Count: > 0 } groups)
Console.WriteLine($" groups={string.Join(", ", groups.Select(g => $"{g.Key}={g.Value}"))}");
}
// Raw block inventory with bounding boxes (PDF points):
foreach (var block in diag.Blocks)
Console.WriteLine($" p{block.PageNumber}: [{block.X},{block.Y} {block.Width}×{block.Height}] {block.Content}");
}
You can also inspect the engine's text haystack directly without executing a template:
var haystack = TextSearch.ExtractText(pdfDocument);
TextSearch lives in Docuoria.Diagnostics and accepts optional pageNumber and blockSeparator parameters.
Dry-run for debugging
DryRunAsync executes a template's extraction + intermediate stages against a PDF and returns the projected record without running the publish step. Use it for template authoring, integration smoke tests, and field-level failure diagnosis — no output sink is required.
using var pdf = File.OpenRead("invoice.pdf");
var result = await engine.DryRunAsync(pdf, template, new DryRunOptions
{
Diagnostics = true, // collect MappingTrace per field (default true)
IncludeRawHaystack = false, // include extracted PDF text (opt-in; can be large)
PageFilter = null, // optional page subset
});
switch (result)
{
case DryRunSucceeded ok:
// ok.JsonProjection: IReadOnlyDictionary<string, object?>
// ok.Diagnostics: IReadOnlyList<MappingTrace>? (null when Diagnostics=false)
// ok.RawHaystack: string? (null unless IncludeRawHaystack=true)
break;
case DryRunFailed fail:
// fail.Step (Retrieval/Extraction/Transformation/Publish/Unknown)
// fail.FieldPath, fail.SourceText (≤256 chars, …-truncated),
// fail.TargetTypeName, fail.InnerDetail
break;
case DryRunRejected rej:
// rej.Reason: InvalidPdf | MalformedTemplate | ...
break;
}
The same enrichment fields are now also present on FailedResult returned by ExecuteTemplateAsync — when a coercion fails, Step, FieldPath, SourceText, TargetTypeName, and InnerDetail are populated so callers can pinpoint the offending field without parsing exception messages.
Template storage
ITemplateStoreProvider (under Docuoria.Storage) abstracts how templates persist — SaveAsync / LoadAsync / ListAsync / DeleteAsync. The bundled LocalFileTemplateStoreProvider writes each template as {identifier}.json under a root directory using atomic temp-file + rename. ApiTemplateStoreProvider is the HTTP transport (see Hosted Template Store API below) wired through the same DI surface.
// Local filesystem provider
services.AddDocuoriaEngine(builder => builder.AddLocalTemplateStore("./templates"));
// HTTP provider (talks to Docuoria.Api)
services.AddDocuoriaEngine(builder =>
builder.AddApiTemplateStore(
new Uri("https://api.example.com/"),
new ApiTemplateStoreCredentials { FunctionKey = "..." }));
Calling AddLocalTemplateStore and AddApiTemplateStore on the same builder replaces any previously registered ITemplateStoreProvider (last call wins).
Identifier safety. Identifiers must match [A-Za-z0-9_-]+ and be ≤ 200 characters. Path-traversal attempts (.., /, \, :) are rejected with InvalidTemplateIdentifierException before any path math runs.
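The identifier rule can be expressed as a single predicate. This is a minimal sketch of the stated constraints only; IsSafeIdentifier is a hypothetical name, and the real provider signals violations with InvalidTemplateIdentifierException rather than a boolean.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

// True when the identifier is 1..200 characters drawn from [A-Za-z0-9_-].
// \A and \z anchor the whole string, so a trailing newline cannot slip through.
static bool IsSafeIdentifier(string? identifier) =>
    identifier is { Length: > 0 and <= 200 }
    && Regex.IsMatch(identifier, @"\A[A-Za-z0-9_-]+\z");

Console.WriteLine(IsSafeIdentifier("invoice-v1"));          // True
Console.WriteLine(IsSafeIdentifier("../secrets"));          // False
Console.WriteLine(IsSafeIdentifier(@"a\b"));                // False
Console.WriteLine(IsSafeIdentifier("c:evil"));              // False
Console.WriteLine(IsSafeIdentifier(new string('a', 201)));  // False
```

Because the allowed alphabet excludes '.', '/', '\' and ':', every path-traversal form listed above fails the same check before any path is built.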
Round-trip contract. Save → Load → ToJson is byte-identical to the original ToJson (UTF-8, no BOM). Missing identifiers surface as null from LoadAsync and false from DeleteAsync — there is intentionally no TemplateNotFoundException.
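The temp-file + rename technique and the no-BOM encoding can be sketched with standard System.IO calls. SaveAtomically is a hypothetical helper, not the provider's code; the temp-file naming scheme and the sample JSON are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Write to a temp file in the same directory, then rename over the target,
// so readers never observe a half-written {identifier}.json.
static void SaveAtomically(string rootDirectory, string identifier, string json)
{
    Directory.CreateDirectory(rootDirectory);
    string finalPath = Path.Combine(rootDirectory, identifier + ".json");
    string tempPath = Path.Combine(
        rootDirectory, identifier + "." + Guid.NewGuid().ToString("N") + ".tmp");

    // UTF-8 without a byte-order mark keeps the stored bytes equal to the JSON bytes.
    File.WriteAllText(tempPath, json, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));
    File.Move(tempPath, finalPath, overwrite: true);
}

string root = Path.Combine(Path.GetTempPath(), "docuoria-sketch-" + Guid.NewGuid().ToString("N"));
string json = "{\"identifier\":\"invoice-v1\"}";   // placeholder payload, not the real template schema

SaveAtomically(root, "invoice-v1", json);
byte[] stored = File.ReadAllBytes(Path.Combine(root, "invoice-v1.json"));

Console.WriteLine(stored.SequenceEqual(Encoding.UTF8.GetBytes(json)));  // True
Directory.Delete(root, recursive: true);
```

Keeping the temp file in the target directory matters: a rename within one volume can replace the destination in a single step, whereas a move across volumes falls back to copy and delete.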
Hosted Template Store API
Docuoria.Api (under src/hosts/) is an Azure Functions isolated-worker host that exposes the same ITemplateStoreProvider surface over HTTP. It is the production transport for ApiTemplateStoreProvider; teams that need a shared template catalog point their SDK at the host instead of the local file provider.
Run locally
cd src/hosts/Docuoria.Api
Copy-Item local.settings.json.template local.settings.json
# Edit local.settings.json: set TemplateStore__RootPath to an absolute path.
func host start
The host listens on http://localhost:7071/ by default. On Azure, set TemplateStore__RootPath = D:\home\site\templates (mounted persistent storage).
Endpoints
| Method | Route | Auth | Success | Problem types (RFC 7807) |
|---|---|---|---|---|
| GET | /api/health | anon | 200 ok | none |
| POST | /api/templates | function | 201 + Location | 400 template-validation-failed, 409 template-already-exists, 415, 500 |
| GET | /api/templates | function | 200 {items:[]} | 500 internal-error |
| GET | /api/templates/{id} | function | 200 JSON | 400 invalid-identifier, 404 template-not-found |
| PUT | /api/templates/{id} | function | 200 / 201 | 400 invalid-identifier, 415, 500 |
| DELETE | /api/templates/{id} | function | 204 | 400 invalid-identifier, 404 template-not-found |
All success responses send Cache-Control: no-store; all problem responses are application/problem+json with the same Cache-Control: no-store.
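A client that calls the host directly can read the standard RFC 7807 members out of a problem body as sketched here. ReadProblem is a hypothetical helper, and the exact form of the type value the host emits (short name or full URI) is an assumption, so compare it against what your deployment actually returns.

```csharp
using System;
using System.Text.Json;

// Hypothetical reader for an application/problem+json body.
// Members that are absent or of the wrong JSON kind come back as null.
static (string? Type, string? Title, int? Status, string? Detail) ReadProblem(string body)
{
    using JsonDocument doc = JsonDocument.Parse(body);
    JsonElement root = doc.RootElement;

    string? Text(string name) =>
        root.TryGetProperty(name, out JsonElement e) && e.ValueKind == JsonValueKind.String
            ? e.GetString()
            : null;

    int? status =
        root.TryGetProperty("status", out JsonElement s)
        && s.ValueKind == JsonValueKind.Number
        && s.TryGetInt32(out int code)
            ? (int?)code
            : null;

    return (Text("type"), Text("title"), status, Text("detail"));
}
```

Branching on the parsed type and status keeps client code independent of the human-readable title and detail text.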
curl examples
KEY="<function-key>"
BASE="http://localhost:7071"
curl -sS "$BASE/api/health"
curl -sS -H "x-functions-key: $KEY" -H "Content-Type: application/json" \
-d @template.json -X POST "$BASE/api/templates"
curl -sS -H "x-functions-key: $KEY" "$BASE/api/templates"
curl -sS -H "x-functions-key: $KEY" "$BASE/api/templates/my-template"
curl -sS -H "x-functions-key: $KEY" -H "Content-Type: application/json" \
-d @template.json -X PUT "$BASE/api/templates/my-template"
curl -sS -H "x-functions-key: $KEY" -X DELETE "$BASE/api/templates/my-template"
Point the SDK at the API
services.AddDocuoriaEngine(b => b.AddApiTemplateStore(
new Uri("http://localhost:7071/"),
new ApiTemplateStoreCredentials { FunctionKey = "<KEY>" }));
Credential precedence — exactly one header is sent per request:
- FunctionKey → x-functions-key
- ApiKey → X-Api-Key
- BearerToken → Authorization: Bearer <value>
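The "exactly one header" rule can be pictured as a first-match selection. The sketch below assumes the precedence follows the order in which the credentials are listed above (FunctionKey, then ApiKey, then BearerToken); SelectAuthHeader is a hypothetical function, not the SDK's own code.

```csharp
using System;

// Hypothetical selector: returns the single auth header to send, or null when no credential is set.
// Assumed precedence: FunctionKey, then ApiKey, then BearerToken.
static (string Name, string Value)? SelectAuthHeader(string? functionKey, string? apiKey, string? bearerToken)
{
    if (!string.IsNullOrEmpty(functionKey)) return ("x-functions-key", functionKey);
    if (!string.IsNullOrEmpty(apiKey)) return ("X-Api-Key", apiKey);
    if (!string.IsNullOrEmpty(bearerToken)) return ("Authorization", "Bearer " + bearerToken);
    return null;
}
```

Under this reading, setting both FunctionKey and BearerToken sends only x-functions-key.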
Privacy
Only template JSON crosses the wire. The host never accepts, stores, or processes PDF bytes. Any request body other than application/json is rejected with 415 template-validation-failed. This invariant is asserted by PrivacyInvariantTests, which reflects over every [OpenApiOperation]-decorated method and fails the build if a future endpoint declares an application/pdf, application/octet-stream, or multipart/form-data body.
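The content-type gate described above could look something like the following. This is a sketch under assumptions: IsJsonContentType is a hypothetical helper, and the host's real check is not shown in this README.

```csharp
using System;
using System.Net.Http.Headers;

// Hypothetical gate: true only when the Content-Type header's media type is application/json.
// Parameters such as "; charset=utf-8" are ignored; a missing or malformed header is rejected.
static bool IsJsonContentType(string? contentType) =>
    MediaTypeHeaderValue.TryParse(contentType, out var parsed)
    && string.Equals(parsed.MediaType, "application/json", StringComparison.OrdinalIgnoreCase);
```

A request that fails this check would map to the 415 response; application/pdf, application/octet-stream, and multipart/form-data all fall on the rejected side.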
Design principles
- Deterministic — same inputs produce the same outputs, every time.
- Stateless — the engine holds no per‑invocation state; resolve as a singleton, invoke concurrently.
- Immutable pipeline state — each step returns a new record; nothing in the pipeline is mutated in place.
- Typed at the call site — step ↔ config and generator ↔ options pairings are enforced by generic constraints. Mismatches don't compile.
- No PDF library leakage — PdfPig types never appear in public contracts.
- Fixed by name, open by contract — component types are enumerated by name, but the contracts admit new rules, steps, and generators without engine changes.
Repository layout
src/libs/Docuoria/ Engine library (this README's subject)
src/libs/Docuoria.dotnet/ Layer-3 .NET client SDK (placeholder)
src/libs/Docuoria.nodejs/ Layer-3 Node.js client SDK (placeholder)
src/hosts/Docuoria.Api/ Layer-3 REST host (placeholder)
src/hosts/Docuoria.Portal/ Layer-3 portal (placeholder)
tests/Docuoria.Tests/ Unit + integration tests for the engine
specs/ Product spec and deferred technical details
.planning/ Phase plans, decisions, verification records
Roadmap
v1.3 (current) — Usability enhancements: JSON output generator, extraction diagnostics (opt‑in haystack + per‑mapping traces + block inventory), FallbackExtractionSource, NamedGroupSubFieldMapping, CollectionElementTransform, anchor‑scoped AllMatches (start/end sentinels), configurable blockSeparator, matchTimeout for ReDoS safety, locale‑aware coercion (parseFormat / cultureName on FieldMapping), rejection detail on RejectedResult.
v1.2 — Collection extraction: TableRowsExtractionSource and TextPatternExtractionSource.AllMatches for repeating data, RepeatingFieldMapping with typed sub‑field addressing (HeaderSubFieldMapping, OrdinalSubFieldMapping, RegexGroupSubFieldMapping), TemplateBuilder.WithRepeatingMapping() with build‑time schema validation, and CSV collection flattening (one row per element, scalar fields repeated).
v1.1 — Layer 1 engine with compile‑time typed step / rule references, fluent TemplateBuilder, generic ExecuteTemplateAsync<TGenerator, TOptions> overload, and convenience registration helpers. All seven built‑in match rules, all four step kinds, CSV output, HTTP retrieval provider.
Future milestones:
- Layer 2 — Service layer (accounts, template stores, submission lifecycle, template resolution).
- Layer 3 — REST API, .NET client SDK, Node.js client SDK, web portal.
- Additional match rules (Font Fingerprint, Structural, Image Profile, Embedded Content, LLM Rule).
- Additional output generators (XML).
- LLM‑powered template generation.
Agent Scripts
The scripts/ directory packages each engine verb (inspect,
test-pattern, test-groups, validate-template, dry-run, execute,
evaluate-match, classify, list-templates, load-template, save-template)
as a dotnet-script CLI with a
deterministic JSON contract — single-line JSON on stdout for success, structured
{ "error": { "code", "message", "detail" } } on stderr with non-zero exit. The
suite is designed for LLM agents and CI pipelines that need a typed, stateless
surface over the SDK.
dotnet tool install -g dotnet-script
dotnet script scripts/classify.csx -- --pdf path\to\file.pdf
Template-store backed scripts accept --store-path, --store-url, and
--store-key flags to configure the template store backend. See
scripts/README.md for per-script arguments, output schemas,
exit codes, and worked examples.
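A caller that shells out to one of these scripts might interpret the result as sketched below. InterpretScriptResult is a hypothetical helper, and the error code values a script emits are defined in scripts/README.md, not here.

```csharp
using System;
using System.Text.Json;

// Hypothetical interpreter for one script run, following the contract described above:
// exit code 0 means stdout holds single-line JSON; otherwise stderr holds { "error": { ... } }.
static (bool Ok, string Payload) InterpretScriptResult(int exitCode, string stdout, string stderr)
{
    if (exitCode == 0) return (true, stdout.Trim());

    using JsonDocument doc = JsonDocument.Parse(stderr);
    JsonElement error = doc.RootElement.GetProperty("error");
    string code = error.GetProperty("code").GetString() ?? "";
    string message = error.GetProperty("message").GetString() ?? "";
    return (false, $"{code}: {message}");
}
```

Checking the exit code first, and only then parsing the matching stream, avoids guessing which stream carries the structured payload.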
License
License to be determined.
| Product | Versions (compatible and additional computed target frameworks) |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
This package has no dependencies.