DocNest.Parsers 0.2.0

.NET 8.0

dotnet add package DocNest.Parsers --version 0.2.0

NuGet\Install-Package DocNest.Parsers -Version 0.2.0

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="DocNest.Parsers" Version="0.2.0" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

<PackageVersion Include="DocNest.Parsers" Version="0.2.0" />

Directory.Packages.props

<PackageReference Include="DocNest.Parsers" />

Project file

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.

paket add DocNest.Parsers --version 0.2.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: DocNest.Parsers, 0.2.0"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package DocNest.Parsers@0.2.0

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

#addin nuget:?package=DocNest.Parsers&version=0.2.0

Install as a Cake Addin

#tool nuget:?package=DocNest.Parsers&version=0.2.0

Install as a Cake Tool

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

DocNest .NET

Secure · Fast · Reliable · Cost-Effective

The document normalization engine RAG has always needed — native for .NET.

Install • Quick start • Library API • CLI • How it works • Packages • Accuracy


An idiomatic .NET / C# port of DocNest (docnest-ai on PyPI). DocNest reads a document's structure before its content — every heading becomes a navigable §section, every table is preserved as { caption, headers, rows[] } — so an LLM always receives the right section as context instead of a blind 512-char slice. The output is a portable .udf knowledge base, byte-compatible with the Python implementation.
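To make "structure before content" concrete, here is a minimal sketch of heading-based sectioning for Markdown input. It is not DocNest's parser: `SketchSection`, `HeadingSplitSketch` and the `§n` numbering are hypothetical names chosen for illustration, and the sketch treats every line that starts with `#` as a heading (so it would misread a `#` inside fenced code).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration only: not DocNest's parser or its section model.
public sealed record SketchSection(string Id, string Title, string Body);

public static class HeadingSplitSketch
{
    // Splits Markdown text into sections at ATX headings ("#", "##", ...).
    // Text before the first heading is kept under a "(preamble)" section.
    public static List<SketchSection> Split(string markdown)
    {
        var sections = new List<SketchSection>();
        var title = "(preamble)";
        var body = new StringBuilder();

        void Flush()
        {
            var text = body.ToString().Trim();
            // Skip an empty preamble, but keep headed sections even when empty.
            if (text.Length > 0 || title != "(preamble)")
                sections.Add(new SketchSection($"§{sections.Count + 1}", title, text));
            body.Clear();
        }

        foreach (var line in markdown.Split('\n'))
        {
            var trimmed = line.TrimEnd('\r');
            if (trimmed.StartsWith('#'))
            {
                Flush();                                  // close the previous section
                title = trimmed.TrimStart('#').Trim();    // start a new one
            }
            else
            {
                body.AppendLine(trimmed);
            }
        }
        Flush();
        return sections;
    }
}
```

Calling `HeadingSplitSketch.Split(text)` yields one record per heading, so a retriever can hand the LLM a whole section rather than an arbitrary character window.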

Status: pre-1.0, built slice-by-slice under a gated protocol. Core pipeline, hybrid retrieval, cross-encoder reranking, and the 5-layer answer engine are implemented and tested.

Two independent choices

Embeddings run locally — a small ONNX MiniLM model (+ an optional ONNX cross-encoder reranker), downloaded once and cached. No API key, fully offline. (Cloud embedding providers such as OpenAI are supported in the Python engine but are not yet ported to .NET — embeddings here are local-only.)
The LLM is optional — Layers 0–1 answer factual questions at zero tokens, no key. Add a provider only for synthesis (Layers 2–4): any OpenAI-compatible endpoint (OpenAI, Groq, Cerebras, Together, OpenRouter), Anthropic, or a fully local Ollama / LM Studio server. Here "OpenAI" means the answer LLM, not embeddings.

The problem it solves

Most RAG pipelines ingest documents the same broken way: extract text → split every 512 chars → embed → hope. Fixed-size splitting shreds tables and cuts clauses mid-sentence, so the LLM gets noise and returns approximate answers. DocNest preserves structure:

// A revenue table survives as structured data the LLM can actually reason over:
{
  "section": "§4.2 Revenue by Region",
  "table": {
    "headers": ["Region", "Q2", "Q3", "Change"],
    "rows": [["Europe", "38.1%", "45.2%", "+7.1pp"], ["Asia", "29.3%", "41.7%", "+12.4pp"]]
  }
}
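Because the table keeps its headers and rows, a cell can be addressed by row and column name instead of being guessed from flattened text. The sketch below reads JSON of the shape shown above with System.Text.Json; `TableLookupSketch` and `Cell` are hypothetical helpers, not part of the DocNest API, and the input must be plain JSON (without the `//` comment line used above for explanation).

```csharp
#nullable enable
using System;
using System.Text.Json;

public static class TableLookupSketch
{
    // Returns the cell under `header` in the row whose first column equals `rowKey`.
    // Returns null when the header or the row is not found.
    public static string? Cell(string json, string rowKey, string header)
    {
        using var doc = JsonDocument.Parse(json);
        var table = doc.RootElement.GetProperty("table");

        // Find the column index of the requested header.
        var column = -1;
        var index = 0;
        foreach (var h in table.GetProperty("headers").EnumerateArray())
        {
            if (h.GetString() == header) { column = index; break; }
            index++;
        }
        if (column < 0) return null;

        // Find the row by its first cell and return the requested column.
        foreach (var row in table.GetProperty("rows").EnumerateArray())
        {
            if (row.GetArrayLength() > column && row[0].GetString() == rowKey)
                return row[column].GetString();
        }
        return null;
    }
}
```

With the revenue table above, `TableLookupSketch.Cell(json, "Asia", "Q3")` returns `"41.7%"`.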

📦 Install

# Library — add what you need (DocNest.Abstractions comes transitively)
dotnet add package DocNest.Core        # pipeline, .udf reader/writer, normaliser
dotnet add package DocNest.Parsers     # md / html / csv / docx / xlsx / pdf
dotnet add package DocNest.Retrieval   # hybrid retriever (FTS5 + dense + rerank + RRF + graph)
dotnet add package DocNest.Query       # 5-layer answer engine + LLM providers
dotnet add package DocNest.Embeddings  # optional: local ONNX embeddings + cross-encoder reranker

# CLI — installs the `docnest` command
dotnet tool install -g DocNest.Cli

🚀 Quick start (60 seconds)

No API key, no internet — parse a document, save a .udf, and answer factual questions at 0 LLM tokens:

using DocNest;
using DocNest.Parsers;
using DocNest.Pipeline;
using DocNest.Query;
using DocNest.Retrieval;
using DocNest.Udf;

// 1. Parse → normalise → write a portable .udf knowledge base
var raw = await new ParserFactory().Get("report.pdf").ParseAsync("report.pdf");
var doc = new DocNestPipeline().Process(raw);
await new UdfWriter().WriteAsync(doc, "report.udf");

// 2. Load it back and ask a question (deterministic layers — no LLM)
var document = (await UdfReader.LoadAsync("report.udf")).ToDocument();

using var retriever = new HybridRetriever(".docnest_cache");
var engine = new DocNestQueryEngine(retriever);          // no LLM → Layers 0–1 only
var result = await engine.AnswerAsync(document, "What was Q3 revenue?", allowLlm: false);

Console.WriteLine(result.Answer);      // e.g. "Q3 revenue: $38M (source: §3.1)"
Console.WriteLine(result.LayerUsed);   // 0 or 1 — answered from the index
Console.WriteLine(result.TokensUsed);  // 0

🧰 Library API

Add an LLM (Layers 2–4) — any OpenAI-compatible endpoint

OpenAiCompatibleLlmProvider works with OpenAI, Groq, Cerebras, Together, OpenRouter, and local servers (Ollama, LM Studio) — just change the base URL and model:

using DocNest;
using DocNest.Query;
using DocNest.Retrieval;   // HybridRetriever, as in the quick start

// Groq (generous free tier) — or OpenAI, Cerebras, Ollama, …
ILlmProvider llm = new OpenAiCompatibleLlmProvider(
    apiKey:  Environment.GetEnvironmentVariable("GROQ_API_KEY")!,
    model:   "llama-3.3-70b-versatile",
    baseUrl: "https://api.groq.com/openai/v1");

// `document` is the Document loaded from report.udf in the quick start
using var retriever = new HybridRetriever(".docnest_cache");
var engine = new DocNestQueryEngine(retriever, llm);
var result = await engine.AnswerAsync(document, "Summarise the key risks.", allowLlm: true);

Console.WriteLine(result.Answer);
Console.WriteLine(string.Join(", ", result.Citations));   // e.g. §5.2, §5.3
Console.WriteLine($"Layer {result.LayerUsed} · {result.TokensUsed} tokens · conf {result.Confidence:F2}");

// Local, fully offline via Ollama (OpenAI-compatible endpoint)
ILlmProvider local = new OpenAiCompatibleLlmProvider("ollama", "qwen2.5", "http://localhost:11434/v1");

// Anthropic Claude
ILlmProvider claude = new AnthropicLlmProvider(
    Environment.GetEnvironmentVariable("ANTHROPIC_API_KEY")!, "claude-haiku-4-5-20251001");

Turn on semantic retrieval (dense embeddings)

The MiniLM ONNX model (~90 MB) is downloaded once on first use and cached locally — fully local, no cloud:

using DocNest.Embeddings;
using DocNest.Retrieval;   // HybridRetriever

var (modelPath, vocabPath) = await MiniLmModel.EnsureDownloadedAsync("./models/minilm");
using var embedder = new OnnxEmbedder(modelPath, vocabPath);

// BM25 + dense cosine + semantic-graph edges (degrades to BM25-only if no embedder is passed)
using var retriever = new HybridRetriever(".docnest_cache", embedder);
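The "dense cosine" part of retrieval compares a query vector with each section vector. The following is a generic sketch of cosine similarity and a top-k ranking over plain float arrays; `CosineSketch` is a hypothetical helper, and how DocNest stores and scores its vectors internally is not described here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CosineSketch
{
    // Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
    public static double Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * (double)b[i];
            normA += a[i] * (double)a[i];
            normB += b[i] * (double)b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Indices of the k sections most similar to the query, best first.
    public static int[] TopK(float[] query, IReadOnlyList<float[]> sections, int k) =>
        Enumerable.Range(0, sections.Count)
            .OrderByDescending(i => Cosine(query, sections[i]))
            .Take(k)
            .ToArray();
}
```

Cosine similarity ignores vector length, which is why it is a common choice for comparing sentence embeddings.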

Add the cross-encoder reranker (best accuracy on dense PDFs)

using DocNest.Embeddings;
using DocNest.Retrieval;   // HybridRetriever

var (ceModel, ceVocab) = await CrossEncoderModel.EnsureDownloadedAsync("./models/ms-marco");
using var reranker = new OnnxCrossEncoderReranker(ceModel, ceVocab);

// Re-scores the top RRF candidates by true query↔section relevance → the right section reaches the LLM
// (`embedder` is the OnnxEmbedder created in the previous snippet)
using var retriever = new HybridRetriever(".docnest_cache", embedder, reranker);

Retrieve sections directly (no LLM)

var hits = await retriever.RetrieveAsync(document, "remaining carbon budget", k: 5);
foreach (var hit in hits)
    Console.WriteLine($"{hit.Section.Id}  {hit.Section.Title}  (score {hit.Score:F3})");

Parse any supported format / register a custom parser

using DocNest;
using DocNest.Parsers;

var factory = new ParserFactory();                  // md, html, csv, docx, xlsx, pdf built in
var raw = await factory.Get("data.xlsx").ParseAsync("data.xlsx");
Console.WriteLine($"{raw.Sections.Count} sections");

// Add your own format — implement IParser and register it (first match wins)
factory.Register(new MyFormatParser());             // class MyFormatParser : IParser

Inspect a `.udf`

using DocNest.Udf;   // UdfReader, as in the quick start

var package = await UdfReader.LoadAsync("report.udf");
Console.WriteLine($"Title:       {package.Manifest.Title}");
Console.WriteLine($"UDF version: {package.Manifest.UdfVersion}");
Console.WriteLine($"Sections:    {package.Catalogue.SectionIndex.Count}");
Console.WriteLine($"Key numbers: {package.Catalogue.KeyNumbers.Count}");

🖥 CLI

dotnet tool install -g DocNest.Cli      # provides the `docnest` command

# Convert a document to .udf (-q float32|float16|int8|binary, default float16)
docnest convert report.pdf -o report.udf

# Ask a question (deterministic layers by default; add an LLM for Layers 2–4)
docnest query report.udf "What was Q3 revenue?"
docnest query report.udf "Summarise the risks." \
  --provider openai --model llama-3.3-70b-versatile \
  --base-url https://api.groq.com/openai/v1 --api-key "$GROQ_API_KEY"

# Catalogue summary
docnest info report.udf
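The `-q` option above trades vector precision for file size. How DocNest quantises is not spelled out here, so the sketch below only illustrates two common schemes under that assumption: symmetric int8 scaling (one scale factor per vector) and binary sign-bit packing. `QuantizeSketch` and its method names are hypothetical.

```csharp
using System;

public static class QuantizeSketch
{
    // Symmetric int8: map [-maxAbs, +maxAbs] onto [-127, 127] with one scale per vector.
    // Dequantise with value * Scale.
    public static (sbyte[] Values, float Scale) ToInt8(float[] v)
    {
        var maxAbs = 0f;
        foreach (var x in v) maxAbs = Math.Max(maxAbs, Math.Abs(x));
        var scale = maxAbs == 0f ? 1f : maxAbs / 127f;

        var q = new sbyte[v.Length];
        for (var i = 0; i < v.Length; i++)
            q[i] = (sbyte)Math.Clamp((int)MathF.Round(v[i] / scale), -127, 127);
        return (q, scale);
    }

    // Binary: keep only the sign of each component, packed 8 per byte (bit i%8 of byte i/8).
    public static byte[] ToBinary(float[] v)
    {
        var bits = new byte[(v.Length + 7) / 8];
        for (var i = 0; i < v.Length; i++)
            if (v[i] > 0f) bits[i / 8] |= (byte)(1 << (i % 8));
        return bits;
    }
}
```

Per component, int8 uses a quarter of the space of float32 and binary uses 1/32, at the cost of coarser similarity scores.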

🧠 How it works

A document is normalised once, then queried forever:

file  → IParser → DocNestPipeline (normalise · key-numbers · keywords) → Document → .udf
query → HybridRetriever (BM25 + dense + cross-encoder rerank + RRF + 1-hop graph) → top-k sections
      → DocNestQueryEngine (5 layers) → answer (+ citations, tokens, confidence)
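The RRF step in the retriever line refers to reciprocal rank fusion, which merges several ranked lists (for example a BM25 list and a dense list) without having to compare their raw scores. Below is a generic sketch of the standard formula, score(d) = Σ 1 / (k + rank), with the conventional k = 60. `RrfSketch` is a hypothetical name, and the constant and tie-breaking DocNest actually uses are assumptions here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RrfSketch
{
    // Each ranking is a list of section ids, best first.
    // A section's fused score is the sum of 1 / (k + rank) over every list it appears in (rank starts at 1).
    public static List<(string Id, double Score)> Fuse(
        IEnumerable<IReadOnlyList<string>> rankings, int k = 60)
    {
        var scores = new Dictionary<string, double>();
        foreach (var ranking in rankings)
        {
            for (var i = 0; i < ranking.Count; i++)
            {
                scores.TryGetValue(ranking[i], out var current);   // 0 when unseen
                scores[ranking[i]] = current + 1.0 / (k + i + 1);
            }
        }

        return scores
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.Ordinal)             // deterministic ties
            .Select(p => (Id: p.Key, Score: p.Value))
            .ToList();
    }
}
```

A section ranked near the top by both lists beats one ranked first by only one of them, which is the behaviour a hybrid retriever wants.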

The .udf is a self-contained ZIP — manifest.json (version, model) + catalogue.json (section index, key-numbers, keywords) + content.json (section text/tables) + embeddings.bin (quantised vectors) — portable and byte-compatible with the Python engine.
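Since a .udf is a ZIP, it can be inspected with the standard library alone. The sketch below uses System.IO.Compression to list the entries and read one as text; it assumes the four files named above sit at the archive root, and `UdfZipSketch` is a hypothetical helper rather than part of DocNest (use UdfReader for real work).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class UdfZipSketch
{
    // Lists entry names, e.g. manifest.json, catalogue.json, content.json, embeddings.bin.
    public static List<string> ListEntries(Stream udf)
    {
        using var zip = new ZipArchive(udf, ZipArchiveMode.Read, leaveOpen: true);
        return zip.Entries.Select(e => e.FullName).ToList();
    }

    // Reads one entry (such as manifest.json) as text.
    public static string ReadText(Stream udf, string entryName)
    {
        using var zip = new ZipArchive(udf, ZipArchiveMode.Read, leaveOpen: true);
        var entry = zip.GetEntry(entryName)
            ?? throw new FileNotFoundException($"Entry not found: {entryName}");
        using var reader = new StreamReader(entry.Open());
        return reader.ReadToEnd();
    }
}
```

For a file on disk, pass `File.OpenRead("report.udf")` as the stream.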

Five answer layers — escalate only as needed

Layer	Mechanism	Tokens
0	Pre-computed key-numbers / summary	0
1	Extractive from the top section	0
2	Single-section LLM	~300
3	Multi-section synthesis (reranked context)	~900
4	Broad fallback over retrieved sections	~1,500

Layers 0–1 answer many factual questions at zero LLM cost; the engine escalates to the LLM only when the deterministic layers aren't confident.
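The escalation idea can be sketched as a loop that tries the cheapest layer first and stops at the first confident answer. Everything in this sketch is hypothetical: `LayerResult`, `EscalationSketch`, the 0.7 threshold and the fallback to the best low-confidence answer are illustrative assumptions, not DocNestQueryEngine's actual logic.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical result shape for the sketch.
public sealed record LayerResult(string Answer, double Confidence, int TokensUsed);

public static class EscalationSketch
{
    // Tries layers in order. Layers that need an LLM are skipped when allowLlm is false.
    // Returns the first result at or above the threshold, else the most confident one seen, else null.
    public static (int Layer, LayerResult Result)? Answer(
        IReadOnlyList<(bool NeedsLlm, Func<string, LayerResult?> Run)> layers,
        string question,
        bool allowLlm,
        double threshold = 0.7)
    {
        (int Layer, LayerResult Result)? best = null;

        for (var i = 0; i < layers.Count; i++)
        {
            var (needsLlm, run) = layers[i];
            if (needsLlm && !allowLlm) continue;

            var result = run(question);
            if (result is null) continue;                      // this layer has no answer
            if (result.Confidence >= threshold) return (i, result);

            if (best is null || result.Confidence > best.Value.Result.Confidence)
                best = (i, result);
        }
        return best;
    }
}
```

Ordering the layers by cost means token spend only rises when the cheaper layers cannot answer with confidence.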

📦 Packages

Package	Role
`DocNest.Abstractions`	Domain records + wrapper interfaces (`IParser`, `IEmbedder`, `IReranker`, `IRetriever`, `ILlmProvider`)
`DocNest.Core`	Pipeline, normaliser, `.udf` reader/writer, quantizer
`DocNest.Parsers`	md / html / csv / docx / xlsx / pdf parsers
`DocNest.Embeddings`	ONNX MiniLM embedder + ms-marco cross-encoder reranker
`DocNest.Retrieval`	Hybrid retriever (FTS5 BM25 + dense + rerank + RRF + graph)
`DocNest.Query`	5-layer answer engine + LLM providers
`DocNest.Storage`	`.udf` ZIP storage backend
`DocNest.Cli`	`docnest` dotnet tool (`convert` / `query` / `info`)

Every external dependency sits behind a DocNest wrapper interface; package versions are centrally pinned.

📂 Supported formats

pdf (PdfPig, font-size heading detection) · docx / xlsx (OpenXML) · html (AngleSharp) · csv / tsv · markdown. Tables are preserved as structured { caption, headers, rows[] }, never flattened.

🧪 Accuracy

A multi-format eval (10 documents · 88 questions · 5 formats — the same set as the Python reference) tracks parity. Latest run — dense + cross-encoder rerank, gpt-oss-120b narrator, qwen2.5 judge:

Format	Score	Hit-rate (≥7)
📊 XLSX	8.7 / 10	93%
📋 MD	8.7 / 10	100%
📝 DOCX	7.0 / 10	79%
🌐 HTML	4.8 / 10	50%
📄 PDF	6.8 / 10	70%
Overall	~7.1 / 10	~78%

The cross-encoder reranker lifted PDFs from 5.1 → 6.8 (hit-rate 47% → 70%). The figures are reported as measured and are reproducible; see eval/. The Python reference scores 8.5/10 with gpt-oss-120b, and this .NET port is closing the gap slice by slice.
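For readers reproducing the table, the two columns can be computed from per-question judge scores as a mean and the share of scores at or above 7. This is a minimal sketch of that arithmetic; `EvalMetricsSketch` is a hypothetical helper, and the scripts in eval/ remain the reference for how the published numbers are produced.

```csharp
using System;
using System.Collections.Generic;

public static class EvalMetricsSketch
{
    // Mean score (0-10 scale) and hit-rate: the fraction of scores >= hitThreshold.
    public static (double Mean, double HitRate) Summarise(
        IReadOnlyList<double> scores, double hitThreshold = 7.0)
    {
        if (scores.Count == 0)
            throw new ArgumentException("At least one score is required.");

        double sum = 0;
        var hits = 0;
        foreach (var s in scores)
        {
            sum += s;
            if (s >= hitThreshold) hits++;
        }
        return (sum / scores.Count, hits / (double)scores.Count);
    }
}
```

For scores 8, 9, 6 and 7 this gives a mean of 7.5 and a hit-rate of 0.75.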

🛠 Development

Built under a mandatory gated protocol: understand (BA / Dev / QA + roadmap) → plan → impact/risk → design + ADR → tests-first → full suite green → owner sign-off per phase. No change may break the .udf cross-ecosystem contract, UDF_VERSION, or the public API.

Doc	Purpose
CHARTER	Vision, audience, success metrics
DEVELOPMENT_PROTOCOL	The gated workflow
ROADMAP	Slices and milestones
ADRs	Architecture decision records
Phase 0 docs	Per-slice BA / Dev / QA understanding

📄 License

MIT — free for commercial use. See LICENSE.

🔗 Ecosystem

Project	Description
docnest	The original Python engine (`pip install docnest-ai`)
udf-spec	Open specification for the `.udf` format


Product	Compatible and additional computed target framework versions.
.NET	net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.


net8.0
- AngleSharp (>= 1.1.2)
- DocNest.Abstractions (>= 0.2.0)
- DocNest.Core (>= 0.2.0)
- DocumentFormat.OpenXml (>= 3.1.0)
- PdfPig (>= 0.1.14)

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
0.2.0	55	6/15/2026