Mostlylucid.StyloExtract.Templates.Postgres 1.8.0-alpha.8

This is a prerelease version of Mostlylucid.StyloExtract.Templates.Postgres.

Mostlylucid.StyloExtract.Templates.Postgres

PostgreSQL-backed template index for StyloExtract. Implements the same ITemplateIndex contract as Mostlylucid.StyloExtract.Templates (the SQLite provider); swap providers via DI with no change to calling code.

When to use this instead of SQLite

Choose the Postgres provider when:

Your deployment already runs PostgreSQL as its operational database (StyloBot commercial, multi-tenant SaaS)
You need multiple extraction nodes sharing one template store (Npgsql pools connections; Postgres serialises concurrent writes natively)
You plan to add pgvector cosine-similarity search in a future upgrade (the schema is forward-compatible)

The SQLite provider (Mostlylucid.StyloExtract.Templates) is the right choice for single-host or air-gapped deployments, CLI tools, and anywhere you want zero external dependencies.

Installation

dotnet add package Mostlylucid.StyloExtract.Templates.Postgres --version 1.8.0-alpha.8

Usage

// Register the Postgres provider. Call this instead of (or after) AddStyloExtract()
// to replace the SQLite ITemplateIndex with the Postgres one.
services.AddStyloExtractPostgres(o =>
    o.ConnectionString = "Host=localhost;Port=5432;Database=styloextract;Username=se;Password=secret");

// Optional: register drift-triggered refit support (mirrors RefitOrchestrator for SQLite).
services.AddStyloExtractPostgresRefit(
    driftRefitThreshold: 0.35,
    observationsBeforeStable: 5,
    versionHistoryDepth: 3);

Schema is applied idempotently on the first operation (CREATE TABLE IF NOT EXISTS). No migration tool required.

Storage model

Table	Contents
`templates`	Template id (bytea), host hash, fingerprint, extractor JSON blob, version, observation count
`template_lsh_band_index`	LSH bucket rows for fast-path lookup
`template_observations`	Per-request observation vectors (bounded to last 100 per template)
`template_version_history`	Past extractor versions retained for diff generation

Columns that are BLOB in SQLite are bytea in Postgres. Timestamps are bigint Unix milliseconds. No pgvector dependency in v1; vector similarity uses the same CPU-side cosine math as the SQLite provider.
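As a rough illustration of the two conventions above (Unix-millisecond timestamps and CPU-side cosine similarity), here is a minimal sketch. The class and method names are hypothetical and are not taken from the package; only the `DateTimeOffset` calls are standard .NET APIs.

```csharp
using System;

// Hypothetical helper, not part of the package API.
public static class StorageSketch
{
    // Timestamps are stored as bigint Unix milliseconds.
    public static long ToUnixMs(DateTimeOffset t) => t.ToUnixTimeMilliseconds();

    public static DateTimeOffset FromUnixMs(long ms) =>
        DateTimeOffset.FromUnixTimeMilliseconds(ms);

    // Plain cosine similarity computed on the CPU.
    // Returns 0 when either vector has zero norm (an assumption of this sketch).
    public static double Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Because the similarity is computed in application code, the same routine can serve both providers; the database only has to return the stored vectors.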

AOT

This package sets IsAotCompatible=false because Npgsql requires runtime reflection for connection-string parsing. It will not break AOT builds in packages that do not reference it (sibling packages such as StyloExtract.Playwright remain AOT-safe).

Full documentation and package family

Product	Compatible and additional computed target framework versions.
.NET	net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

net10.0
- Mostlylucid.StyloExtract.Abstractions (>= 1.8.0-alpha.8)
- Mostlylucid.StyloExtract.Fingerprint (>= 1.8.0-alpha.8)
- Mostlylucid.StyloExtract.Templates (>= 1.8.0-alpha.8)
- Npgsql (>= 10.0.3)


Version	Downloads	Last Updated
1.8.0-alpha.8	0	6/25/2026
1.8.0-alpha.4	0	6/25/2026
1.8.0-alpha.3	0	6/25/2026
1.8.0-alpha.2	0	6/25/2026
1.8.0-alpha.1	5	6/24/2026

StyloExtract 1.8.0-alpha.5 - 2026-06-25
========================================

In-process CPU LLM backend (LLamaSharp) + 13-model bench harness.
Operators can now embed a single ~2-3 GB GGUF model in the host
process — no Ollama server, no separate LLM daemon. Same
ILlmTextProvider contract as the Ollama backend, so the
LlmTemplateInducer + production enrichment coordinator + CLI
`template train` all work unchanged.

What's new since 1.8.0-alpha.4
------------------------------

Mostlylucid.StyloExtract.Llm.LlamaSharp

   New package. ILlmTextProvider implementation backed by LLamaSharp
   0.27 (the .NET binding for llama.cpp). Loads a GGUF model from
   disk; the executor reads the model's chat template from GGUF
   metadata so prompts written for Ollama work unchanged.

   Wire-up:

       services.AddStyloExtract(...);
       services.AddStyloExtractLlamaSharp(o =>
       {
           o.ModelPath = "/var/models/Phi-4-mini-instruct-Q4_K_M.gguf";
           o.ContextSize = 8192;
           o.GpuLayerCount = 0;        // pure CPU target
       });
       services.AddStyloExtractLlmInducer("config/templates");

   Anti-prompt set covers Qwen, Phi, Llama 3+, and Gemma 4 stop
   tokens so the generator halts at the model's natural turn boundary
   instead of echoing the chat template structure.

   Known LLamaSharp 0.27 issue documented in the package README:
   Gemma 4 E2B / E4B's chat template metadata isn't applied cleanly
   by StatelessExecutor — the model emits Jinja2 template source
   instead of YAML. Phi-4-mini, Qwen 2.5 Coder, Llama 3.2 work fine.

Model benchmark harness

   New tests/StyloExtract.Llm.Benchmark project — runs the
   cross-product of (models × pages) for template induction and
   reports F1 / train-time / markdown-size matrices. Reuses WCXB
   ground-truth shape (one HTML.gz per page id, one ground-truth
   JSON) and the operator-template store path.

   Model spec routing: `llamasharp:/path/to/file.gguf` resolves via
   the in-process backend; anything else hits Ollama. Lets one
   bench compare server (Ollama) and embedded (LlamaSharp) backends
   side-by-side with identical fixtures.
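The routing rule above could look roughly like the following. This is a sketch only: the enum, class, and method names are hypothetical, and treating the `llamasharp:` prefix as case-sensitive is an assumption.

```csharp
using System;

// Hypothetical types for illustration; the bench harness may be structured differently.
public enum LlmBackend { Ollama, LlamaSharp }

public static class ModelSpecRouter
{
    private const string Prefix = "llamasharp:";

    // "llamasharp:/path/to/file.gguf" -> in-process backend with the GGUF path;
    // anything else is passed through to Ollama as a model tag.
    public static (LlmBackend Backend, string Target) Resolve(string spec)
    {
        if (string.IsNullOrWhiteSpace(spec))
            throw new ArgumentException("Model spec must not be empty.", nameof(spec));

        if (spec.StartsWith(Prefix, StringComparison.Ordinal))
            return (LlmBackend.LlamaSharp, spec.Substring(Prefix.Length));

        return (LlmBackend.Ollama, spec);
    }
}
```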

Recommended models (empirically validated)

   For Ollama backend:
     * qwen3.5:4b           — 3 GB, ~26 s, F1 0.805 (default, best)
     * qwen2.5-coder:3b     — 2 GB, ~21 s, F1 0.767 (smaller-and-faster pick;
                                                      code-trained matters for
                                                      CSS selectors)
     * qwen3.5:0.8b         — 1 GB, ~5 s, F1 0.528 (tiny floor)

   For LLamaSharp backend (use bartowski quants):
     * Phi-4-mini-instruct Q4_K_M    — 2.5 GB, verified working
     * Qwen 3.5 4B Q4_K_M            — 3 GB, verified working
     * Qwen 2.5 Coder 3B Q4_K_M      — 2 GB, verified working

OllamaTextProviderOptions default model bumped

   Default tag was gemma4:e4b-it-qat; switched to qwen3.5:4b per the
   bench. The doc-comment now lists the smaller-and-faster pick and
   the model families to avoid (thinking-mode budget burn).

Tests

   494 across 11 projects. New StyloExtract.Llm.LlamaSharp.Tests
   project covers ctor validation, missing-file behaviour, and
   SkippableFact live-GGUF integration (skipped without
   STYLOEXTRACT_LLAMASHARP_MODEL env var pointing at a GGUF file).

StyloExtract 1.8.0-alpha.4 - 2026-06-25
========================================

Tiny patch alpha to fix two consumer-facing bugs found while smoke-
installing alpha.3 against NuGet.

What's new since 1.8.0-alpha.3
------------------------------

SQLite chain CVE patched (GHSA-2m69-gcr7-jv3q)

   Microsoft.Data.Sqlite bumped 10.0.1 -> 10.0.9; StyloExtract.Templates
   gains a direct PackageReference to SQLitePCLRaw.bundle_e_sqlite3 so
   the existing 3.0.3 central pin lifts the resolved bundle off the
   vulnerable 2.1.11 line and onto SourceGear.sqlite3 3.50.4.5.
   `dotnet list package --vulnerable` on consumer projects now
   returns clean.

PlaywrightHtmlFetcher.Dispose() (sync path)

   The fetcher previously only implemented IAsyncDisposable. When
   registered as a DI singleton (which AddStyloExtractPlaywright()
   does), `using var sp = services.BuildServiceProvider()` — the
   canonical sync pattern — threw at container shutdown:

     InvalidOperationException: 'PlaywrightHtmlFetcher' type only
     implements IAsyncDisposable. Use DisposeAsync to dispose the
     container.

   Added a sync Dispose() that blocks on the async path. Container
   disposal happens off the request hot path, so the sync wait is safe.
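The general pattern of bridging a sync Dispose() onto an async one can be sketched as below. The class is a hypothetical stand-in, not the actual PlaywrightHtmlFetcher, and the `Task.Delay` call merely simulates real async cleanup.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a resource whose cleanup is asynchronous.
public sealed class AsyncResourceSketch : IAsyncDisposable, IDisposable
{
    private bool _disposed;

    public bool IsDisposed => _disposed;

    public async ValueTask DisposeAsync()
    {
        if (_disposed) return;   // idempotent: a second call is a no-op
        _disposed = true;

        // Simulated async cleanup. ConfigureAwait(false) avoids resuming on a
        // captured synchronization context, which matters when Dispose() blocks.
        await Task.Delay(1).ConfigureAwait(false);
    }

    // Sync path: block until the async cleanup has finished.
    public void Dispose() => DisposeAsync().AsTask().GetAwaiter().GetResult();
}
```

With both interfaces implemented, `using var sp = services.BuildServiceProvider()` can dispose the singleton without throwing, while `await using` callers still take the async path.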

Both fixes are backwards-compatible drop-in patches. No code changes
needed in consumer projects beyond bumping the package version.

492 tests across 10 projects, all green.

StyloExtract 1.8.0-alpha.3 - 2026-06-25
========================================

What's new since 1.8.0-alpha.2
------------------------------

Next.js __NEXT_DATA__ rehydration extractor

   Next.js apps embed their page state in a JSON blob inside
   <script id="__NEXT_DATA__" type="application/json">. Schemas vary
   per site (Shopify Hydrogen uses pageProps.shopifyProductsPreloadedState,
   news sites use pageProps.initialState.article.body) so the
   extractor walks props.pageProps recursively and collects every
   string value that looks like prose (>= 80 chars, contains a space,
   isn't a URL / data URI / CSS variable / serialised JSON). Conservative
   key-exclusion list keeps URLs and build metadata out of the result.
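A minimal sketch of the "looks like prose" walk described above, using System.Text.Json. The class and method names are hypothetical, the exact URL / CSS-variable / JSON checks are assumptions about how such a filter might be written, and the key-exclusion list is left out for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical sketch; the real extractor's rules may differ.
public static class NextDataProseSketch
{
    public static bool LooksLikeProse(string s)
    {
        if (string.IsNullOrWhiteSpace(s)) return false;
        string t = s.Trim();
        if (t.Length < 80 || !t.Contains(' ')) return false;

        // Assumed rejections: URLs, data URIs, CSS variables, serialised JSON.
        if (t.StartsWith("http://", StringComparison.OrdinalIgnoreCase)) return false;
        if (t.StartsWith("https://", StringComparison.OrdinalIgnoreCase)) return false;
        if (t.StartsWith("data:", StringComparison.OrdinalIgnoreCase)) return false;
        if (t.StartsWith("var(--", StringComparison.Ordinal)) return false;
        if (t.StartsWith("--", StringComparison.Ordinal)) return false;
        if (t.StartsWith('{') || t.StartsWith('[')) return false;
        return true;
    }

    // Depth-first walk that collects every string value passing the filter.
    public static void Collect(JsonElement node, List<string> acc)
    {
        switch (node.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (JsonProperty p in node.EnumerateObject()) Collect(p.Value, acc);
                break;
            case JsonValueKind.Array:
                foreach (JsonElement e in node.EnumerateArray()) Collect(e, acc);
                break;
            case JsonValueKind.String:
                string s = node.GetString();
                if (s != null && LooksLikeProse(s)) acc.Add(s);
                break;
        }
    }

    // Entry point: parse the __NEXT_DATA__ JSON and walk props.pageProps.
    public static List<string> FromNextData(string json)
    {
        var result = new List<string>();
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;
        if (root.ValueKind == JsonValueKind.Object
            && root.TryGetProperty("props", out JsonElement props)
            && props.ValueKind == JsonValueKind.Object
            && props.TryGetProperty("pageProps", out JsonElement pageProps))
        {
            Collect(pageProps, result);
        }
        return result;
    }
}
```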

   Chains next to the JSON-LD and Discourse rehydration fallbacks.

Content-role fallback gate

   The chained fallback (JSON-LD -> Next.js -> Discourse -> body-text)
   previously gated on the all-blocks text sum. That sum looked
   healthy for pages where the heuristic emitted 3 KB of nav + footer +
   boilerplate while finding zero MainContent — the renderer's
   MainContentOnly / Wcxb profiles drop those roles anyway, so the
   actual markdown is 0 chars. Switch the gate to content-role text
   mass only. 18 catastrophic pages recovered without any new code,
   just the gate change.

Playwright auto-fallback decorator

   AddStyloExtractPlaywright() wires PlaywrightHtmlFetcher AND
   decorates the existing ILayoutExtractor with a RenderingLayoutExtractor
   that runs static extraction first, then re-fetches via Playwright
   only when:
     * the caller passed a non-null sourceUri
     * the static result has < 200 chars of content-role text
     * an IRenderedHtmlFetcher is wired in DI

   File-only callers never trigger a render. Operators who don't want
   the Chromium dependency simply don't add the package. Three guards
   against wasted work: Playwright throws -> return static; rendered
   HTML same length as static -> skip the re-extract; re-extract
   yields no improvement -> return static.
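The trigger conditions and the three guards can be summarised in a small policy sketch. All names here are hypothetical; the delegates stand in for the Playwright fetch and the re-extraction step, and only the 200-character threshold comes from the notes above.

```csharp
#nullable enable
using System;

// Hypothetical policy sketch; not the RenderingLayoutExtractor implementation.
public static class RenderFallbackPolicySketch
{
    public const int MinContentChars = 200;

    // All three conditions must hold before a render is attempted.
    public static bool ShouldRender(Uri? sourceUri, int staticContentChars, bool fetcherRegistered) =>
        sourceUri is not null
        && staticContentChars < MinContentChars
        && fetcherRegistered;

    // Applies the three guards and returns the HTML (and its content-role
    // character count) that should be kept.
    public static (string Html, int ContentChars) Resolve(
        string staticHtml,
        int staticContentChars,
        Func<string> render,             // stands in for the Playwright fetch
        Func<string, int> reExtract)     // stands in for content-role extraction
    {
        string rendered;
        try
        {
            rendered = render();
        }
        catch (Exception)
        {
            return (staticHtml, staticContentChars);   // guard 1: render threw
        }

        if (rendered.Length == staticHtml.Length)
            return (staticHtml, staticContentChars);   // guard 2: same length, skip re-extract

        int renderedChars = reExtract(rendered);
        return renderedChars > staticContentChars
            ? (rendered, renderedChars)
            : (staticHtml, staticContentChars);        // guard 3: no improvement
    }
}
```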

   Usage:

       services.AddStyloExtract(...);
       services.AddStyloExtractPlaywright();

   492 tests across 10 projects, 6 new unit tests for the decorator
   policy.

Aggregate WCXB (1495 dev pages, Wcxb profile):

   | Stage                                  |    F1 | Catastrophic |
   |----------------------------------------|------:|-------------:|
   | 1.8.0-alpha.2                          | 0.760 |           25 |
   | + Next.js extractor                    |  same |              |
   | + content-role fallback gate           | 0.760 |           17 |
   | + 14 LLM-trained YAMLs                 | 0.760 |           17 |
   | (Playwright auto-fallback)             |    -- |              |

   Playwright auto-fallback is wired but not exercised in the WCXB
   benchmark by default — needs `playwright install chromium`. Real-
   world consumers with the package added see automatic recovery for
   JS-rendered SPAs whose content is hydrated client-side.

StyloExtract 1.8.0-alpha.2 - 2026-06-25
========================================

LLM template-training loop, Discourse rehydration, plus a stack of
heuristic + selection fixes that move the WCXB dev split from F1 0.673
(post-1.7.1, MainContentOnly profile) to F1 0.760 (Wcxb plain-text
profile, with operator-trained templates + Discourse rehydration
active). Catastrophic extraction failures (pred_chars ≤ 5) drop from
92 of 1495 pages to 25.

Beats Readability on every page type. Closes the gap to Trafilatura by
~40% on Article + Documentation. Above v1.5.4 baseline (0.718) by
+0.042 — and that's keeping all the GFM markdown structure (sidebar
TOCs, blockquotes, GFM tables) in the runtime output, not stripping
to plain text for benchmark flattery.

What's new since 1.8.0-alpha.1
------------------------------

LLM template training loop (`stylo-extract template train`)

   Operator-driven synchronous LLM template specialisation, the
   counterpart to the existing async enrichment coordinator. Smart-
   routes between induce (no template yet) and repair (template
   exists but underperforms).

   Closed-selector prompt: every selector the model can choose from
   is enumerated from the actual page DOM via DocumentSelectorCatalog
   and handed to the LLM in the prompt. Inventing selectors fails.

   Post-parse AngleSharp validation: every selector the model returns
   is run through doc.QuerySelectorAll. Selectors that match zero
   elements are dropped; templates whose MainContent rule has no
   surviving selector are rejected.
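The validation rule above can be sketched without a DOM library by passing in a match-counting delegate. In a real pipeline that delegate would presumably wrap something like `doc.QuerySelectorAll(selector).Length`; the class name, the dictionary shape, and the literal "MainContent" key are assumptions made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of post-parse selector validation.
public static class SelectorValidationSketch
{
    // rules: role name -> selectors proposed by the model.
    // countMatches: how many elements a selector matches in the page DOM.
    // Returns false when the MainContent rule has no surviving selector.
    public static bool TryValidate(
        IReadOnlyDictionary<string, IReadOnlyList<string>> rules,
        Func<string, int> countMatches,
        out Dictionary<string, List<string>> surviving)
    {
        // Drop every selector that matches zero elements.
        surviving = rules.ToDictionary(
            kv => kv.Key,
            kv => kv.Value.Where(s => countMatches(s) > 0).ToList());

        return surviving.TryGetValue("MainContent", out List<string> main)
               && main.Count > 0;
    }
}
```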

   Repair prompt re-angled as a diagnostic: "why is this failing AND
   how should it work for this page" instead of just "produce a
   corrected template."

   Hash-prefixed selectors (`#my-id`) are now properly quoted in
   emitted YAML so they round-trip; the inducer also pre-repairs
   unquoted hash selectors in the LLM response before parse.
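The quoting matters because an unquoted `#` at the start of a YAML scalar begins a comment, so `- #my-id` would parse as an empty value. A minimal sketch of the emit-side quoting follows; the helper name is hypothetical and it handles only the leading-hash case.

```csharp
using System;

// Hypothetical helper; the actual YAML emitter may quote more broadly.
public static class YamlSelectorQuoting
{
    // Wraps selectors starting with '#' in double quotes, escaping any
    // backslashes and double quotes inside them. Other selectors pass through.
    public static string QuoteIfNeeded(string selector)
    {
        if (selector.Length == 0 || selector[0] != '#') return selector;

        string escaped = selector.Replace("\\", "\\\\").Replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }
}
```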

   OllamaTextProvider bumps NumPredict default 1024 → 4096
   (reasoning-tagged models burn tokens on chain-of-thought before
   the answer) and falls back to message.thinking when message.content
   is empty.

   `template repair` command + `LlmTemplateInducer.RepairFromSkeletonAsync`
   + production coordinator dispatch (TemplateEnrichmentJob.Kind +
   LayoutExtractor enqueue on low-output existing-template hits).

Discourse data-preloaded rehydration

   Discourse renders every page as an Ember.js SPA. Static HTML ships
   near-zero post content; the actual topic + posts live in a JSON
   blob in <div id="data-preloaded" data-preloaded="...JSON...">.
   DiscourseRehydrationExtractor parses the JSON, walks
   topic_NNN.post_stream.posts[*].cooked, strips tags, and emits the
   result as a synthetic MainContent fallback block — same shape as
   the existing JSON-LD fallback. Discourse powers 5 000+ public
   forums; one upstream extractor covers them all.

   WCXB lift: 6 of 13 catastrophic forum pages go from F1=0 to
   F1=0.83–0.99. Forum category F1 0.477 → 0.535.

Wcxb plain-text profile

   WCXB-style word-overlap benchmarks score against plain-text gold.
   The default MainContentOnly / RagFull output emits GFM Markdown —
   headings, lists, sidebar TOCs, multi-paragraph blockquotes — that
   improves AI / human readability but registers as precision noise
   against plain-text comparison.

   New ExtractionProfile.Wcxb uses MainContentOnly's role-set but
   emits each block's plain Text instead of its Markdown. Strictly
   a benchmark / comparison profile — runtime callers keep their
   existing profile and continue getting structured GFM.

Heuristic + selection fixes

   DomCleaner: strip <select> globally so <option> text stops
   leaking on category dropdowns. mostlylucid.net opened with 290+
   category names dumped into the output; now opens with the actual
   blog list.

   IntraBlockCleaner: content-guard the contamination-hint substring
   match. "sidebar" substring was eating WordPress / SNOFlex article
   bodies whose class contained "sidebar-mode-single". 28 catastrophic
   article pages recovered.

   LayoutExtractor: body-text fallback for old-school flat HTML
   without <main>/<article>/section wrappers. erikdemaine.org/foldcut
   and similar plain H1/H2/P-under-body pages now extract.

   LayoutExtractor: detect chrome-heavy applicator output as bug-out.
   Stale templates applied to wrong-shape pages produced 1 char of
   MainContent while combinedText looked fine (header + footer
   selectors found chrome). esprit-barbecue, nike, rei collections
   recovered.

   HeuristicBlockClassifier: empty-semantic-wrapper handling and
   body-spanning <form> fall-through. ASP.NET WebForms pages
   (drainblasterbill, etc.) recovered.

   Framework-content-class-hints: 20 new patterns — Discourse, phpBB,
   vBulletin, PrestaShop, WooCommerce, Shopify, BigCommerce,
   Squarespace, Webflow, Wix, Joomla, GitHub Pages, plus some misc.

Benchmark harness

   WCXB harness gains --operator-templates <root> for loading
   YAML files produced by `template train`, --page-ids for fast
   repro of individual failures.

Aggregate WCXB (1495 dev pages, Wcxb profile):

   | System                      |    F1 | Precision | Recall |
   |-----------------------------|------:|----------:|-------:|
   | StyloExtract v1.8.0-alpha.2 | 0.760 |     0.756 |  0.849 |
   | rs-trafilatura              | 0.859 |     0.863 |  0.890 |
   | Trafilatura                 | 0.791 |     0.852 |  0.793 |
   | Readability                 | 0.675 |     0.685 |  0.713 |

Compatibility

Backwards-compatible with 1.8.0-alpha.1. All changes are either new
code paths (Discourse extractor, Wcxb profile, train CLI), strictly
better selection (the heuristic fixes), or schema-additive
(TemplateEnrichmentJob gains optional Kind / BadMarkdownSample with
default Induce). Existing operator templates and trained YAMLs from
alpha.1 continue to work unchanged.

Suite: 486 tests across 10 projects, all green.