Mostlylucid.StyloExtract.Templates.Postgres
1.8.0-alpha.8
PostgreSQL-backed template index for StyloExtract. Implements the same ITemplateIndex contract as Mostlylucid.StyloExtract.Templates (the SQLite provider); swap providers via DI with no change to calling code.
When to use this instead of SQLite
Choose the Postgres provider when:
- Your deployment already runs PostgreSQL as its operational database (StyloBot commercial, multi-tenant SaaS)
- You need multiple extraction nodes sharing one template store (Npgsql pools connections; Postgres serialises concurrent writes natively)
- You plan to add pgvector cosine-similarity search in a future upgrade (the schema is forward-compatible)
The SQLite provider (Mostlylucid.StyloExtract.Templates) is the right choice for single-host or air-gapped deployments, CLI tools, and anywhere you want zero external dependencies.
Installation
dotnet add package Mostlylucid.StyloExtract.Templates.Postgres
Usage
// Register the Postgres provider. Call this instead of (or after) AddStyloExtract()
// to replace the SQLite ITemplateIndex with the Postgres one.
services.AddStyloExtractPostgres(o =>
    o.ConnectionString = "Host=localhost;Port=5432;Database=styloextract;Username=se;Password=secret");

// Optional: register drift-triggered refit support (mirrors RefitOrchestrator for SQLite).
services.AddStyloExtractPostgresRefit(
    driftRefitThreshold: 0.35,
    observationsBeforeStable: 5,
    versionHistoryDepth: 3);
Schema is applied idempotently on the first operation (CREATE TABLE IF NOT EXISTS). No migration tool required.
Storage model
| Table | Contents |
|---|---|
| templates | Template id (bytea), host hash, fingerprint, extractor JSON blob, version, observation count |
| template_lsh_band_index | LSH bucket rows for fast-path lookup |
| template_observations | Per-request observation vectors (bounded to last 100 per template) |
| template_version_history | Past extractor versions retained for diff generation |
Columns that are BLOB in SQLite are bytea in Postgres. Timestamps are bigint Unix milliseconds. No pgvector dependency in v1; vector similarity uses the same CPU-side cosine math as the SQLite provider.
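The two conventions above (Unix-millisecond timestamps and CPU-side cosine similarity) can be illustrated with base-class-library calls only. This is a minimal sketch, not the provider's code: the `Cosine` helper and the sample values are assumptions made for illustration.

```csharp
using System;

// Timestamps: what a bigint Unix-milliseconds column would hold, and how to read it back.
DateTimeOffset observedAt = new DateTimeOffset(2026, 6, 25, 12, 0, 0, TimeSpan.Zero);
long stored = observedAt.ToUnixTimeMilliseconds();
DateTimeOffset roundTripped = DateTimeOffset.FromUnixTimeMilliseconds(stored);
Console.WriteLine(roundTripped == observedAt);

// Hypothetical helper: cosine similarity of two equal-length vectors, computed on the CPU.
static double Cosine(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    // Treat a zero vector as having no similarity to anything.
    return normA == 0 || normB == 0 ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

Console.WriteLine(Cosine(new float[] { 1, 0 }, new float[] { 0, 1 }));
```

Keeping the similarity math in process is what lets both providers share one code path; a pgvector-backed search would presumably move this computation into the database.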
AOT
This package sets IsAotCompatible=false because Npgsql requires runtime reflection for connection-string parsing. It will not break AOT builds in packages that do not reference it (sibling packages such as StyloExtract.Playwright remain AOT-safe).
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net10.0
  - Mostlylucid.StyloExtract.Abstractions (>= 1.8.0-alpha.8)
  - Mostlylucid.StyloExtract.Fingerprint (>= 1.8.0-alpha.8)
  - Mostlylucid.StyloExtract.Templates (>= 1.8.0-alpha.8)
  - Npgsql (>= 10.0.3)
| Version | Downloads | Last Updated |
|---|---|---|
| 1.8.0-alpha.8 | 0 | 6/25/2026 |
| 1.8.0-alpha.4 | 0 | 6/25/2026 |
| 1.8.0-alpha.3 | 0 | 6/25/2026 |
| 1.8.0-alpha.2 | 0 | 6/25/2026 |
| 1.8.0-alpha.1 | 5 | 6/24/2026 |
StyloExtract 1.8.0-alpha.5 - 2026-06-25
========================================
In-process CPU LLM backend (LLamaSharp) + 13-model bench harness.
Operators can now embed a single ~2-3 GB GGUF model in the host
process — no Ollama server, no separate LLM daemon. Same
ILlmTextProvider contract as the Ollama backend, so the
LlmTemplateInducer + production enrichment coordinator + CLI
`template train` all work unchanged.
What's new since 1.8.0-alpha.4
------------------------------
Mostlylucid.StyloExtract.Llm.LlamaSharp
New package. ILlmTextProvider implementation backed by LLamaSharp
0.27 (the .NET binding for llama.cpp). Loads a GGUF model from
disk; the executor reads the model's chat template from GGUF
metadata so prompts written for Ollama work unchanged.
Wire-up:
    services.AddStyloExtract(...);
    services.AddStyloExtractLlamaSharp(o =>
    {
        o.ModelPath = "/var/models/Phi-4-mini-instruct-Q4_K_M.gguf";
        o.ContextSize = 8192;
        o.GpuLayerCount = 0;  // pure CPU target
    });
    services.AddStyloExtractLlmInducer("config/templates");
Anti-prompt set covers Qwen, Phi, Llama 3+, and Gemma 4 stop
tokens so the generator halts at the model's natural turn boundary
instead of echoing the chat template structure.
Known LLamaSharp 0.27 issue documented in the package README:
Gemma 4 E2B / E4B's chat template metadata isn't applied cleanly
by StatelessExecutor — the model emits Jinja2 template source
instead of YAML. Phi-4-mini, Qwen 2.5 Coder, Llama 3.2 work fine.
Model benchmark harness
New tests/StyloExtract.Llm.Benchmark project — runs the
cross-product of (models × pages) for template induction and
reports F1 / train-time / markdown-size matrices. Reuses WCXB
ground-truth shape (one HTML.gz per page id, one ground-truth
JSON) and the operator-template store path.
Model spec routing: `llamasharp:/path/to/file.gguf` resolves via
the in-process backend; anything else hits Ollama. Lets one
bench compare server (Ollama) and embedded (LlamaSharp) backends
side-by-side with identical fixtures.
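The routing rule just described amounts to a prefix check. The sketch below is an assumption about its shape, not the harness's code: `RouteModelSpec` and the backend labels are hypothetical names, and the case-sensitive match is a guess.

```csharp
using System;

// Hypothetical helper: split a model spec into (backend, model).
// A "llamasharp:" prefix selects the in-process backend and leaves a GGUF path;
// anything else is passed through unchanged as an Ollama tag.
static (string Backend, string Model) RouteModelSpec(string spec)
{
    const string prefix = "llamasharp:";
    return spec.StartsWith(prefix, StringComparison.Ordinal)
        ? ("llamasharp", spec[prefix.Length..])
        : ("ollama", spec);
}

Console.WriteLine(RouteModelSpec("llamasharp:/path/to/file.gguf"));
Console.WriteLine(RouteModelSpec("qwen3.5:4b"));
```

Note that Ollama tags themselves contain a colon (`qwen3.5:4b`), so the check has to match the full `llamasharp:` prefix rather than split on the first colon.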
Recommended models (empirically validated)
For Ollama backend:
* qwen3.5:4b - 3 GB, ~26 s, F1 0.805 (default, best)
* qwen2.5-coder:3b - 2 GB, ~21 s, F1 0.767 (smaller-and-faster pick; code-trained matters for CSS selectors)
* qwen3.5:0.8b - 1 GB, ~5 s, F1 0.528 (tiny floor)
For LLamaSharp backend (use bartowski quants):
* Phi-4-mini-instruct Q4_K_M - 2.5 GB, verified working
* Qwen 3.5 4B Q4_K_M - 3 GB, verified working
* Qwen 2.5 Coder 3B Q4_K_M - 2 GB, verified working
OllamaTextProviderOptions default model bumped
Default tag was gemma4:e4b-it-qat; switched to qwen3.5:4b per the
bench. The doc-comment now lists the smaller-and-faster pick and
the model families to avoid (thinking-mode budget burn).
Tests
494 across 11 projects. New StyloExtract.Llm.LlamaSharp.Tests
project covers ctor validation, missing-file behaviour, and
SkippableFact live-GGUF integration (skipped without
STYLOEXTRACT_LLAMASHARP_MODEL env var pointing at a GGUF file).
StyloExtract 1.8.0-alpha.4 - 2026-06-25
========================================
Tiny patch alpha to fix two consumer-facing bugs found while smoke-
installing alpha.3 against NuGet.
What's new since 1.8.0-alpha.3
------------------------------
SQLite chain CVE patched (GHSA-2m69-gcr7-jv3q)
Microsoft.Data.Sqlite bumped 10.0.1 -> 10.0.9; StyloExtract.Templates
gains a direct PackageReference to SQLitePCLRaw.bundle_e_sqlite3 so
the existing 3.0.3 central pin lifts the resolved bundle off the
vulnerable 2.1.11 line and onto SourceGear.sqlite3 3.50.4.5.
`dotnet list package --vulnerable` on consumer projects now
returns clean.
PlaywrightHtmlFetcher.Dispose() (sync path)
The fetcher previously only implemented IAsyncDisposable. When
registered as a DI singleton (which AddStyloExtractPlaywright()
does), `using var sp = services.BuildServiceProvider()` — the
canonical sync pattern — threw at container shutdown:
InvalidOperationException: 'PlaywrightHtmlFetcher' type only
implements IAsyncDisposable. Use DisposeAsync to dispose the
container.
Added a synchronous Dispose() that blocks on the async path. Container
disposal happens off the request hot path, so the blocking wait is safe.
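The sync-over-async disposal pattern described here can be sketched as follows. `SketchFetcher` is a hypothetical stand-in, not the real PlaywrightHtmlFetcher, and the `Task.Yield()` call is a placeholder for real async cleanup.

```csharp
using System;
using System.Threading.Tasks;

// A plain `using` calls Dispose(); this compiles only because the type implements IDisposable.
using (var fetcher = new SketchFetcher())
{
    Console.WriteLine("fetcher in use");
}

// Hypothetical stand-in for a service that owns async-only resources.
sealed class SketchFetcher : IAsyncDisposable, IDisposable
{
    private bool _disposed;

    public bool IsDisposed => _disposed;

    public async ValueTask DisposeAsync()
    {
        if (_disposed) return;
        _disposed = true;
        await Task.Yield(); // placeholder for real async cleanup, e.g. closing a browser
    }

    // Sync path: block on the async path so synchronous container disposal does not throw.
    public void Dispose() => DisposeAsync().AsTask().GetAwaiter().GetResult();
}
```

Blocking like this is generally only safe where no synchronization context can deadlock the wait, which matches the note's point that disposal happens off the request path.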
Both fixes are backwards-compatible drop-in patches. No code changes
needed in consumer projects beyond bumping the package version.
492 tests across 10 projects, all green.
StyloExtract 1.8.0-alpha.3 - 2026-06-25
========================================
What's new since 1.8.0-alpha.2
------------------------------
Next.js __NEXT_DATA__ rehydration extractor
Next.js apps embed their page state in a JSON blob inside
<script id="__NEXT_DATA__" type="application/json">. Schemas vary
per site (Shopify Hydrogen uses pageProps.shopifyProductsPreloadedState,
news sites use pageProps.initialState.article.body) so the
extractor walks props.pageProps recursively and collects every
string value that looks like prose (>= 80 chars, contains a space,
isn't a URL / data URI / CSS variable / serialised JSON). Conservative
key-exclusion list keeps URLs and build metadata out of the result.
Chains next to the JSON-LD and Discourse rehydration fallbacks.
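A rough sketch of the recursive walk, using System.Text.Json. The thresholds come from the description above; the function names, the exact prefix checks, and the `buildId` exclusion key are assumptions for illustration rather than the extractor's real rules.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical heuristic: long enough, contains a space, and is not a URL,
// data URI, CSS variable, or serialised JSON.
static bool LooksLikeProse(string s)
{
    if (s.Length < 80 || !s.Contains(' ')) return false;
    string t = s.TrimStart();
    if (t.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
        t.StartsWith("https://", StringComparison.OrdinalIgnoreCase) ||
        t.StartsWith("data:", StringComparison.OrdinalIgnoreCase) ||
        t.StartsWith("var(--", StringComparison.Ordinal) ||
        t.StartsWith("--", StringComparison.Ordinal)) return false;
    return !(t.StartsWith('{') || t.StartsWith('['));
}

// Recursively collect prose-like strings, skipping any property whose key is excluded.
static void Collect(JsonElement node, HashSet<string> excludedKeys, List<string> output)
{
    switch (node.ValueKind)
    {
        case JsonValueKind.Object:
            foreach (JsonProperty p in node.EnumerateObject())
                if (!excludedKeys.Contains(p.Name)) Collect(p.Value, excludedKeys, output);
            break;
        case JsonValueKind.Array:
            foreach (JsonElement item in node.EnumerateArray())
                Collect(item, excludedKeys, output);
            break;
        case JsonValueKind.String:
            string value = node.GetString() ?? "";
            if (LooksLikeProse(value)) output.Add(value);
            break;
    }
}

// Sample payload: one prose body, one long URL, one long build-metadata string.
string body = new string('a', 40) + " " + new string('b', 60);
string json =
    "{\"props\":{\"pageProps\":{\"buildId\":\"" + new string('c', 90) + " x\"," +
    "\"article\":{\"body\":\"" + body + "\"," +
    "\"url\":\"https://example.com/" + new string('u', 90) + " tail\"}}}}";

using JsonDocument doc = JsonDocument.Parse(json);
var found = new List<string>();
Collect(doc.RootElement.GetProperty("props").GetProperty("pageProps"),
        new HashSet<string> { "buildId" }, found);
Console.WriteLine(found.Count);
```

The key-exclusion set matters because build metadata can be long and contain spaces, so the string heuristic alone would let it through.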
Content-role fallback gate
The chained fallback (JSON-LD -> Next.js -> Discourse -> body-text)
previously gated on the all-blocks text sum. That sum looked
healthy for pages where the heuristic emitted 3 KB of nav + footer +
boilerplate while finding zero MainContent — the renderer's
MainContentOnly / Wcxb profiles drop those roles anyway, so the
actual markdown is 0 chars. The gate now uses content-role text
mass only. 18 catastrophic pages recovered without any new code,
just the gate change.
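The difference between the two gates can be shown with a small sketch. The block shape, the role names other than MainContent, and the 200-character threshold are assumptions for illustration; the notes do not state the gate's actual threshold.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: total text length of the blocks whose role passes the filter.
static int TextMass(IEnumerable<(string Role, string Text)> blocks, Func<string, bool> includeRole) =>
    blocks.Where(b => includeRole(b.Role)).Sum(b => b.Text.Length);

// A page where the heuristic found 3 KB of chrome and zero MainContent.
var blocks = new List<(string Role, string Text)>
{
    ("Navigation", new string('n', 2000)),
    ("Footer", new string('f', 1000)),
};

const int minChars = 200; // illustrative threshold, not taken from the library

int allBlocks = TextMass(blocks, _ => true);                       // input to the old gate
int contentOnly = TextMass(blocks, role => role == "MainContent"); // input to the new gate

Console.WriteLine($"old gate triggers fallback: {allBlocks < minChars}");
Console.WriteLine($"new gate triggers fallback: {contentOnly < minChars}");
```

On this input the all-blocks sum looks healthy while the content-role sum is zero, so only the second gate lets the fallback chain run.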
Playwright auto-fallback decorator
AddStyloExtractPlaywright() wires PlaywrightHtmlFetcher AND
decorates the existing ILayoutExtractor with a RenderingLayoutExtractor
that runs static extraction first, then re-fetches via Playwright
only when:
* the caller passed a non-null sourceUri
* the static result has < 200 chars of content-role text
* an IRenderedHtmlFetcher is wired in DI
File-only callers never trigger a render. Operators who don't want
the Chromium dependency simply don't add the package. Three guards
against wasted work: Playwright throws -> return static; rendered
HTML same length as static -> skip the re-extract; re-extract
yields no improvement -> return static.
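The trigger conditions and the three guards can be read as one decision function. The sketch below is an assumption about how such a policy might be composed: `ShouldRender` and `ExtractWithFallback` are hypothetical names, the delegates stand in for the real extractor and fetcher, and "extraction" is reduced to returning content-role text.

```csharp
using System;

// Hypothetical trigger check: all three conditions from the list above must hold.
static bool ShouldRender(Uri? sourceUri, int staticContentChars, bool fetcherRegistered) =>
    sourceUri is not null && staticContentChars < 200 && fetcherRegistered;

// Hypothetical orchestration with the three guards against wasted work.
static string ExtractWithFallback(
    string staticHtml, Uri? sourceUri, bool fetcherRegistered,
    Func<string, string> extract, Func<Uri, string> render)
{
    string staticResult = extract(staticHtml);
    if (sourceUri is null || !ShouldRender(sourceUri, staticResult.Length, fetcherRegistered))
        return staticResult;

    string renderedHtml;
    try { renderedHtml = render(sourceUri); }
    catch (Exception) { return staticResult; }                    // guard 1: renderer threw

    if (renderedHtml.Length == staticHtml.Length)
        return staticResult;                                      // guard 2: same length, skip re-extract

    string renderedResult = extract(renderedHtml);
    return renderedResult.Length > staticResult.Length            // guard 3: keep only improvements
        ? renderedResult
        : staticResult;
}

// Toy extractor: finds content only when an <article> element is present.
Func<string, string> extract = html => html.Contains("<article>") ? new string('x', 500) : "";

string result = ExtractWithFallback(
    "<div id=\"root\"></div>", new Uri("https://example.com/"), true,
    extract, _ => "<article>rendered content</article>");
Console.WriteLine(result.Length);
```

Passing a null `sourceUri` models the file-only caller: the render delegate is never invoked.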
Usage:
    services.AddStyloExtract(...);
    services.AddStyloExtractPlaywright();
492 tests across 10 projects, 6 new unit tests for the decorator
policy.
Aggregate WCXB (1495 dev pages, Wcxb profile):
| Stage | F1 | Catastrophic |
|----------------------------------------|-------:|-------------:|
| 1.8.0-alpha.2 | 0.760 | 25 |
| + Next.js extractor | same | |
| + content-role fallback gate | 0.760 | 17 |
| + 14 LLM-trained YAMLs | 0.760 | 17 |
| (Playwright auto-fallback) | -- | |
Playwright auto-fallback is wired but not exercised in the WCXB
benchmark by default — needs `playwright install chromium`. Real-
world consumers with the package added see automatic recovery for
JS-rendered SPAs whose content is hydrated client-side.
StyloExtract 1.8.0-alpha.2 - 2026-06-25
========================================
LLM template-training loop, Discourse rehydration, plus a stack of
heuristic + selection fixes that move the WCXB dev split from F1 0.673
(post-1.7.1, MainContentOnly profile) to F1 0.760 (Wcxb plain-text
profile, with operator-trained templates + Discourse rehydration
active). Catastrophic extraction failures (pred_chars ≤ 5) drop from
92 of 1495 pages to 25.
Beats Readability on every page type. Closes the gap to Trafilatura by
~40% on Article + Documentation. Above v1.5.4 baseline (0.718) by
+0.042 — and that's keeping all the GFM markdown structure (sidebar
TOCs, blockquotes, GFM tables) in the runtime output, not stripping
to plain text for benchmark flattery.
What's new since 1.8.0-alpha.1
------------------------------
LLM template training loop (`stylo-extract template train`)
Operator-driven synchronous LLM template specialisation, the
counterpart to the existing async enrichment coordinator. Smart-
routes between induce (no template yet) and repair (template
exists but underperforms).
Closed-selector prompt: every selector the model can choose from
is enumerated from the actual page DOM via DocumentSelectorCatalog
and handed to the LLM in the prompt. Inventing selectors fails.
Post-parse AngleSharp validation: every selector the model returns
is run through doc.QuerySelectorAll. Selectors that match zero
elements are dropped; templates whose MainContent rule has no
surviving selector are rejected.
Repair prompt re-angled as a diagnostic: "why is this failing AND
how should it work for this page" instead of just "produce a
corrected template."
Hash-prefixed selectors (`#my-id`) are now properly quoted in
emitted YAML so they round-trip; the inducer also pre-repairs
unquoted hash selectors in the LLM response before parse.
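Quoting matters because in YAML a `#` preceded by whitespace starts a comment, so an unquoted `- #my-id` parses as an empty item. A pre-repair pass might look like the sketch below; the regex and the `QuoteHashSelectors` name are assumptions, it expects LF line endings, and it would also quote a genuine comment that directly follows a key.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical pre-repair: wrap list items or mapping values that start with '#'
// in single quotes, doubling any embedded single quote as YAML requires.
static string QuoteHashSelectors(string yaml) =>
    Regex.Replace(
        yaml,
        @"^(?<lead>[ \t]*(?:-|[\w-]+:)[ \t]+)(?<sel>#\S.*?)[ \t]*$",
        m => m.Groups["lead"].Value + "'" + m.Groups["sel"].Value.Replace("'", "''") + "'",
        RegexOptions.Multiline);

string llmYaml = "selectors:\n  - #main-content\n  - article.post\n";
Console.WriteLine(QuoteHashSelectors(llmYaml));
```

Values that are already quoted are left alone, because the pattern requires `#` to be the first character after the list marker or key.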
OllamaTextProvider bumps NumPredict default 1024 → 4096
(reasoning-tagged models burn tokens on chain-of-thought before
the answer) and falls back to message.thinking when message.content
is empty.
`template repair` command + `LlmTemplateInducer.RepairFromSkeletonAsync`
+ production coordinator dispatch (TemplateEnrichmentJob.Kind +
LayoutExtractor enqueue on low-output existing-template hits).
Discourse data-preloaded rehydration
Discourse renders every page as an Ember.js SPA. Static HTML ships
near-zero post content; the actual topic + posts live in a JSON
blob in <div id="data-preloaded" data-preloaded="...JSON...">.
DiscourseRehydrationExtractor parses the JSON, walks
topic_NNN.post_stream.posts[*].cooked, strips tags, and emits the
result as a synthetic MainContent fallback block — same shape as
the existing JSON-LD fallback. Discourse powers 5 000+ public
forums; one upstream extractor covers them all.
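The walk over `topic_NNN.post_stream.posts[*].cooked` might be sketched as below. This is an illustration under assumptions, not DiscourseRehydrationExtractor itself: the input is taken to be the already HTML-decoded attribute value, the topic entry is accepted either as a nested object or as a JSON-encoded string, and tags are stripped with a crude regex rather than a real HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical helper: return the plain text of every post found in the preloaded JSON.
static List<string> ExtractCookedPosts(string preloadedJson)
{
    var posts = new List<string>();
    using JsonDocument doc = JsonDocument.Parse(preloadedJson);
    if (doc.RootElement.ValueKind != JsonValueKind.Object) return posts;

    foreach (JsonProperty entry in doc.RootElement.EnumerateObject())
    {
        if (!entry.Name.StartsWith("topic_", StringComparison.Ordinal)) continue;

        // Assumption: the topic value may itself be a JSON-encoded string.
        string topicJson = entry.Value.ValueKind == JsonValueKind.String
            ? (entry.Value.GetString() ?? "{}")
            : entry.Value.GetRawText();
        using JsonDocument topic = JsonDocument.Parse(topicJson);

        if (topic.RootElement.ValueKind != JsonValueKind.Object ||
            !topic.RootElement.TryGetProperty("post_stream", out JsonElement stream) ||
            stream.ValueKind != JsonValueKind.Object ||
            !stream.TryGetProperty("posts", out JsonElement postArray) ||
            postArray.ValueKind != JsonValueKind.Array) continue;

        foreach (JsonElement post in postArray.EnumerateArray())
        {
            if (post.ValueKind != JsonValueKind.Object ||
                !post.TryGetProperty("cooked", out JsonElement cooked) ||
                cooked.ValueKind != JsonValueKind.String) continue;

            // Crude tag strip, entity decode, then whitespace collapse.
            string text = Regex.Replace(cooked.GetString() ?? "", "<[^>]+>", " ");
            text = WebUtility.HtmlDecode(text);
            text = Regex.Replace(text, @"\s+", " ").Trim();
            if (text.Length > 0) posts.Add(text);
        }
    }
    return posts;
}

string sample = """
{"topic_42": {"post_stream": {"posts": [{"cooked": "<p>Hello <b>world</b></p>"}]}}}
""";
List<string> extracted = ExtractCookedPosts(sample);
Console.WriteLine(extracted[0]);
```

The extracted strings would then be joined into the synthetic MainContent block; malformed JSON is not handled here and would surface as a JsonException.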
WCXB lift: 6 of 13 catastrophic forum pages go from F1=0 to
F1=0.83–0.99. Forum category F1 0.477 → 0.535.
Wcxb plain-text profile
WCXB-style word-overlap benchmarks score against plain-text gold.
The default MainContentOnly / RagFull output emits GFM Markdown —
headings, lists, sidebar TOCs, multi-paragraph blockquotes — that
improves AI / human readability but registers as precision noise
against plain-text comparison.
New ExtractionProfile.Wcxb uses MainContentOnly's role-set but
emits each block's plain Text instead of its Markdown. Strictly
a benchmark / comparison profile — runtime callers keep their
existing profile and continue getting structured GFM.
Heuristic + selection fixes
DomCleaner: strip <select> globally so <option> text stops
leaking on category dropdowns. mostlylucid.net opened with 290+
category names dumped into the output; now opens with the actual
blog list.
IntraBlockCleaner: content-guard the contamination-hint substring
match. "sidebar" substring was eating WordPress / SNOFlex article
bodies whose class contained "sidebar-mode-single". 28 catastrophic
article pages recovered.
LayoutExtractor: body-text fallback for old-school flat HTML
without <main>/<article>/section wrappers. erikdemaine.org/foldcut
and similar plain H1/H2/P-under-body pages now extract.
LayoutExtractor: detect chrome-heavy applicator output as bug-out.
Stale templates applied to wrong-shape pages produced 1 char of
MainContent while combinedText looked fine (header + footer
selectors found chrome). esprit-barbecue, nike, rei collections
recovered.
HeuristicBlockClassifier: empty-semantic-wrapper handling and
body-spanning <form> fall-through. ASP.NET WebForms pages
(drainblasterbill, etc.) recovered.
Framework-content-class-hints: 20 new patterns — Discourse, phpBB,
vBulletin, PrestaShop, WooCommerce, Shopify, BigCommerce,
Squarespace, Webflow, Wix, Joomla, GitHub Pages, plus some misc.
Benchmark harness
WCXB harness gains --operator-templates <root> for loading
YAML files produced by `template train`, --page-ids for fast
repro of individual failures.
Aggregate WCXB (1495 dev pages, Wcxb profile):
| System                      |    F1 | Precision | Recall |
|-----------------------------|------:|----------:|-------:|
| StyloExtract v1.8.0-alpha.2 | 0.760 |     0.756 |  0.849 |
| rs-trafilatura              | 0.859 |     0.863 |  0.890 |
| Trafilatura                 | 0.791 |     0.852 |  0.793 |
| Readability                 | 0.675 |     0.685 |  0.713 |
Compatibility
Backwards-compatible with 1.8.0-alpha.1. All changes are either new
code paths (Discourse extractor, Wcxb profile, train CLI), strictly
better selection (the heuristic fixes), or schema-additive
(TemplateEnrichmentJob gains optional Kind / BadMarkdownSample with
default Induce). Existing operator templates and trained YAMLs from
alpha.1 continue to work unchanged.
Suite: 486 tests across 10 projects, all green.