Mostlylucid.StyloExtract.Ml 1.8.0-alpha.12

This is a prerelease version of Mostlylucid.StyloExtract.Ml.

dotnet add package Mostlylucid.StyloExtract.Ml --version 1.8.0-alpha.12

NuGet\Install-Package Mostlylucid.StyloExtract.Ml -Version 1.8.0-alpha.12

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="Mostlylucid.StyloExtract.Ml" Version="1.8.0-alpha.12" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

Directory.Packages.props:

<PackageVersion Include="Mostlylucid.StyloExtract.Ml" Version="1.8.0-alpha.12" />

Project file:

<PackageReference Include="Mostlylucid.StyloExtract.Ml" />

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.

paket add Mostlylucid.StyloExtract.Ml --version 1.8.0-alpha.12

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: Mostlylucid.StyloExtract.Ml, 1.8.0-alpha.12"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package Mostlylucid.StyloExtract.Ml@1.8.0-alpha.12

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Install as a Cake Addin:

#addin nuget:?package=Mostlylucid.StyloExtract.Ml&version=1.8.0-alpha.12&prerelease

Install as a Cake Tool:

#tool nuget:?package=Mostlylucid.StyloExtract.Ml&version=1.8.0-alpha.12&prerelease

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

StyloExtract.Ml

ML block classifier for StyloExtract. Augments the heuristic classifier on novel layouts (custom CSS frameworks, e-commerce SPAs) where the framework-content-class-hints catalog has no entry.

Phase 1 (this version): pure-C# AOT-clean per-element feature extractor. No model is loaded; no ONNX runtime; the package exposes ElementFeatureExtractor so consumers can dump features for training or test the extraction surface.

Phase 2+: ONNX-runtime inference, trained model embedded as a resource, MlBlockClassifier IBlockClassifier implementation, DI helper AddStyloExtractMl(). Tracked in docs/ml-classifier-v2-design.md.

Product	Compatible and additional computed target framework versions.
.NET	net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

net10.0
- Microsoft.Extensions.Logging.Abstractions (>= 10.0.5)
- Microsoft.ML.OnnxRuntime (>= 1.20.1)
- Mostlylucid.StyloExtract.Abstractions (>= 1.8.0-alpha.12)
- Mostlylucid.StyloExtract.Heuristics (>= 1.8.0-alpha.12)

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
1.8.0-alpha.12	0	6/26/2026
1.8.0-alpha.11	0	6/26/2026
1.8.0-alpha.10	0	6/26/2026
1.8.0-alpha.9	0	6/25/2026
1.8.0-alpha.8	37	6/25/2026
1.8.0-alpha.4	40	6/25/2026
1.8.0-alpha.3	45	6/25/2026
1.8.0-alpha.2	37	6/25/2026
1.8.0-alpha.1	39	6/24/2026

StyloExtract 1.8.0-alpha.12 - 2026-06-26
=========================================

DI wire-up fix for deterministic-template YAML persistence
-----------------------------------------------------------

alpha.11 introduced DeterministicTemplateYamlSink + the
AddStyloExtractOperatorTemplates registration, but AddStyloExtract's
LayoutExtractor construction did not pass the sink through to the
extractor — so even when the sink was registered in DI, LayoutExtractor's
optional ctor parameter defaulted to null and no `<host>-deterministic.yaml`
file was ever written.

Fixed by threading `sp.GetService<DeterministicTemplateYamlSink>()` to the
LayoutExtractor constructor in AddStyloExtract. No API change; consumers
who already called AddStyloExtractOperatorTemplates start seeing
deterministic YAML files immediately after upgrading.

StyloExtract 1.8.0-alpha.11 - 2026-06-26
=========================================

Sequenced architecture extension: deterministic templates with
extended classification — Title role, Sitemap profile, deterministic
YAML persistence, and a sitemap CLI verb.

Title BlockRole
---------------

New BlockRole.Title value distinguishes the page-level <h1> (the single
H1 the rest of the page is "about") from intra-content Heading
(H2/H3/H4 inside the body). HeuristicBlockClassifier surfaces the Title
via a shared PageTitleDetector helper, picking the H1 in/closest-to
<main>/<article> and falling back to earliest-in-document with multiple
H1s. ExtractorApplicator surfaces Title on the fast-path / applicator
branch too, so output stays consistent across novel and cached requests
(matters for the response-cache ETag). LlmInducerPrompts list Title in
the allowed-roles set with a one-line distinction from Heading.

MainContentOnly, RagFull, Wcxb, and AgentNavigation profiles all
include Title in their role-set. The renderer's quality gate, which
drops short text, is bypassed for Title and Heading so that
intentionally terse page titles ("Home", "About") still surface.
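
The selection rule described above can be sketched roughly as follows. This is an illustration only, not PageTitleDetector's actual code: `H1Candidate` and `TitlePickerSketch` are hypothetical names, the DOM is reduced to a flat list of H1s, and the "closest-to" part of the rule is not modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a parsed <h1>: its position in document order
// and whether it sits inside <main> or <article>.
public sealed record H1Candidate(string Text, int DocumentIndex, bool InsideMainOrArticle);

public static class TitlePickerSketch
{
    // Prefer the earliest H1 inside <main>/<article>; otherwise fall back
    // to the earliest H1 in the document. Returns null when there is none.
    public static H1Candidate? Pick(IEnumerable<H1Candidate> candidates)
    {
        var ordered = candidates.OrderBy(c => c.DocumentIndex).ToList();
        return ordered.FirstOrDefault(c => c.InsideMainOrArticle)
            ?? ordered.FirstOrDefault();
    }
}
```

With a site-name H1 in the header and an article H1 inside `<main>`, the sketch picks the article H1 even though it appears later in the document.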

Sitemap ExtractionProfile
-------------------------

New ExtractionProfile.Sitemap value emits only Title + Heading +
PrimaryNavigation + SecondaryNavigation + Breadcrumb. It targets
sitemap / outline / crawler use cases that want page titles and the
site's nav structure without pulling body content. The CLI's --profile
flag recognises `sitemap` automatically (enum binding).

Deterministic YAML persistence
------------------------------

New DeterministicTemplateYamlSink, wired automatically when
AddStyloExtractOperatorTemplates(root) is called, writes
<host>-deterministic.yaml alongside each heuristic-induced template's
SQLite row. The file carries every role the heuristic detected (Title,
MainContent, Navigation, Footer, …) — auditable, hand-editable, and
diffable, mirroring how LLM-induced templates have always been written
by TemplateEnrichmentCoordinator. The SQLite store remains the
authoritative source at match time; YAML is best-effort and
non-blocking.

stylo-extract sitemap CLI verb
------------------------------

New `sitemap` subcommand: takes one or more starting URLs, extracts
each with ExtractionProfile.Sitemap, follows internal nav links to
--max-depth (default 3), and emits a markdown tree of titles + URLs to
stdout or --out <file>. Safety caps: 50 pages by default
(--max-pages), 1s between requests (--delay-ms), no off-host follow.
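
A bounded crawl with these caps might look like the sketch below. It is a minimal breadth-first walk under stated assumptions, not the CLI's implementation: `SitemapCrawlSketch` and the `getLinks` delegate are hypothetical, all URLs are assumed absolute, and the per-request delay is left out so the example stays synchronous.

```csharp
using System;
using System.Collections.Generic;

public static class SitemapCrawlSketch
{
    // Breadth-first walk from the start URL. getLinks is a caller-supplied
    // (hypothetical) function returning the nav links found on a page.
    public static List<(Uri Url, int Depth)> Crawl(
        Uri start,
        Func<Uri, IEnumerable<Uri>> getLinks,
        int maxDepth = 3,
        int maxPages = 50)
    {
        var visited = new HashSet<string> { start.AbsoluteUri };
        var queue = new Queue<(Uri Url, int Depth)>();
        var result = new List<(Uri Url, int Depth)>();
        queue.Enqueue((start, 0));

        while (queue.Count > 0 && result.Count < maxPages)
        {
            var (url, depth) = queue.Dequeue();
            result.Add((url, depth));
            if (depth >= maxDepth) continue;   // depth cap: do not expand further

            foreach (var link in getLinks(url))
            {
                // No off-host follow: skip links whose host differs from the start URL.
                if (!string.Equals(link.Host, start.Host, StringComparison.OrdinalIgnoreCase))
                    continue;
                if (visited.Add(link.AbsoluteUri))
                    queue.Enqueue((link, depth + 1));
            }
        }
        return result;
    }
}
```

The returned (URL, depth) pairs are enough to print an indented tree; a real crawler would also pause between requests.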

Migration
---------

No source change required for consumers. The new Title role is
additive (existing switches that handled BlockRole pattern-match
defaults will continue to compile and behave identically; switches
that exhaustively listed roles were updated). Deterministic YAML
writing only activates when AddStyloExtractOperatorTemplates is
called, so consumers that don't use operator templates see no new
filesystem activity.

StyloExtract 1.8.0-alpha.10 - 2026-06-26
=========================================

LLM classification accuracy for chrome patterns
------------------------------------------------

Symptom: induced templates were labelling language pickers, filter UI,
locale switchers, and pagination strips as MainContent on
server-rendered blogs (mostlylucid.net being the canonical reproducer).
The downstream RagFull renderer's role-filter — which already drops
PrimaryNavigation / SecondaryNavigation / Form / Boilerplate — never
saw them as nav and so left them in the extracted markdown, producing
output WORSE than the deterministic heuristic.

Fix: expanded the induction and repair system prompts with explicit
"chrome pattern → role" examples (language picker → PrimaryNavigation;
filter / faceted-search → Form; pagination → SecondaryNavigation;
cookie banner → CookieBanner; newsletter signup → Form; social-share
→ Boilerplate). Also nudged the model to prefer narrower MainContent
selectors that don't include chrome as nested children.

DomSkeletonRenderer now surfaces structural ARIA attributes (`role`,
`aria-label`, `aria-labelledby`) alongside each element's tag / class /
id, giving the LLM more signal for distinguishing landmark regions
(nav / form / banner) from content. The hash-class-name filter is also
slightly less aggressive: pure PascalCase ids (e.g. `LanguageDropDown`)
now survive into the skeleton so the LLM can use them as selectors,
while real CSS-module hashes (mixed-case + digits, or 4+ case
transitions) are still dropped.
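
One way to read the keep/drop rule above is sketched here. The exact thresholds and ordering inside DomSkeletonRenderer are not stated in these notes, so treat `HashNameFilterSketch` as a hypothetical approximation: it assumes the PascalCase check runs first and that a "case transition" is any adjacent pair of letters that differ in case.

```csharp
using System;
using System.Linq;

public static class HashNameFilterSketch
{
    public static bool LooksLikeHash(string name)
    {
        if (string.IsNullOrEmpty(name)) return false;

        // Pure PascalCase (ASCII letters only, leading capital, some lowercase)
        // is treated as a human-written identifier and kept.
        bool lettersOnly = name.All(ch => (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'));
        if (lettersOnly && char.IsUpper(name[0]) && name.Any(char.IsLower)) return false;

        // Mixed case plus digits is assumed to be a CSS-module hash.
        bool hasUpper = name.Any(char.IsUpper);
        bool hasLower = name.Any(char.IsLower);
        bool hasDigit = name.Any(char.IsDigit);
        if (hasUpper && hasLower && hasDigit) return true;

        // Otherwise count adjacent letter pairs that switch case.
        int transitions = 0;
        for (int i = 1; i < name.Length; i++)
        {
            char prev = name[i - 1], cur = name[i];
            if (char.IsLetter(prev) && char.IsLetter(cur) && char.IsUpper(prev) != char.IsUpper(cur))
                transitions++;
        }
        return transitions >= 4;
    }
}
```

Under these assumptions `LanguageDropDown` is kept, while names such as `a1B2c3D4` are dropped.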

The renderer side (TypedMarkdownRenderer.ShouldEmit) is unchanged —
it was already correctly filtering by role. The fix is entirely about
label accuracy and the signal the LLM sees.

Migration: no source change required for consumers; templates induced
post-1.8.0-alpha.10 will produce cleaner output under
ExtractionProfile.RagFull and MainContentOnly. Cached templates
induced under earlier alphas will keep producing the old output until
they're refit (centroid drift triggers refit automatically; or
operators can manually clear the template store).

Regression tests: tests/StyloExtract.Core.Tests adds
LlmInducerPromptAntiPatternTests (prompt snapshot) and
MostlylucidLlmInductionRegressionTests (applies a synthetic bad-wide-
wrapper template against a captured mostlylucid.net fixture, proves
the language-picker / filter chrome leaks; then shows a properly
authored RepeatedItem template excludes per-card chrome cleanly).

StyloExtract 1.8.0-alpha.9 - 2026-06-25
========================================

App-safe AddStyloExtract + LlmInductionFired flag
--------------------------------------------------

Two changes that downstream desktop / CLI consumers (e.g. lucidVIEW-FULL)
need:

1. The basic `AddStyloExtract(IServiceCollection, Action<StyloExtractOptions>?)`
  DI extension and its companion `StyloExtractOptions` type now live in
  `Mostlylucid.StyloExtract.Core` instead of `Mostlylucid.StyloExtract.AspNetCore`.
  Non-AspNetCore hosts (desktop apps, CLI tools, console workers) can call:

      services.AddStyloExtract(o => o.StorePath = "templates.db");

  without pulling `Microsoft.AspNetCore.App` (~70 MB of framework runtime).

  `Mostlylucid.StyloExtract.AspNetCore` keeps its `Action<ResponsePolicyBuilder>`
  overloads (response-policy framework, markdown content negotiation
  middleware, operator-template minimal-API endpoints) — those legitimately
  need AspNetCore. They now delegate to the Core overload internally.

  Migration: no source change for AspNetCore consumers. Desktop / CLI
  consumers can reference `Mostlylucid.StyloExtract.Core` alone.

2. `ExtractionResult.LlmInductionFired` (new bool property) signals
  whether the LLM template inducer ran during this extraction. Downstream
  telemetry surfaces (e.g. status bars, NDJSON exports) can now show LLM
  utilisation per call without reflection or polling internal state.
  Defaults to false for non-LLM hosts and heuristic-only extractions;
  set true only when the LlmTemplateInducer (or any future ILlmTextProvider-
  backed inducer) actually invoked the LLM.

StyloExtract 1.8.0-alpha.6 - 2026-06-25
========================================

App-safe AddStyloExtract — moved to StyloExtract.Core
------------------------------------------------------

The basic `AddStyloExtract(IServiceCollection, Action<StyloExtractOptions>?)`
DI extension and its companion `StyloExtractOptions` type have moved from
`Mostlylucid.StyloExtract.AspNetCore` to `Mostlylucid.StyloExtract.Core`.
Desktop, CLI, and any non-AspNetCore host can now call:

   services.AddStyloExtract(o => o.StorePath = "templates.db");

without pulling `Microsoft.AspNetCore.App` (~70 MB of framework runtime).

`Mostlylucid.StyloExtract.AspNetCore` keeps its `Action<ResponsePolicyBuilder>`
overloads (response-policy framework, markdown content negotiation
middleware, operator-template minimal-API endpoints) — those legitimately
need AspNetCore. They now delegate to the Core overload internally.

Migration: no source change required for AspNetCore consumers. Desktop /
CLI consumers can drop direct dependencies on `Mostlylucid.StyloExtract.AspNetCore`
and reference `Mostlylucid.StyloExtract.Core` alone.

StyloExtract 1.8.0-alpha.5 - 2026-06-25
========================================

In-process CPU LLM backend (LLamaSharp) + 13-model bench harness.
Operators can now embed a single ~2-3 GB GGUF model in the host
process — no Ollama server, no separate LLM daemon. Same
ILlmTextProvider contract as the Ollama backend, so the
LlmTemplateInducer + production enrichment coordinator + CLI
`template train` all work unchanged.

What's new since 1.8.0-alpha.4
------------------------------

Mostlylucid.StyloExtract.Llm.LlamaSharp

   New package. ILlmTextProvider implementation backed by LLamaSharp
   0.27 (the .NET binding for llama.cpp). Loads a GGUF model from
   disk; the executor reads the model's chat template from GGUF
   metadata so prompts written for Ollama work unchanged.

   Wire-up:

       services.AddStyloExtract(...);
       services.AddStyloExtractLlamaSharp(o =>
       {
           o.ModelPath = "/var/models/Phi-4-mini-instruct-Q4_K_M.gguf";
           o.ContextSize = 8192;
           o.GpuLayerCount = 0;        // pure CPU target
       });
       services.AddStyloExtractLlmInducer("config/templates");

   Anti-prompt set covers Qwen, Phi, Llama 3+, and Gemma 4 stop
   tokens so the generator halts at the model's natural turn boundary
   instead of echoing the chat template structure.

   Known LLamaSharp 0.27 issue documented in the package README:
   Gemma 4 E2B / E4B's chat template metadata isn't applied cleanly
   by StatelessExecutor — the model emits Jinja2 template source
   instead of YAML. Phi-4-mini, Qwen 2.5 Coder, Llama 3.2 work fine.

Model benchmark harness

   New tests/StyloExtract.Llm.Benchmark project — runs the
   cross-product of (models × pages) for template induction and
   reports F1 / train-time / markdown-size matrices. Reuses WCXB
   ground-truth shape (one HTML.gz per page id, one ground-truth
   JSON) and the operator-template store path.

   Model spec routing: `llamasharp:/path/to/file.gguf` resolves via
   the in-process backend; anything else hits Ollama. Lets one
   bench compare server (Ollama) and embedded (LlamaSharp) backends
   side-by-side with identical fixtures.
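
The routing rule is simple enough to sketch. `ModelSpecSketch` and `LlmBackend` are hypothetical names for illustration; only the `llamasharp:` prefix comes from the notes above, and the case-sensitive prefix match is an assumption.

```csharp
using System;

public enum LlmBackend { Ollama, LlamaSharp }

public static class ModelSpecSketch
{
    private const string Prefix = "llamasharp:";

    // "llamasharp:/path/to/file.gguf" -> in-process backend with the file path;
    // anything else is passed through as an Ollama model tag.
    public static (LlmBackend Backend, string Target) Parse(string spec)
    {
        if (spec.StartsWith(Prefix, StringComparison.Ordinal))
            return (LlmBackend.LlamaSharp, spec.Substring(Prefix.Length));
        return (LlmBackend.Ollama, spec);
    }
}
```

An Ollama tag such as `qwen3.5:4b` also contains a colon, which is why the sketch matches the whole prefix rather than splitting on `:`.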

Recommended models (empirically validated)

   For Ollama backend:
     * qwen3.5:4b           — 3 GB, ~26 s, F1 0.805 (default, best)
     * qwen2.5-coder:3b     — 2 GB, ~21 s, F1 0.767 (smaller-and-faster pick;
                                                      code-trained matters for
                                                      CSS selectors)
     * qwen3.5:0.8b         — 1 GB, ~5 s, F1 0.528 (tiny floor)

   For LLamaSharp backend (use bartowski quants):
     * Phi-4-mini-instruct Q4_K_M    — 2.5 GB, verified working
     * Qwen 3.5 4B Q4_K_M            — 3 GB, verified working
     * Qwen 2.5 Coder 3B Q4_K_M      — 2 GB, verified working

OllamaTextProviderOptions default model bumped

   Default tag was gemma4:e4b-it-qat; switched to qwen3.5:4b per the
   bench. The doc-comment now lists the smaller-and-faster pick and
   the model families to avoid (thinking-mode budget burn).

Tests

   494 across 11 projects. New StyloExtract.Llm.LlamaSharp.Tests
   project covers ctor validation, missing-file behaviour, and
   SkippableFact live-GGUF integration (skipped without
   STYLOEXTRACT_LLAMASHARP_MODEL env var pointing at a GGUF file).

StyloExtract 1.8.0-alpha.4 - 2026-06-25
========================================

Tiny patch alpha to fix two consumer-facing bugs found while smoke-
installing alpha.3 against NuGet.

What's new since 1.8.0-alpha.3
------------------------------

SQLite chain CVE patched (GHSA-2m69-gcr7-jv3q)

   Microsoft.Data.Sqlite bumped 10.0.1 -> 10.0.9; StyloExtract.Templates
   gains a direct PackageReference to SQLitePCLRaw.bundle_e_sqlite3 so
   the existing 3.0.3 central pin lifts the resolved bundle off the
   vulnerable 2.1.11 line and onto SourceGear.sqlite3 3.50.4.5.
   `dotnet list package --vulnerable` on consumer projects now
   returns clean.

PlaywrightHtmlFetcher.Dispose() (sync path)

   The fetcher previously only implemented IAsyncDisposable. When
   registered as a DI singleton (which AddStyloExtractPlaywright()
   does), `using var sp = services.BuildServiceProvider()` — the
   canonical sync pattern — threw at container shutdown:

     InvalidOperationException: 'PlaywrightHtmlFetcher' type only
     implements IAsyncDisposable. Use DisposeAsync to dispose the
     container.

   Added a sync Dispose() that blocks on the async path. Container
   disposal happens off the request hot path, so the sync wait is safe.
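
The pattern looks roughly like this. `AsyncResourceSketch` is a hypothetical stand-in, not PlaywrightHtmlFetcher itself, and `Task.Yield()` takes the place of the real asynchronous cleanup.

```csharp
using System;
using System.Threading.Tasks;

public sealed class AsyncResourceSketch : IAsyncDisposable, IDisposable
{
    public bool Disposed { get; private set; }

    public async ValueTask DisposeAsync()
    {
        if (Disposed) return;
        await Task.Yield();   // stands in for real async cleanup (e.g. closing a browser)
        Disposed = true;
    }

    // Sync path: block on the async cleanup so that a synchronously disposed
    // container can still tear this singleton down.
    public void Dispose() => DisposeAsync().AsTask().GetAwaiter().GetResult();
}
```

Blocking like this is generally only reasonable where no single-threaded synchronization context is involved, which matches the container-shutdown case described above.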

Both fixes are backwards-compatible drop-in patches. No code changes
needed in consumer projects beyond bumping the package version.

492 tests across 10 projects, all green.

StyloExtract 1.8.0-alpha.3 - 2026-06-25
========================================

What's new since 1.8.0-alpha.2
------------------------------

Next.js __NEXT_DATA__ rehydration extractor

   Next.js apps embed their page state in a JSON blob inside
   <script id="__NEXT_DATA__" type="application/json">. Schemas vary
   per site (Shopify Hydrogen uses pageProps.shopifyProductsPreloadedState,
   news sites use pageProps.initialState.article.body) so the
   extractor walks props.pageProps recursively and collects every
   string value that looks like prose (>= 80 chars, contains a space,
   isn't a URL / data URI / CSS variable / serialised JSON). Conservative
   key-exclusion list keeps URLs and build metadata out of the result.

   Chains next to the JSON-LD and Discourse rehydration fallbacks.
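
A reduced version of that walk, using System.Text.Json, is sketched below. `NextDataProseSketch` is a hypothetical name, the prose test is a simplified reading of the rules listed above, and the key-exclusion list is omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class NextDataProseSketch
{
    // Takes the JSON text of the __NEXT_DATA__ script and returns prose-like strings.
    public static List<string> CollectProse(string nextDataJson)
    {
        var found = new List<string>();
        using var doc = JsonDocument.Parse(nextDataJson);
        if (doc.RootElement.ValueKind == JsonValueKind.Object &&
            doc.RootElement.TryGetProperty("props", out var props) &&
            props.ValueKind == JsonValueKind.Object &&
            props.TryGetProperty("pageProps", out var pageProps))
        {
            Walk(pageProps, found);
        }
        return found;
    }

    private static void Walk(JsonElement el, List<string> found)
    {
        switch (el.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (var p in el.EnumerateObject()) Walk(p.Value, found);
                break;
            case JsonValueKind.Array:
                foreach (var item in el.EnumerateArray()) Walk(item, found);
                break;
            case JsonValueKind.String:
                var s = el.GetString();
                if (s != null && LooksLikeProse(s)) found.Add(s);
                break;
        }
    }

    // Simplified prose test: long enough, has a space, and is not a URL,
    // data URI, CSS variable, or serialised JSON.
    public static bool LooksLikeProse(string s)
    {
        var t = s.Trim();
        if (t.Length < 80 || !t.Contains(' ')) return false;
        if (t.StartsWith("http://", StringComparison.OrdinalIgnoreCase)) return false;
        if (t.StartsWith("https://", StringComparison.OrdinalIgnoreCase)) return false;
        if (t.StartsWith("data:", StringComparison.OrdinalIgnoreCase)) return false;
        if (t.StartsWith("--", StringComparison.Ordinal)) return false;
        if (t.StartsWith("var(--", StringComparison.Ordinal)) return false;
        if (t[0] == '{' || t[0] == '[') return false;
        return true;
    }
}
```

Because the walk ignores key names entirely, it does not need to know whether a site nests its article under `initialState.article.body` or some other path.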

Content-role fallback gate

   The chained fallback (JSON-LD -> Next.js -> Discourse -> body-text)
   previously gated on the all-blocks text sum. That sum looked
   healthy for pages where the heuristic emitted 3 KB of nav + footer +
   boilerplate while finding zero MainContent — the renderer's
   MainContentOnly / Wcxb profiles drop those roles anyway, so the
   actual markdown is 0 chars. The gate now measures content-role
   text mass only. 18 catastrophic pages recovered from the gate
   change alone, with no new extraction code.
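
The difference between the two gates can be shown in a few lines. Everything here is a hypothetical sketch: `GateRole`, `GateBlock`, the choice of which roles count as content, and the threshold parameter are assumptions, not the library's types or values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum GateRole { MainContent, Title, Heading, PrimaryNavigation, Footer, Boilerplate }

public sealed record GateBlock(GateRole Role, string Text);

public static class FallbackGateSketch
{
    // Assumed set of roles that survive into content-only profiles.
    private static readonly HashSet<GateRole> ContentRoles =
        new HashSet<GateRole> { GateRole.MainContent, GateRole.Title, GateRole.Heading };

    // Old gate: text mass over every block, chrome included.
    public static int AllBlocksTextMass(IEnumerable<GateBlock> blocks) =>
        blocks.Sum(b => b.Text.Length);

    // New gate: text mass over content roles only.
    public static int ContentRoleTextMass(IEnumerable<GateBlock> blocks) =>
        blocks.Where(b => ContentRoles.Contains(b.Role)).Sum(b => b.Text.Length);

    public static bool ShouldTryFallback(IEnumerable<GateBlock> blocks, int minContentChars) =>
        ContentRoleTextMass(blocks) < minContentChars;
}
```

A page with 3 KB of navigation and footer text but no MainContent passes the old sum yet has zero content-role mass, so the new gate sends it to the fallback chain.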

Playwright auto-fallback decorator

   AddStyloExtractPlaywright() wires PlaywrightHtmlFetcher AND
   decorates the existing ILayoutExtractor with a RenderingLayoutExtractor
   that runs static extraction first, then re-fetches via Playwright
   only when:
     * the caller passed a non-null sourceUri
     * the static result has < 200 chars of content-role text
     * an IRenderedHtmlFetcher is wired in DI

   File-only callers never trigger a render. Operators who don't want
   the Chromium dependency simply don't add the package. Three guards
   against wasted work: Playwright throws -> return static; rendered
   HTML same length as static -> skip the re-extract; re-extract
   yields no improvement -> return static.

   Usage:

       services.AddStyloExtract(...);
       services.AddStyloExtractPlaywright();

   492 tests across 10 projects, 6 new unit tests for the decorator
   policy.
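
The decorator's decision flow, as described in this section, can be sketched with plain delegates. `RenderFallbackPolicySketch` is hypothetical: `extract` stands in for static extraction returning content-role text, `render` for the Playwright fetch, and "no improvement" is assumed to mean "not longer than the static result".

```csharp
using System;

public static class RenderFallbackPolicySketch
{
    public const int MinContentChars = 200;

    public static string Extract(
        string staticHtml,
        Uri? sourceUri,
        Func<string, string> extract,
        Func<Uri, string>? render)
    {
        var staticText = extract(staticHtml);

        // Preconditions: a source URI, a renderer, and a thin static result.
        if (sourceUri is null || render is null || staticText.Length >= MinContentChars)
            return staticText;

        string renderedHtml;
        try { renderedHtml = render(sourceUri); }
        catch (Exception) { return staticText; }                   // guard 1: renderer threw

        if (renderedHtml.Length == staticHtml.Length)
            return staticText;                                      // guard 2: same-length HTML

        var renderedText = extract(renderedHtml);
        return renderedText.Length > staticText.Length
            ? renderedText
            : staticText;                                           // guard 3: no improvement
    }
}
```

Passing `null` for `sourceUri` models the file-only caller, which never triggers a render.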

Aggregate WCXB (1495 dev pages, Wcxb profile):

   | Stage                                  |    F1 | Catastrophic |
   |----------------------------------------|------:|-------------:|
   | 1.8.0-alpha.2                          | 0.760 |           25 |
   | + Next.js extractor                    |  same |              |
   | + content-role fallback gate           | 0.760 |           17 |
   | + 14 LLM-trained YAMLs                 | 0.760 |           17 |
   | (Playwright auto-fallback)             |    -- |              |

   Playwright auto-fallback is wired but not exercised in the WCXB
   benchmark by default — needs `playwright install chromium`. Real-
   world consumers with the package added see automatic recovery for
   JS-rendered SPAs whose content is hydrated client-side.

StyloExtract 1.8.0-alpha.2 - 2026-06-25
========================================

LLM template-training loop, Discourse rehydration, plus a stack of
heuristic + selection fixes that move the WCXB dev split from F1 0.673
(post-1.7.1, MainContentOnly profile) to F1 0.760 (Wcxb plain-text
profile, with operator-trained templates + Discourse rehydration
active). Catastrophic extraction failures (pred_chars ≤ 5) drop from
92 of 1495 pages to 25.

Beats Readability on every page type. Closes the gap to Trafilatura by
~40% on Article + Documentation. Above v1.5.4 baseline (0.718) by
+0.042 — and that's keeping all the GFM markdown structure (sidebar
TOCs, blockquotes, GFM tables) in the runtime output, not stripping
to plain text for benchmark flattery.

What's new since 1.8.0-alpha.1
------------------------------

LLM template training loop (`stylo-extract template train`)

   Operator-driven synchronous LLM template specialisation, the
   counterpart to the existing async enrichment coordinator. Smart-
   routes between induce (no template yet) and repair (template
   exists but underperforms).

   Closed-selector prompt: every selector the model can choose from
   is enumerated from the actual page DOM via DocumentSelectorCatalog
   and handed to the LLM in the prompt. Inventing selectors fails.

   Post-parse AngleSharp validation: every selector the model returns
   is run through doc.QuerySelectorAll. Selectors that match zero
   elements are dropped; templates whose MainContent rule has no
   surviving selector are rejected.
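
The validation step can be sketched without a DOM library by passing the match count in as a delegate. `SelectorValidationSketch` is hypothetical, as is the role-name-to-selectors dictionary shape; `countMatches` stands in for counting the results of doc.QuerySelectorAll.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SelectorValidationSketch
{
    // rules: role name -> candidate selectors returned by the model.
    // Returns the surviving rules, or null when the template is rejected.
    public static Dictionary<string, List<string>>? Validate(
        IReadOnlyDictionary<string, List<string>> rules,
        Func<string, int> countMatches)
    {
        // Drop every selector that matches zero elements on the page.
        var surviving = rules.ToDictionary(
            kv => kv.Key,
            kv => kv.Value.Where(s => countMatches(s) > 0).ToList());

        // Reject the template when MainContent has no surviving selector.
        bool hasMain = surviving.TryGetValue("MainContent", out var main) && main.Count > 0;
        return hasMain ? surviving : null;
    }
}
```

An invented selector simply disappears from its rule; the template as a whole only fails when that leaves MainContent empty.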

   Repair prompt re-angled as a diagnostic: "why is this failing AND
   how should it work for this page" instead of just "produce a
   corrected template."

   Hash-prefixed selectors (`#my-id`) are now properly quoted in
   emitted YAML so they round-trip; the inducer also pre-repairs
   unquoted hash selectors in the LLM response before parse.
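
The quoting matters because a `#` at the start of an unquoted YAML value begins a comment, so a bare `#my-id` would parse as an empty value. A minimal quoting helper, with the hypothetical name `YamlSelectorQuotingSketch`, might look like this:

```csharp
using System;

public static class YamlSelectorQuotingSketch
{
    // Wrap selectors that start with '#' in double quotes so they
    // are read as strings rather than as YAML comments.
    public static string QuoteIfNeeded(string selector)
    {
        var s = selector.Trim();
        bool alreadyQuoted = s.Length >= 2 &&
            ((s[0] == '"' && s[s.Length - 1] == '"') ||
             (s[0] == '\'' && s[s.Length - 1] == '\''));
        if (alreadyQuoted || !s.StartsWith('#')) return s;

        // Escape backslashes and double quotes for a double-quoted scalar.
        return "\"" + s.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";
    }
}
```

Selectors that are already quoted, or that do not start with `#`, pass through unchanged.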

   OllamaTextProvider bumps NumPredict default 1024 → 4096
   (reasoning-tagged models burn tokens on chain-of-thought before
   the answer) and falls back to message.thinking when message.content
   is empty.

   `template repair` command + `LlmTemplateInducer.RepairFromSkeletonAsync`
   + production coordinator dispatch (TemplateEnrichmentJob.Kind +
   LayoutExtractor enqueue on low-output existing-template hits).

Discourse data-preloaded rehydration

   Discourse renders every page as an Ember.js SPA. Static HTML ships
   near-zero post content; the actual topic + posts live in a JSON
   blob in <div id="data-preloaded" data-preloaded="...JSON...">.
   DiscourseRehydrationExtractor parses the JSON, walks
   topic_NNN.post_stream.posts[*].cooked, strips tags, and emits the
   result as a synthetic MainContent fallback block — same shape as
   the existing JSON-LD fallback. Discourse powers 5 000+ public
   forums; one upstream extractor covers them all.
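
A stripped-down version of that walk is sketched below. `DiscoursePreloadSketch` is a hypothetical name, and two things are assumptions rather than facts from these notes: that the input is the already attribute-decoded JSON text, and that a `topic_NNN` value may be either an object or a JSON-encoded string. Tag stripping is done with a crude regex rather than an HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.Json;
using System.Text.RegularExpressions;

public static class DiscoursePreloadSketch
{
    public static List<string> ExtractPosts(string preloadedJson)
    {
        var posts = new List<string>();
        using var doc = JsonDocument.Parse(preloadedJson);
        if (doc.RootElement.ValueKind != JsonValueKind.Object) return posts;

        foreach (var prop in doc.RootElement.EnumerateObject())
        {
            if (!prop.Name.StartsWith("topic_", StringComparison.Ordinal)) continue;

            if (prop.Value.ValueKind == JsonValueKind.String)
            {
                // Assumption: the topic payload is itself a JSON-encoded string.
                using var inner = JsonDocument.Parse(prop.Value.GetString()!);
                Collect(inner.RootElement, posts);
            }
            else
            {
                Collect(prop.Value, posts);
            }
        }
        return posts;
    }

    // Walks post_stream.posts[*].cooked and adds each post as plain text.
    private static void Collect(JsonElement topic, List<string> posts)
    {
        if (topic.ValueKind != JsonValueKind.Object ||
            !topic.TryGetProperty("post_stream", out var stream) ||
            stream.ValueKind != JsonValueKind.Object ||
            !stream.TryGetProperty("posts", out var arr) ||
            arr.ValueKind != JsonValueKind.Array)
            return;

        foreach (var post in arr.EnumerateArray())
        {
            if (post.ValueKind != JsonValueKind.Object ||
                !post.TryGetProperty("cooked", out var cooked) ||
                cooked.ValueKind != JsonValueKind.String)
                continue;

            var text = Regex.Replace(cooked.GetString()!, "<[^>]+>", " ");   // strip tags
            text = WebUtility.HtmlDecode(text);                               // decode entities
            text = Regex.Replace(text, @"\s+", " ").Trim();                   // collapse whitespace
            if (text.Length > 0) posts.Add(text);
        }
    }
}
```

Joining the returned strings gives the text that would back a synthetic MainContent block.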

   WCXB lift: 6 of 13 catastrophic forum pages go from F1=0 to
   F1=0.83–0.99. Forum category F1 0.477 → 0.535.

Wcxb plain-text profile

   WCXB-style word-overlap benchmarks score against plain-text gold.
   The default MainContentOnly / RagFull output emits GFM Markdown —
   headings, lists, sidebar TOCs, multi-paragraph blockquotes — that
   improves AI / human readability but registers as precision noise
   against plain-text comparison.

   New ExtractionProfile.Wcxb uses MainContentOnly's role-set but
   emits each block's plain Text instead of its Markdown. Strictly
   a benchmark / comparison profile — runtime callers keep their
   existing profile and continue getting structured GFM.
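
To see why Markdown syntax reads as precision noise, consider a simple bag-of-words overlap score. This is a generic sketch, not the WCXB scorer: `WordOverlapF1Sketch` is hypothetical, and the whitespace tokenisation and lower-casing are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordOverlapF1Sketch
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Lower-cased token -> occurrence count.
    private static Dictionary<string, int> Bag(string text) =>
        text.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(w => w.ToLowerInvariant())
            .ToDictionary(g => g.Key, g => g.Count());

    public static (double Precision, double Recall, double F1) Score(string predicted, string gold)
    {
        var p = Bag(predicted);
        var g = Bag(gold);

        // Shared tokens, counted with multiplicity.
        int overlap = p.Sum(kv => g.TryGetValue(kv.Key, out var n) ? Math.Min(kv.Value, n) : 0);
        int predictedTotal = p.Values.Sum();
        int goldTotal = g.Values.Sum();
        if (overlap == 0 || predictedTotal == 0 || goldTotal == 0) return (0, 0, 0);

        double precision = (double)overlap / predictedTotal;
        double recall = (double)overlap / goldTotal;
        return (precision, recall, 2 * precision * recall / (precision + recall));
    }
}
```

Scoring `# Title body text` against the gold `Title body text` keeps recall at 1.0 but drops precision to 0.75, because the `#` marker counts as an extra predicted token.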

Heuristic + selection fixes

   DomCleaner: strip <select> globally so <option> text stops
   leaking on category dropdowns. mostlylucid.net opened with 290+
   category names dumped into the output; now opens with the actual
   blog list.

   IntraBlockCleaner: content-guard the contamination-hint substring
   match. "sidebar" substring was eating WordPress / SNOFlex article
   bodies whose class contained "sidebar-mode-single". 28 catastrophic
   article pages recovered.

   LayoutExtractor: body-text fallback for old-school flat HTML
   without <main>/<article>/section wrappers. erikdemaine.org/foldcut
   and similar plain H1/H2/P-under-body pages now extract.

   LayoutExtractor: detect chrome-heavy applicator output as bug-out.
   Stale templates applied to wrong-shape pages produced 1 char of
   MainContent while combinedText looked fine (header + footer
   selectors found chrome). esprit-barbecue, nike, rei collections
   recovered.

   HeuristicBlockClassifier: empty-semantic-wrapper handling and
   body-spanning <form> fall-through. ASP.NET WebForms pages
   (drainblasterbill, etc.) recovered.

   Framework-content-class-hints: 20 new patterns — Discourse, phpBB,
   vBulletin, PrestaShop, WooCommerce, Shopify, BigCommerce,
   Squarespace, Webflow, Wix, Joomla, GitHub Pages, plus some misc.

Benchmark harness

   WCXB harness gains --operator-templates <root> for loading
   YAML files produced by `template train`, --page-ids for fast
   repro of individual failures.

Aggregate WCXB (1495 dev pages, Wcxb profile):

   | System                      |    F1 | Precision | Recall |
   |-----------------------------|------:|----------:|-------:|
   | StyloExtract v1.8.0-alpha.2 | 0.760 |     0.756 |  0.849 |
   | rs-trafilatura              | 0.859 |     0.863 |  0.890 |
   | Trafilatura                 | 0.791 |     0.852 |  0.793 |
   | Readability                 | 0.675 |     0.685 |  0.713 |

Compatibility

Backwards-compatible with 1.8.0-alpha.1. All changes are either new
code paths (Discourse extractor, Wcxb profile, train CLI), strictly
better selection (the heuristic fixes), or schema-additive
(TemplateEnrichmentJob gains optional Kind / BadMarkdownSample with
default Induce). Existing operator templates and trained YAMLs from
alpha.1 continue to work unchanged.

Suite: 486 tests across 10 projects, all green.
