WebReaper.Extraction.Generators
11.3.0
Roslyn source generator that emits a static Schema property and a reflection-free static Materialize method on partial classes marked with [ScrapeSchema]. This is the .NET-native structural differentiator (REPOSITIONING-PLAN §2.3): Pydantic-style typed schemas generated at compile time, which Python's runtime reflection cannot match.
Install
You usually want both packages together (this one is a compile-time analyzer; the attributes ship in a sibling package):
dotnet add package WebReaper.Extraction.Generators
dotnet add package WebReaper.Extraction.Attributes
WebReaper.Extraction.Generators is packaged as an analyzer with DevelopmentDependency=true, so it runs only at compile time and does not propagate to your project's runtime dependency graph.
What's emitted
For each class marked with [ScrapeSchema], the generator emits:
public partial class Article
{
    public static Schema Schema { get; }
    public static Article Materialize(JsonObject json);
}
Schema is built once at compile time from the [ScrapeField] attributes on the class's properties. Materialize is reflection-free, so an AOT publish can trim and inline it.
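To make "reflection-free" concrete, here is a minimal hand-written sketch of what a Materialize method for Article could look like. It is not the generator's actual output. It assumes JsonObject is System.Text.Json.Nodes.JsonObject and that each value is keyed by its property name ("Title", "Views", "Tags"); both are assumptions for illustration. The attributes and the Schema member are left out so the snippet compiles without the WebReaper packages.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json.Nodes;

// User-written half of the partial class (attributes omitted so the sketch stands alone).
public partial class Article
{
    public string? Title { get; set; }
    public int Views { get; set; }
    public List<string> Tags { get; set; } = new();
}

// Hypothetical stand-in for the generated half; the real generator output may differ.
public partial class Article
{
    public static Article Materialize(JsonObject json)
    {
        var result = new Article();

        // Assumed key names: the property names themselves.
        // Each property is read with a direct, typed lookup instead of reflection.
        if (json["Title"] is JsonValue title && title.TryGetValue<string>(out var t))
            result.Title = t;

        if (json["Views"] is JsonValue views && views.TryGetValue<int>(out var v))
            result.Views = v;

        if (json["Tags"] is JsonArray tags)
        {
            foreach (JsonNode? item in tags)
            {
                // Skip null or non-string entries rather than throwing.
                if (item is JsonValue tag && tag.TryGetValue<string>(out var s))
                    result.Tags.Add(s);
            }
        }

        return result;
    }
}
```

Because every property access is spelled out as ordinary C#, nothing here needs runtime type metadata, which is the property that makes this style of code friendly to trimming and AOT.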
Quick start
using WebReaper.Extraction.Attributes;
using WebReaper.Builders;

[ScrapeSchema]
public partial class Article
{
    [ScrapeField("h1")] public string? Title { get; set; }
    [ScrapeField(".views", Type = SchemaFieldType.Integer)] public int Views { get; set; }
    [ScrapeField(".tag", IsList = true)] public List<string> Tags { get; set; } = new();
}

var engine = await ScraperEngineBuilder
    .Crawl("https://example.com/post")
    .Extract(Article.Schema)
    .Subscribe(p => HandleArticle(Article.Materialize(p.Data)))
    .BuildAsync();
v1 scope
Common case only:
- Single-level schemas
- Primitive fields (string, int, bool, DateTime, float)
- List<T> of primitives
Nested [ScrapeSchema] types are explicitly deferred to a future version. The attributes package supports the syntax; the generator does not yet emit code for nested classes.
See also
- Main repo: github.com/pavlovtech/WebReaper
- The attributes: WebReaper.Extraction.Attributes
- Design: ADR-0045
- License: MIT
.NETStandard 2.0: no dependencies.
10.0.1: NuGet metadata polish. Adds PackageIcon + PackageReadmeFile so the package displays a logo and README on its NuGet page. Removes em-dashes from Description and release notes. No code changes.

10.0.0: initial release. Roslyn IIncrementalGenerator (ADR-0045) emitting a compile-time `static Schema Schema` and a reflection-free `static Materialize(JsonObject)` on partial classes marked with [ScrapeSchema]. AOT-clean: no reflection, no dynamic; the source-generator runs at compile time and emits ordinary C# the AOT publish trims and inlines. The .NET-native structural differentiator (REPOSITIONING-PLAN §2.3): Pydantic-parity Python cannot match. v1 ships the common case: single-level schemas, primitive fields, List<T> of primitives. Nested [ScrapeSchema] POCOs are explicitly deferred. Pairs with WebReaper.Extraction.Attributes; requires WebReaper 10.0.0 at runtime.