ReadableWeb 1.1.23

There is a newer version of this package available.
See the version list below for details.
dotnet add package ReadableWeb --version 1.1.23
                    
NuGet\Install-Package ReadableWeb -Version 1.1.23
                    
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="ReadableWeb" Version="1.1.23" />
                    
For projects that support PackageReference, copy this XML node into the project file to reference the package.
<PackageVersion Include="ReadableWeb" Version="1.1.23" />
                    
Directory.Packages.props
<PackageReference Include="ReadableWeb" />
                    
Project file
For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.
paket add ReadableWeb --version 1.1.23
                    
#r "nuget: ReadableWeb, 1.1.23"
                    
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
#:package ReadableWeb@1.1.23
                    
#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.
#addin nuget:?package=ReadableWeb&version=1.1.23
                    
Install as a Cake Addin
#tool nuget:?package=ReadableWeb&version=1.1.23
                    
Install as a Cake Tool

ReadableWeb

Overview

Small, modular .NET 10 library and tools to extract article content, images and metadata from web pages. The solution contains extractors, abstractions, HTML parsing implementations, tests and benchmarks.

Projects

  • ReadableWeb.Abstractions � public interfaces for extractors and processors
  • ReadableWeb � composition and higher-level services
  • ReadableWeb.HtmlAgilityPack � HTML Agility Pack based extractor implementation
  • ReadableWeb.AngleSharp � AngleSharp based extractor implementation
  • ReadableWeb.Tests � unit tests
  • ReadableWeb.Benchmarks � benchmark projects
  • ReadableWeb.TestConsole � sample/test console app

Requirements

  • .NET 10 SDK
  • Optional: dotnet-ef or other tooling only if needed for local tasks

Usage examples

Quick extraction via the default HTTP helper:

using ReadableWeb.Extraction;

var extractor = HttpArticleExtractor.CreateDefault();
var article = await extractor.ExtractFromUrlAsync(
    "https://www.example.com/news/story");

Console.WriteLine(article.Title);
Console.WriteLine(article.Excerpt);
foreach (var image in article.Images)
{
    Console.WriteLine($"Image: {image.Url}");
}

Register the library in an ASP.NET Core or worker service using the provided DI extension:

using ReadableWeb;
using ReadableWeb.Configuration;

builder.Services.AddArticleExtraction(builder.Configuration, "ArticleExtraction", options =>
{
    options.EnableImageFileCache = true;
    options.ImageFileCachePath = Path.Combine(builder.Environment.ContentRootPath, "wwwroot/images");
});

Caching

ReadableWeb supports two complementary caching mechanisms to improve performance and reduce bandwidth when extracting articles with images:

  1. Image file cache (local filesystem)
  • Enable this with EnableImageFileCache = true in the configuration or by setting the option in the DI extension. When enabled the extractor will download image assets referenced by the extracted article and persist them to the specified ImageFileCachePath on disk.
  • ImageFileCacheBaseUrl should be set to the public base URL segment where the cached images will be served from (for example /article-images). The extractor will rewrite image URLs in the returned ArticleContent to point at the base URL plus the cached filename.
  • ImageFileCachePath is a filesystem path relative to your application root (or an absolute path). Make sure the directory is writable by the process and is served by your web server (for example, place it under wwwroot in ASP.NET Core or configure static file serving for that path).
  • IgnoreImageDownloadErrors = true prevents extraction from failing when individual image downloads fail. When false, image download failures may surface as errors.

Example configuration for local image caching:

{
  "ArticleExtraction": {
    "Parser": "HtmlAgilityPack",
    "UseRedis": false,
    "EnableImageFileCache": true,
    "ImageFileCachePath": "wwwroot/article-images",
    "ImageFileCacheBaseUrl": "/article-images",
    "IgnoreImageDownloadErrors": true
  }
}
  1. Distributed cache (Redis)
  • If UseRedis = true the library will use the configured distributed cache (typically backed by Redis) to store extract results and/or intermediate data depending on your configuration. This reduces repeated extraction work for the same URLs across multiple instances.
  • To use Redis, register and configure IDistributedCache (for example Microsoft.Extensions.Caching.StackExchangeRedis) in your application and make sure the ArticleExtraction configuration section enables UseRedis.

Example configuration snippet enabling Redis + image cache:

{
  "ArticleExtraction": {
    "Parser": "HtmlAgilityPack",
    "UseRedis": true,
    "EnableImageFileCache": true,
    "ImageFileCachePath": "wwwroot/article-images",
    "ImageFileCacheBaseUrl": "/article-images",
    "IgnoreImageDownloadErrors": true
  }
}

Behavior notes and operational tips

  • Cached images are intended to be served as static files. Ensure your web server serves files from ImageFileCachePath at the ImageFileCacheBaseUrl you configured.
  • The library may use deterministic filenames (for example based on a hash of the original image URL) � treat the cache directory as opaque and prefer clearing it via administrative processes when you need to invalidate images.
  • If you enable a distributed cache (Redis) you should configure an appropriate TTL and eviction policy on the cache side if you need automatic expiration; the library relies on the configured IDistributedCache implementation for storage semantics.
  • When IgnoreImageDownloadErrors is enabled the extraction will still return article text and metadata even if some images fail to download. Disable this setting during debugging to surface download issues.

Expose the extractor through a minimal API endpoint:

using ReadableWeb.Extraction;

var app = builder.Build();

app.MapGet("/api/article", async (
        [FromServices] IHttpArticleExtractor extractor,
        [FromQuery] string url,
        CancellationToken token) =>
    {
        if (string.IsNullOrWhiteSpace(url))
        {
            return Results.BadRequest("Url query parameter is required.");
        }

        var article = await extractor.ExtractFromUrlAsync(url, cancellationToken: token);
        return Results.Ok(article);
    })
   .Produces<ArticleContent>()
   .WithName("GetArticle");

app.Run();

Or wire it up in a conventional API controller:

using ReadableWeb.Extraction;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/[controller]")]
public class ArticleController : ControllerBase
{
    private readonly IHttpArticleExtractor _extractor;

    public ArticleController(IHttpArticleExtractor extractor)
    {
        _extractor = extractor;
    }

    [HttpGet]
    public async Task<IActionResult> Get([FromQuery] string url, CancellationToken token)
    {
        if (string.IsNullOrWhiteSpace(url))
        {
            return BadRequest("Url query parameter is required.");
        }

        var article = await _extractor.ExtractFromUrlAsync(url, cancellationToken: token);
        return Ok(article);
    }
}

Example configuration section consumed by AddArticleExtraction:

{
  "ArticleExtraction": {
    "Parser": "HtmlAgilityPack",
    "UseRedis": false,
    "EnableImageFileCache": true,
    "ImageFileCachePath": "wwwroot/article-images",
    "ImageFileCacheBaseUrl": "/article-images",
    "IgnoreImageDownloadErrors": true
  }
}

Build

Restore and build all projects:

dotnet restore
dotnet build --configuration Release

Run tests

Run unit tests from solution root:

dotnet test

Run benchmarks

Benchmarks use BenchmarkDotNet. Run from the benchmark project directory:

dotnet run -c Release -p ReadableWeb.Benchmarks

Package and publish

This repository includes a GitHub Actions workflow to pack and publish NuGet packages: .github/workflows/publish-nuget.yml. The workflow builds and packs with a version based on commit count and pushes packages to NuGet when the NUGET_API_KEY secret is provided.

Dependency updates

Dependabot configuration is provided in .github/dependabot.yml to open weekly PRs for NuGet package updates.

Contributing

  • Open issues or PRs for bugs and improvements
  • Follow existing coding conventions in the repository
  • Update or add tests for behavior changes

License

No license file included in the repository. Add a LICENSE file if you intend to open source this code.

Contact

For local development questions, run the sample console app ReadableWeb.TestConsole or inspect tests in ReadableWeb.Tests.

Product Compatible and additional computed target framework versions.
.NET net10.0 is compatible.  net10.0-android was computed.  net10.0-browser was computed.  net10.0-ios was computed.  net10.0-maccatalyst was computed.  net10.0-macos was computed.  net10.0-tvos was computed.  net10.0-windows was computed. 
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last Updated
1.1.25 121 12/21/2025
1.1.23 116 12/21/2025
1.1.21 119 12/21/2025
1.1.20 122 12/21/2025
1.1.19 120 12/21/2025