# CocoCrawler 0.1.2

Install with the .NET CLI:

```
dotnet add package CocoCrawler --version 0.1.2
```

with the NuGet Package Manager console:

```
NuGet\Install-Package CocoCrawler -Version 0.1.2
```

or as a package reference in your project file:

```xml
<PackageReference Include="CocoCrawler" Version="0.1.2" />
```
## Overview
CocoCrawler is an easy-to-use web crawler, scraper, and parser written in C#. By combining PuppeteerSharp and AngleSharp it brings the best of both worlds and merges them into an easy-to-use API.

It provides a simple API to get started:
```csharp
// Assumes an ILoggerFactory 'loggerFactory' and a CancellationToken 'cancellationToken' are in scope.
var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPage("https://old.reddit.com/r/csharp", pageOptions => pageOptions
        .ExtractList(containersSelector: "div.thing.link.self", [
            new("Title", "a.title"),
            new("Upvotes", "div.score.unvoted"),
            new("Datetime", "time", "datetime"),
            new("Total Comments", "a.comments"),
            new("Url", "a.title", "href")
        ])
        .AddPagination("span.next-button > a", newPage => newPage.ScrollToEnd())
        .AddOutputToConsole()
        .AddOutputToCsvFile("results.csv"))
    .ConfigureEngine(options =>
    {
        options.UseHeadlessMode(false);
        options.WithLoggerFactory(loggerFactory);
    })
    .BuildAsync(cancellationToken);

await crawlerEngine.RunAsync(cancellationToken);
```
This example starts at https://old.reddit.com/r/csharp, scrapes all the posts on the page, then follows the pagination selector to the next page and scrapes again, and so on.
With this library it's easy to:
- Scrape Single Page Apps
- Scrape Listings
- Add pagination
- As an alternative to scraping a listing, open each post, scrape the opened page, and continue with pagination
- Scrape multiple pages in parallel
- Add custom outputs
- Customize Everything
## Scraping pages
With each Page added (a Page is a single URL job) it's possible to add Tasks. For each Page it's possible to call:

- `.ExtractObject(...)`
- `.ExtractList(...)`
- `.OpenLinks(...)`
- `.AddPagination(...)`
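For instance, a single page can combine an object extraction with a list extraction; a minimal sketch, where the URL and selectors are illustrative rather than taken from a real site:

```csharp
.AddPage("https://example.com/blog", pageOptions => pageOptions
    .ExtractObject([
        new("Blog Title", "h1.site-title")    // one object per page
    ])
    .ExtractList(containersSelector: "article.post", [
        new("Post Title", "h2 a"),
        new("Post Url", "h2 a", "href")       // third argument reads an attribute
    ])
    .AddOutputToConsole())
```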
It's also possible to add multiple pages that are scraped with the same Tasks:
```csharp
var crawlerEngine = await new CrawlerEngineBuilder()
    .AddPages(["https://old.reddit.com/r/csharp", "https://old.reddit.com/r/dotnet"], pageOptions => pageOptions
        .OpenLinks("div.thing.link.self a.bylink.comments", subPageOptions =>
        {
            subPageOptions.ExtractObject([
                new("Title", "div.sitetable.linklisting a.title"),
                new("Url", "div.sitetable.linklisting a.title", "href"),
                new("Upvotes", "div.sitetable.linklisting div.score.unvoted"),
                new("Top comment", "div.commentarea div.entry.unvoted div.md")
            ]);
            subPageOptions.ConfigurePageActions(ops =>
            {
                ops.ScrollToEnd();
                ops.Wait(4000);
            });
        })
        .AddPagination("span.next-button > a")
        .AddOutputToConsole()
        .AddOutputToCsvFile("results.csv"))
    .BuildAsync(cancellationToken);

await crawlerEngine.RunAsync(cancellationToken);
```
This example starts at https://old.reddit.com/r/csharp and https://old.reddit.com/r/dotnet, opens each post, and scrapes the title, URL, upvotes, and top comment. On each opened post it scrolls to the end of the page and waits 4 seconds before scraping, then continues to the next page of the listing.
## Configuring the Engine
The engine can be configured with the following options:
- `UseHeadlessMode(bool headless)`: whether the browser should run headless
- `WithLoggerFactory(ILoggerFactory loggerFactory)`: the logger factory to use
- `WithUserAgent(string userAgent)`: the user agent to use
- `WithCookies(params Cookie[] cookies)`: the cookies to use
- `TotalPagesToCrawl(int total)`: the total number of pages to crawl
- `WithParallelismDegree(int parallelismDegree)`: the number of pages to crawl in parallel
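For example, a sketch that combines several of these options (the specific values are illustrative):

```csharp
.ConfigureEngine(options =>
{
    options.UseHeadlessMode(true);            // no visible browser window
    options.WithUserAgent("my-crawler/1.0");  // illustrative user agent string
    options.TotalPagesToCrawl(100);           // stop after 100 pages
    options.WithParallelismDegree(5);         // crawl up to 5 pages at once
})
```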
## Stopping the engine

The engine stops when either:
- The total number of pages to crawl is reached.
- Two minutes have passed since the last job was added.
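Because `BuildAsync` and `RunAsync` accept a `CancellationToken`, a crawl can also be bounded externally. A sketch that caps a run at ten minutes, assuming the engine observes cancellation (`builder` stands in for a fully configured `CrawlerEngineBuilder`):

```csharp
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));

var crawlerEngine = await builder.BuildAsync(cts.Token);
await crawlerEngine.RunAsync(cts.Token); // returns when a stop condition is met or the token is cancelled
```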
## Extensibility
The library is designed to be extensible. It's possible to add custom `IParser`, `IScheduler`, and `ICrawler` implementations.

Using the engine builder, it's possible to plug in custom implementations:
```csharp
.ConfigureEngine(options =>
{
    options.WithCrawler(new MyCustomCrawler());
    options.WithScheduler(new MyCustomScheduler());
    options.WithParser(new MyCustomParser());
})
```
## Custom Outputs
It's possible to add custom outputs by implementing the `ICrawlOutput` interface. `ICrawlOutput.WriteAsync(JObject jObject, CancellationToken cancellationToken)` is called for each scraped object.
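As a minimal sketch, assuming `WriteAsync` returns a `Task` and is the only member the interface requires, a JSON-lines output could look like this (`JsonLinesOutput` is a hypothetical name, not part of the library):

```csharp
using Newtonsoft.Json.Linq;

// Hypothetical custom output: appends each scraped object as one JSON line.
public class JsonLinesOutput : ICrawlOutput
{
    private readonly string _path;

    public JsonLinesOutput(string path) => _path = path;

    public async Task WriteAsync(JObject jObject, CancellationToken cancellationToken)
    {
        var line = jObject.ToString(Newtonsoft.Json.Formatting.None) + Environment.NewLine;
        await File.AppendAllTextAsync(_path, line, cancellationToken);
    }
}
```

Note that with a parallelism degree above 1, concurrent writes to the same file may need synchronization; the built-in console and CSV outputs are the safer default.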
## Compatibility

| Product | Compatible and additional computed target frameworks |
|---|---|
| .NET | net8.0 is compatible. net9.0 and net10.0 were computed, along with the android, browser, ios, maccatalyst, macos, tvos, and windows targets for net8.0, net9.0, and net10.0. |
## Dependencies

For net8.0:

- AngleSharp (>= 1.1.2)
- PuppeteerSharp (>= 18.0.2)