CommonCrawl.Parquet
1.1.0
A .NET library for reading Common Crawl index data stored in Parquet format. This library provides strongly-typed models and an efficient reader to process Common Crawl index records.
Features
- Strongly Typed Models: Maps the Common Crawl Parquet schema to the IndexTableRecord C# class.
- Efficient Reading: Uses ParquetReader to read Parquet files asynchronously.
- Filtering: Supports predicates to filter records while reading.
Installation
Install the package via NuGet:
dotnet add package CommonCrawl.Parquet
Usage
You can use ParquetReader.Instance to read Parquet files. The reader returns an IAsyncEnumerable<T>, allowing for memory-efficient processing.
    using CommonCrawl.Readers;
    using CommonCrawl.Models;

    // Read from a file path
    var reader = ParquetReader.Instance;
    string filePath = "path/to/cc-index.parquet";

    await foreach (var record in reader.ReadAsAsyncEnumerable<IndexTableRecord>(filePath))
    {
        Console.WriteLine($"URL: {record.Url}, Fetch Time: {record.FetchTime}");
    }

    // Read with a filter (e.g., only successful fetches)
    await foreach (var record in reader.ReadAsAsyncEnumerable<IndexTableRecord>(filePath, r => r.FetchStatus == 200))
    {
        Console.WriteLine($"Found valid URL: {record.Url}");
    }
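To illustrate why an IAsyncEnumerable<T> combined with a predicate keeps memory use low, here is a minimal, self-contained sketch. It does not use this package: FakeRecord and ReadFilteredAsync are hypothetical stand-ins, and the library's actual implementation may well differ.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for IndexTableRecord; only two fields are modeled.
public sealed record FakeRecord(string Url, int FetchStatus);

public static class StreamingSketch
{
    // Hypothetical reader: yields records one at a time and drops
    // those that fail the predicate before the caller ever sees them.
    public static async IAsyncEnumerable<FakeRecord> ReadFilteredAsync(
        IEnumerable<FakeRecord> source,
        Func<FakeRecord, bool> predicate)
    {
        foreach (var record in source)
        {
            // Stand-in for asynchronous I/O such as reading the next chunk of a file.
            await Task.Yield();

            if (predicate(record))
            {
                yield return record;
            }
        }
    }
}

public static class Program
{
    public static async Task Main()
    {
        var source = new[]
        {
            new FakeRecord("https://example.com/", 200),
            new FakeRecord("https://example.com/missing", 404),
        };

        // Only the record with status 200 reaches the loop body.
        await foreach (var item in StreamingSketch.ReadFilteredAsync(source, r => r.FetchStatus == 200))
        {
            Console.WriteLine(item.Url); // prints https://example.com/
        }
    }
}
```

Because the consumer pulls one record at a time, nothing forces the full result set into a list; the caller decides whether to aggregate, write out, or stop early with break.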
Models
IndexTableRecord
Represents a single record in the Common Crawl index. Key properties include:
- Url: The full URL string.
- UrlSurtKey: SURT URL key for canonicalization.
- UrlHostName: Hostname of the URL.
- FetchTime: Timestamp of the capture.
- FetchStatus: HTTP status code.
- ContentMimeType: MIME type of the content.
- WarcFilename: Location of the WARC file in Common Crawl's S3 bucket.
- WarcRecordOffset & WarcRecordLength: Position of the record in the WARC file.
For a full list of fields, refer to the source code or the Common Crawl Index Schema.
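The WARC location fields are typically used to fetch a single capture with an HTTP range request. The following is a rough sketch of that pattern, not part of this package: WarcFetchSketch, BuildRange, and FetchRecordAsync are hypothetical names, the https://data.commoncrawl.org/ endpoint is not mentioned above and is used here as Common Crawl's public HTTPS mirror of the bucket, and the parameter types (string, long, long) are assumptions about how the record properties would be passed in.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

public static class WarcFetchSketch
{
    // Assumption: public HTTPS endpoint that serves the same paths as the S3 bucket.
    private const string BaseUrl = "https://data.commoncrawl.org/";

    // HTTP byte ranges are inclusive, so the last byte is offset + length - 1.
    public static RangeHeaderValue BuildRange(long offset, long length)
        => new RangeHeaderValue(offset, offset + length - 1);

    // Downloads one WARC record and returns its decompressed text.
    public static async Task<string> FetchRecordAsync(
        HttpClient http, string warcFilename, long offset, long length)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, BaseUrl + warcFilename);
        request.Headers.Range = BuildRange(offset, length);

        using var response = await http.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
        response.EnsureSuccessStatusCode();

        // Each record in a Common Crawl WARC file is stored as its own gzip member,
        // so the requested slice can be decompressed independently of the rest of the file.
        await using var body = await response.Content.ReadAsStreamAsync();
        await using var gzip = new GZipStream(body, CompressionMode.Decompress);

        // Decoding as UTF-8 is a simplification; binary payloads need raw bytes instead.
        using var reader = new StreamReader(gzip, Encoding.UTF8);
        return await reader.ReadToEndAsync();
    }
}
```

With a record obtained from the reader, a call might look like `await WarcFetchSketch.FetchRecordAsync(http, record.WarcFilename, record.WarcRecordOffset, record.WarcRecordLength)`, adjusted for the actual property types. The returned text contains the WARC headers followed by the HTTP response headers and body.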
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies
- net10.0
  - Parquet.Net (>= 5.4.0)