TinySpider 1.0.1
dotnet add package TinySpider --version 1.0.1
NuGet\Install-Package TinySpider -Version 1.0.1
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="TinySpider" Version="1.0.1" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.
<PackageVersion Include="TinySpider" Version="1.0.1" />
<PackageReference Include="TinySpider" />
For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.
paket add TinySpider --version 1.0.1
The NuGet Team does not provide support for this client. Please contact its maintainers for support.
#r "nuget: TinySpider, 1.0.1"
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
#:package TinySpider@1.0.1
#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.
#addin nuget:?package=TinySpider&version=1.0.1
#tool nuget:?package=TinySpider&version=1.0.1
The NuGet Team does not provide support for this client. Please contact its maintainers for support.
TinySpider
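A lightweight web crawler framework
TinySpider is a lightweight, multithreaded web crawler framework. It only wraps basic task scheduling. It does not parse HTML itself; you either implement parsing yourself or use a third-party HTML parsing library.
TinySpider has the following characteristics:
- Lightweight: the source code is very small, and the framework itself does not depend on any third-party components.
- Minimal wrapping: the framework core does not deal with HTTP communication or HTML parsing, although the default IDownloader implementation uses the WebClient class that ships with .NET.
- Semi-finished: you cannot simply instantiate it with new and use it as is. You must implement an IHtmlParser (parser) and an IPipeline (pipeline).
- Open: every interface component of the framework can be implemented and replaced by the user.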
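Framework components:
- IScheduler: task scheduler, responsible for allocating and scheduling threads and URL tasks.
- IUrlStore: URL store, responsible for managing the URLs to be crawled.
- IDownloader: downloader, responsible for downloading the HTML source of the target URL.
- IHtmlParser: parser, responsible for parsing the HTML page, extracting the URLs to crawl next as well as the page content you care about. (Must be implemented by the user.)
- IPipeline: processing pipeline, responsible for receiving the content extracted by the parser. (Must be implemented by the user.)
- IWebProxyPool: proxy server pool, responsible for managing and supplying web proxy servers.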
Usage example:
Implement IHtmlParser and IPipeline
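using System;
using System.Collections.Generic;

/// <summary>
/// Implements the HTML parsing interface.
/// This example uses the HtmlAgilityPack library to parse the HTML. You can also use
/// regular expressions or another HTML processing library.
/// See the HtmlAgilityPack documentation for details on how to use it.
/// </summary>
class MyHtmlParser : IHtmlParser
{
    // Parses the HTML.
    // Add the relevant hyperlinks found on the page to the `links` out parameter;
    // otherwise the scheduler will have no URLs to schedule.
    public PageData Parse(Uri sourceUrl, string html, out List<Uri> links)
    {
        links = new List<Uri>();
        var data = new PageData();
        var doc = new HtmlAgilityPack.HtmlDocument();
        // Load the HTML into the document object
        doc.LoadHtml(html);
        // Extract every hyperlink in the document
        var all_links = doc.DocumentNode.Descendants("a");
        foreach (var item in all_links)
        {
            if (item.Attributes.Contains("href"))
            {
                // Read the href value from the tag; it is not necessarily a complete URL
                string path = item.Attributes["href"].Value;
                // Use the Uri class to normalize the URL against the page URL
                var uri = new Uri(baseUri: sourceUrl, relativeUri: path);
                // Restrict the crawl to the target site
                if (uri.AbsoluteUri.Contains("sohu.com/"))
                {
                    // Hand the URL to the scheduler
                    links.Add(uri);
                }
            }
        }
        // Extract the content you care about and
        // store it in Data1 / Data2 of the PageData object.
        // The pipeline later stores or otherwise processes Data1 / Data2.
        data.Data1 = new Data()
        {
            Url = sourceUrl,
            HTML = html
        };
        return data;
    }
}

public class Data
{
    public string HTML { get; set; }
    public Uri Url { get; set; }
}

/// <summary>
/// Implements an IPipeline.
/// </summary>
class MyPipeline : IPipeline
{
    public void FetchItem(PageData page)
    {
        // Data1 carries the object that the parser stored
        Data data = (Data)page.Data1;
        Console.WriteLine("Received crawl data: " + page.SourceUrl);
    }
}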
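The link-handling step in MyHtmlParser constructs a Uri directly from each href and filters with a substring check. The following is a minimal, self-contained sketch of a more defensive variant, using only the standard System.Uri API. The names LinkResolver, TryResolve, IsInDomain and Demo are hypothetical helpers introduced for illustration; they are not part of TinySpider.

```csharp
using System;

// Hypothetical helper, not part of TinySpider.
static class LinkResolver
{
    // Resolves an href against the page URL.
    // Returns false for empty, malformed, or non-HTTP(S) links instead of throwing.
    public static bool TryResolve(Uri pageUrl, string href, out Uri resolved)
    {
        resolved = null;
        if (string.IsNullOrWhiteSpace(href)) return false;
        if (!Uri.TryCreate(pageUrl, href.Trim(), out Uri candidate)) return false;
        if (candidate.Scheme != Uri.UriSchemeHttp && candidate.Scheme != Uri.UriSchemeHttps) return false;
        resolved = candidate;
        return true;
    }

    // True when the URL's host is the given domain or one of its subdomains.
    public static bool IsInDomain(Uri url, string domain)
    {
        return url.Host.Equals(domain, StringComparison.OrdinalIgnoreCase)
            || url.Host.EndsWith("." + domain, StringComparison.OrdinalIgnoreCase);
    }
}

static class Demo
{
    static void Main()
    {
        var page = new Uri("https://www.sohu.com/a/index.html");
        var hrefs = new[]
        {
            "/news/1.html",                        // root-relative link
            "detail.html?id=2",                    // link relative to the current path
            "javascript:void(0)",                  // not a crawlable link
            "https://example.com/?from=sohu.com/"  // other host that only mentions the domain
        };

        foreach (var href in hrefs)
        {
            if (LinkResolver.TryResolve(page, href, out Uri uri) && LinkResolver.IsInDomain(uri, "sohu.com"))
                Console.WriteLine("keep: " + uri.AbsoluteUri);
            else
                Console.WriteLine("skip: " + href);
        }
    }
}
```

Comparing the host rather than searching the whole URL rejects the last sample link, which a Contains("sohu.com/") check would accept because the domain appears in its query string. Inside Parse, the same two calls could replace the Uri constructor and the Contains check, assuming that stricter filtering is what you want.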
Running
using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Hello TinySpider!");
        // Entry point URL
        var entryUrl = new Uri("https://www.sohu.com/");
        // Text encoding of the target site
        var web_encode = Encoding.UTF8;
        // Number of concurrent crawler threads
        var threads = 40;
        var spider = new TinySpider.TinySpider(
            new Scheduler(new UrlStore()) { WorkerThreads = threads },
            new Downloader() { Encoding = web_encode },
            new MyHtmlParser(),
            new MyPipeline()
        );
        // Start the crawler. Run() blocks until all tasks have completed.
        spider.Run(entryUrl);
        Console.WriteLine("\r\n\r\n");
        Console.WriteLine("Press Any Key To Exit...");
        Console.ReadKey();
    }
}
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
.NET Core | netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
.NET Standard | netstandard2.0 is compatible. netstandard2.1 was computed. |
.NET Framework | net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed. |
MonoAndroid | monoandroid was computed. |
MonoMac | monomac was computed. |
MonoTouch | monotouch was computed. |
Tizen | tizen40 was computed. tizen60 was computed. |
Xamarin.iOS | xamarinios was computed. |
Xamarin.Mac | xamarinmac was computed. |
Xamarin.TVOS | xamarintvos was computed. |
Xamarin.WatchOS | xamarinwatchos was computed. |
- .NETStandard 2.0: No dependencies.
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.