SMEAppHouse.Core.ScraperBox 9.0.9

.NET CLI:
dotnet add package SMEAppHouse.Core.ScraperBox --version 9.0.9

Package Manager Console (in Visual Studio; this command uses the NuGet module's version of Install-Package):
NuGet\Install-Package SMEAppHouse.Core.ScraperBox -Version 9.0.9

PackageReference (for projects that support it, copy this XML node into the project file):
<PackageReference Include="SMEAppHouse.Core.ScraperBox" Version="9.0.9" />

Central Package Management (CPM), in the solution's Directory.Packages.props file, to version the package:
<PackageVersion Include="SMEAppHouse.Core.ScraperBox" Version="9.0.9" />

Central Package Management (CPM), in the project file:
<PackageReference Include="SMEAppHouse.Core.ScraperBox" />

Paket CLI:
paket add SMEAppHouse.Core.ScraperBox --version 9.0.9

#r directive (F# Interactive and Polyglot Notebooks; copy this into the interactive tool or the script source):
#r "nuget: SMEAppHouse.Core.ScraperBox, 9.0.9"

#:package directive (C# file-based apps starting in .NET 10 preview 4; place it in a .cs file before any lines of code):
#:package SMEAppHouse.Core.ScraperBox@9.0.9

Cake Addin:
#addin nuget:?package=SMEAppHouse.Core.ScraperBox&version=9.0.9

Cake Tool:
#tool nuget:?package=SMEAppHouse.Core.ScraperBox&version=9.0.9

SMEAppHouse.Core.ScraperBox

Overview

SMEAppHouse.Core.ScraperBox is a web scraping library. It provides utilities for fetching web pages, parsing HTML, working with proxies, and handling cookies, along with helper methods for related scraping tasks.

Target Framework: .NET 8.0
Namespace: SMEAppHouse.Core.ScraperBox


Public Classes and Utilities

1. Helper (Static Class)

Main utility class for web scraping operations.

Namespace: SMEAppHouse.Core.ScraperBox

URL and HTTP Operations
ResolveHttpUrl
public static string ResolveHttpUrl(string url)

Resolves protocol-relative URLs (those starting with //) by prefixing them with http:, which yields an http:// URL.

Example:

var url = Helper.ResolveHttpUrl("//example.com/page"); // Returns "http://example.com/page"
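
The behaviour described above can be approximated with plain string handling. The following is a minimal sketch under that assumption, not the library's implementation; the function name ResolveProtocolRelative is hypothetical.

```csharp
using System;

// Hypothetical helper: prefixes "http:" to protocol-relative URLs and
// leaves every other input unchanged.
static string ResolveProtocolRelative(string url) =>
    url != null && url.StartsWith("//", StringComparison.Ordinal) ? "http:" + url : url;

Console.WriteLine(ResolveProtocolRelative("//example.com/page"));       // http://example.com/page
Console.WriteLine(ResolveProtocolRelative("https://example.com/page")); // https://example.com/page
```

URLs that already carry a scheme pass through untouched in this sketch.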

ExtractDomainNameFromUrl
public static string ExtractDomainNameFromUrl(string url, bool retainHttPrefix = false)

Extracts the domain name from a URL.

Example:

var domain = Helper.ExtractDomainNameFromUrl("https://www.example.com/path/page");
// Returns "www.example.com"; with retainHttPrefix: true, returns "https://www.example.com"
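
For comparison, the same two results can be obtained from System.Uri in the base class library. This is only an illustration of the expected output; it is an assumption that the library behaves equivalently for other inputs.

```csharp
using System;

var uri = new Uri("https://www.example.com/path/page");

// Host only, without scheme or path.
Console.WriteLine(uri.Host);                              // www.example.com

// Scheme plus host, comparable to retainHttPrefix: true.
Console.WriteLine(uri.GetLeftPart(UriPartial.Authority)); // https://www.example.com
```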

IsURLValid
public static bool IsURLValid(string url, bool brute = false)

Checks whether the given string is a valid URL. The optional brute flag defaults to false.
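
A purely syntactic check of this kind can be sketched with Uri.TryCreate. The function name LooksLikeHttpUrl is hypothetical, and restricting the result to http and https is an assumption of this sketch; what IsURLValid and its brute flag actually verify is not documented here.

```csharp
using System;

// Hypothetical check: accepts only absolute http/https URLs.
// No network request is made; this validates syntax only.
static bool LooksLikeHttpUrl(string candidate) =>
    Uri.TryCreate(candidate, UriKind.Absolute, out var uri)
    && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);

Console.WriteLine(LooksLikeHttpUrl("https://example.com")); // True
Console.WriteLine(LooksLikeHttpUrl("not a url"));           // False
Console.WriteLine(LooksLikeHttpUrl("ftp://example.com"));   // False
```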


Page Document Retrieval
GetPageDocument (Multiple Overloads)
public static string GetPageDocument(string site)
public static string GetPageDocument(string site, IWebProxy webProxy, ref string extraDataOnError)
public static string GetPageDocument(Uri site, ...)
public static string GetPageDocument(string sourceUrl, ref string extraDataOnError, IWebProxy webProxy = null, ...)

Retrieves HTML content from a web page with optional proxy support.

Example:

// Simple fetch
var html = Helper.GetPageDocument("https://example.com");

// With proxy (WebProxy lives in System.Net)
var proxy = new WebProxy("127.0.0.1", 8080);
string errorData = null;
var proxiedHtml = Helper.GetPageDocument("https://example.com", proxy, ref errorData);

GetPageDocumentWithCookie
public static string GetPageDocumentWithCookie(string url)

Retrieves page content with cookie support.
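
How the library manages cookies internally is not shown here. As a rough illustration of cookie-aware fetching with the standard HttpClient stack, the sketch below wires a CookieContainer into a handler; the cookie name and value are placeholders.

```csharp
using System;
using System.Net;
using System.Net.Http;

// A CookieContainer stores cookies set by responses and replays them on later requests.
var cookies = new CookieContainer();
var handler = new HttpClientHandler { CookieContainer = cookies, UseCookies = true };
using var client = new HttpClient(handler); // disposing the client also disposes the handler

// Seed a cookie manually; the name and value are placeholders.
var site = new Uri("https://example.com/");
cookies.Add(site, new Cookie("session", "abc123"));

// This is the Cookie header the handler would attach to requests for the site.
Console.WriteLine(cookies.GetCookieHeader(site)); // session=abc123

// A real fetch would then be:
// var html = await client.GetStringAsync(site);
```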


HTML Processing
Resolve
public static string Resolve(string val, bool allTrim = false, params string[] otherElementsToClear)

Cleans and resolves HTML entities and encoded characters.

Example:

var cleaned = Helper.Resolve("&amp;Hello%20World", allTrim: true);
// Returns "&Hello World"
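
The result shown above matches what two standard decoders produce when chained. The sketch below is an approximation under that assumption, not the library's code; DecodeEntitiesAndEscapes is a hypothetical name and the otherElementsToClear parameter is not modelled.

```csharp
using System;
using System.Net;

// Decode HTML entities first, then percent-encoded characters.
static string DecodeEntitiesAndEscapes(string value, bool trim = false)
{
    var decoded = Uri.UnescapeDataString(WebUtility.HtmlDecode(value));
    return trim ? decoded.Trim() : decoded;
}

Console.WriteLine(DecodeEntitiesAndEscapes("&amp;Hello%20World", trim: true)); // &Hello World
```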

CleanupHtmlStrains
public static string CleanupHtmlStrains(string val, bool allTrim = false)

Removes HTML entities and unwanted characters.

RemoveHtmlComments
public static string RemoveHtmlComments(string sourceHtml)

Removes HTML comments from source.
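
One common way to strip comments is a non-greedy regular expression. The sketch below assumes that approach for illustration; StripComments is a hypothetical name, and a regex like this does not account for comment-like text inside script or style blocks.

```csharp
using System;
using System.Text.RegularExpressions;

// Removes <!-- ... --> blocks, including comments that span several lines.
static string StripComments(string html) =>
    Regex.Replace(html, "<!--.*?-->", string.Empty, RegexOptions.Singleline);

Console.WriteLine(StripComments("<p>Hi</p><!-- note --><p>there</p>")); // <p>Hi</p><p>there</p>
```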

RemoveUnwantedTags
public static string RemoveUnwantedTags(string data)
public static string RemoveUnwantedTags(string data, string[] acceptableTags)

Removes unwanted HTML tags, optionally keeping only specified tags.
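
The allow-list idea behind the second overload can be sketched with a regular expression that drops every tag not named in the list while keeping the text. KeepOnlyTags is a hypothetical name, and the case-sensitive name comparison is an assumption of this sketch.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Drops every tag whose name is not in the allow-list; text content is kept.
static string KeepOnlyTags(string html, params string[] allowed) =>
    Regex.Replace(html, @"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>", m =>
        allowed.Contains(m.Groups[1].Value) ? m.Value : string.Empty);

Console.WriteLine(KeepOnlyTags("<div><p>Hello <b>World</b></p></div>", "p")); // <p>Hello World</p>
```

For anything beyond simple markup, walking the parsed HtmlAgilityPack tree is likely more robust than a regex.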


HTML Node Operations
GetInnerText (Multiple Overloads)
public static string GetInnerText(HtmlNode sourceNode, ...)
public static string GetInnerText(HtmlNode node, params string[] tagsToRemove)
public static string GetInnerText(HtmlNode node, bool removeCommentTags = true, params string[] tagsToRemove)
public static string GetInnerText(string sourceHtml, params string[] tagsToRemove)
public static string GetInnerText(string sourceHtml, bool removeCommentTags = true, params string[] tagsToRemove)

Extracts inner text from HTML nodes or strings.

Example:

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml("<div><p>Hello <b>World</b></p></div>");
var node = htmlDoc.DocumentNode.SelectSingleNode("//div");

var text = Helper.GetInnerText(node, "b"); // Returns "Hello World" (removes <b> tags)

GetNode
public static HtmlNode GetNode(HtmlNode node, ...)

Gets a specific HTML node using XPath or other selectors.

GetNodeByInnerHtml
public static HtmlNode GetNodeByInnerHtml(HtmlNode node, ...)

Finds a node by its inner HTML content.

GetNodeByAttribute
public static HtmlNode GetNodeByAttribute(HtmlNode node, ...)

Finds a node by attribute value.

GetNodeCollection
public static IEnumerable<HtmlNode> GetNodeCollection(HtmlNode node, ...)
public static IEnumerable<HtmlNode> GetNodeCollection(HtmlNode node, params string[] element)

Gets a collection of HTML nodes.


Query String Operations
EncodeQueryStringSegment
public static string EncodeQueryStringSegment(string query)

Encodes query string segments.

Example:

var encoded = Helper.EncodeQueryStringSegment("hello world & test");
// Returns "hello%20world%20%26%20test"
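
The output above is the same percent-encoding that Uri.EscapeDataString produces, which makes it a convenient reference point. Whether the library calls it internally is an assumption.

```csharp
using System;

// Spaces become %20 and the ampersand becomes %26.
var encoded = Uri.EscapeDataString("hello world & test");
Console.WriteLine(encoded); // hello%20world%20%26%20test

// Decoding restores the original segment.
Console.WriteLine(Uri.UnescapeDataString(encoded)); // hello world & test
```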

Proxy Operations
FindProxyCountryFromPartial
public static Rules.WorldCountriesEnum FindProxyCountryFromPartial(string countryNamePartial)

Finds a country enum value from a partial country name.

Example:

var country = Helper.FindProxyCountryFromPartial("united");
// e.g. WorldCountriesEnum.UNITED_STATES; a short partial such as "united" may match more than one country name
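
Partial-name lookup over an enum can be sketched generically. The example below uses System.DayOfWeek as a stand-in because WorldCountriesEnum is not reproduced here; FindByPartialName is a hypothetical name, and the first-match-wins rule is an assumption of this sketch rather than documented library behaviour.

```csharp
using System;

// Returns the first enum member whose name contains the partial text, ignoring case,
// or null when nothing matches.
static TEnum? FindByPartialName<TEnum>(string partial) where TEnum : struct, Enum
{
    foreach (var value in Enum.GetValues<TEnum>())
    {
        if (value.ToString().Contains(partial, StringComparison.OrdinalIgnoreCase))
            return value;
    }
    return null;
}

Console.WriteLine(FindByPartialName<DayOfWeek>("wed")); // Wednesday
Console.WriteLine(FindByPartialName<DayOfWeek>("day")); // Sunday (first of several matches)
```

The second call shows why an ambiguous partial is worth avoiding: several members match and only the first is returned.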

2. Models

IPProxy

Represents an IP proxy server.

Namespace: SMEAppHouse.Core.ScraperBox.Models

Properties:

  • Guid Id - Unique identifier
  • string ProviderId - Proxy provider ID
  • string IPAddress - Proxy IP address
  • int PortNo - Proxy port number
  • Rules.WorldCountriesEnum Country - Country of proxy
  • IPProxyRules.ProxyAnonymityLevelsEnum AnonymityLevel - Anonymity level
  • IPProxyRules.ProxyProtocolsEnum Protocol - Protocol (HTTP, HTTPS, SOCKS)
  • DateTime LastChecked - Last validation time
  • int ResponseRate - Response rate percentage
  • int SpeedRate - Speed in milliseconds
  • TimeSpan SpeedTimeSpan - Speed as TimeSpan
  • string ISP - Internet Service Provider
  • string City - City location
  • IPProxyRules.ProxySpeedsEnum Speed - Speed category
  • IPProxyRules.ProxyConnectionSpeedsEnum ConnectionTime - Connection speed
  • Guid CheckerTokenId - Checker token identifier
  • CheckStatusEnum CheckStatus - Current check status
  • Tuple<string, string> Credential - Username/password credentials

Methods:

ToWebProxy
public IWebProxy ToWebProxy()

Converts to IWebProxy for use with HTTP clients.

ToNetworkCredential
public NetworkCredential ToNetworkCredential()

Converts credentials to NetworkCredential.

AsTuple
public Tuple<string, string> AsTuple()

Returns IP and port as a tuple.

GetLastValidationElapsedTime
public TimeSpan GetLastValidationElapsedTime()

Gets time elapsed since last validation.

CheckStatusEnum:

  • NotChecked
  • Checking
  • Checked
  • CheckedInvalid

Example:

var proxy = new IPProxy
{
    IPAddress = "192.168.1.1",
    PortNo = 8080,
    Country = Rules.WorldCountriesEnum.UNITED_STATES,
    AnonymityLevel = IPProxyRules.ProxyAnonymityLevelsEnum.Elite,
    Protocol = IPProxyRules.ProxyProtocolsEnum.HTTP,
    Credential = new Tuple<string, string>("username", "password")
};

// Use with HTTP client
var webProxy = proxy.ToWebProxy();
var credential = proxy.ToNetworkCredential();
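
To show where the converted objects end up, the sketch below builds the equivalent System.Net types by hand and attaches them to an HttpClient. The address and credentials are the placeholder values from the example above; that ToWebProxy produces an object configured like this is an assumption.

```csharp
using System;
using System.Net;
using System.Net.Http;

// Placeholder values mirroring the IPProxy example.
var webProxy = new WebProxy("192.168.1.1", 8080)
{
    Credentials = new NetworkCredential("username", "password")
};

// Route all requests from this client through the proxy.
var handler = new HttpClientHandler { Proxy = webProxy, UseProxy = true };
using var client = new HttpClient(handler); // disposing the client also disposes the handler

Console.WriteLine($"{webProxy.Address?.Host}:{webProxy.Address?.Port}"); // 192.168.1.1:8080
```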

PageInstruction

Represents pagination instructions for URL construction.

Namespace: SMEAppHouse.Core.ScraperBox.Models

Properties:

  • char PadCharacter - Character used for padding
  • int PadLength - Length of padding
  • PaddingDirectionsEnum PaddingDirection - Direction of padding (Left or Right)

PaddingDirectionsEnum:

  • ToLeft - Pad to the left
  • ToRight - Pad to the right

Extension Method:

PageNo
public static string PageNo(this PageInstruction pgInstruction, int pageNo)

Formats a page number according to the instruction.

Example:

var instruction = new PageInstruction
{
    PadCharacter = '0',
    PadLength = 3,
    PaddingDirection = PageInstruction.PaddingDirectionsEnum.ToLeft
};

var pageNumber = instruction.PageNo(5); // Returns "005"
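
The padding rule can be reproduced with string.PadLeft and string.PadRight. The sketch below is a stand-in for illustration, not the extension method itself; FormatPageNo and its parameters are hypothetical.

```csharp
using System;

// Hypothetical stand-in for PageNo: pads the page number on the requested side.
static string FormatPageNo(int pageNo, char padChar, int padLength, bool padLeft) =>
    padLeft
        ? pageNo.ToString().PadLeft(padLength, padChar)
        : pageNo.ToString().PadRight(padLength, padChar);

Console.WriteLine(FormatPageNo(5, '0', 3, padLeft: true));    // 005
Console.WriteLine(FormatPageNo(5, '0', 3, padLeft: false));   // 500
Console.WriteLine(FormatPageNo(1234, '0', 3, padLeft: true)); // 1234
```

In this sketch, numbers longer than the pad length are returned unchanged.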

UserAgents

Type-safe enum pattern for user agent strings.

Namespace: SMEAppHouse.Core.ScraperBox.Models

Static Properties:

  • UserAgents Mozilla22
  • UserAgents FireFox36
  • UserAgents FireFox33
  • UserAgents Chrome41022280
  • UserAgents InternetExplorer8

Methods:

GetFakeUserAgent
public static FakeUserAgent GetFakeUserAgent(UserAgents userAgent)

Gets the FakeUserAgent entry for the given user agent; its UserAgent property holds the user agent string.

Example:

var userAgent = UserAgents.GetFakeUserAgent(UserAgents.Chrome41022280);
// userAgent.UserAgent is "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36..."

AuthenticationMethod

Type-safe enum for authentication methods.

Namespace: SMEAppHouse.Core.ScraperBox.Models

Static Properties:

  • AuthenticationMethod FORMS
  • AuthenticationMethod WINDOWSAUTHENTICATION
  • AuthenticationMethod SINGLESIGNON

3. Rules and Enums

HttpOpsRules

HTTP operation rules and constants.

Namespace: SMEAppHouse.Core.ScraperBox

HttpMethodConsts Enum:

  • GET, POST, PUT, HEAD, TRACE, DELETE, SEARCH, CONNECT, PROPFIND, PROPPATCH, PATCH, MKCOL, COPY, MOVE, LOCK, UNLOCK, OPTIONS

ContentTypeConsts Enum:

  • Xml
  • Json

IPProxyRules

IP proxy rules and enums.

Namespace: SMEAppHouse.Core.ScraperBox

ProxyAnonymityLevelsEnum:

  • Elite - Highly anonymous (Level 1)
  • Anonymous - Anonymous (Level 2)
  • Transparent - Transparent (Level 3)

ProxySpeedsEnum:

  • Slow
  • Medium
  • Fast

ProxyConnectionSpeedsEnum:

  • Slow
  • Medium
  • Fast

ProxyProtocolsEnum:

  • HTTP
  • HTTPS
  • SOCKS4_5

Complete Usage Examples

Example 1: Basic Web Scraping

using System.Collections.Generic;
using System.Linq;
using SMEAppHouse.Core.ScraperBox;
using HtmlAgilityPack;

// Fetch a web page
var html = Helper.GetPageDocument("https://example.com");

// Parse HTML
var doc = new HtmlDocument();
doc.LoadHtml(html);

// Extract data
var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;

// SelectNodes returns null when nothing matches, so fall back to an empty list
var links = doc.DocumentNode.SelectNodes("//a[@href]")
    ?.Select(a => a.GetAttributeValue("href", ""))
    .ToList() ?? new List<string>();

Example 2: Scraping with Proxy

using SMEAppHouse.Core.ScraperBox;
using SMEAppHouse.Core.ScraperBox.Models;

// Create proxy
var proxy = new IPProxy
{
    IPAddress = "192.168.1.1",
    PortNo = 8080,
    Protocol = IPProxyRules.ProxyProtocolsEnum.HTTP,
    AnonymityLevel = IPProxyRules.ProxyAnonymityLevelsEnum.Elite,
    Credential = new Tuple<string, string>("user", "pass")
};

// Fetch with proxy
var webProxy = proxy.ToWebProxy();
string errorData = null;
var html = Helper.GetPageDocument("https://example.com", webProxy, ref errorData);

if (!string.IsNullOrEmpty(errorData))
{
    Console.WriteLine($"Error: {errorData}");
}

Example 3: HTML Processing

using SMEAppHouse.Core.ScraperBox;
using HtmlAgilityPack;

var html = "<div><p>Hello <b>World</b> &amp; Friends</p></div>";

// Clean HTML
var cleaned = Helper.CleanupHtmlStrains(html, allTrim: true);

// Remove comments
var noComments = Helper.RemoveHtmlComments(html);

// Remove all tags except <p>
var paragraphsOnly = Helper.RemoveUnwantedTags(html, new[] { "p" });

// Get inner text
var doc = new HtmlDocument();
doc.LoadHtml(html);
var node = doc.DocumentNode.SelectSingleNode("//div");
var innerText = Helper.GetInnerText(node, "b"); // Removes <b> tags

Example 4: Pagination

using SMEAppHouse.Core.ScraperBox.Models;

var instruction = new PageInstruction
{
    PadCharacter = '0',
    PadLength = 3,
    PaddingDirection = PageInstruction.PaddingDirectionsEnum.ToLeft
};

// Generate paginated URLs
for (int i = 1; i <= 10; i++)
{
    var pageNumber = instruction.PageNo(i); // "001", "002", ..., "010"
    var url = $"https://example.com/page/{pageNumber}";
    Console.WriteLine(url);
}

Example 5: Node Collection Extraction

using SMEAppHouse.Core.ScraperBox;
using HtmlAgilityPack;

var html = Helper.GetPageDocument("https://example.com/products");
var doc = new HtmlDocument();
doc.LoadHtml(html);

// Get all product nodes
var products = Helper.GetNodeCollection(doc.DocumentNode, "div", "class", "product");

foreach (var product in products)
{
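    // The string arguments of GetInnerText are tags to remove, so select the child nodes with XPath
    var name = product.SelectSingleNode(".//h2")?.InnerText.Trim();
    var price = product.SelectSingleNode(".//span[@class='price']")?.InnerText.Trim();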
    Console.WriteLine($"{name}: {price}");
}

Example 6: URL Processing

using SMEAppHouse.Core.ScraperBox;

// Resolve protocol-relative URLs
var url1 = Helper.ResolveHttpUrl("//example.com/page"); // "http://example.com/page"
var url2 = Helper.ResolveHttpUrl("https://example.com/page"); // "https://example.com/page"

// Extract domain
var domain = Helper.ExtractDomainNameFromUrl("https://www.example.com/path/page");
// Returns "www.example.com"

// Validate URL
bool isValid = Helper.IsURLValid("https://example.com");

// Encode query string
var encoded = Helper.EncodeQueryStringSegment("search query & filter");
// Returns "search%20query%20%26%20filter"

Example 7: User Agent Usage

using SMEAppHouse.Core.ScraperBox.Models;
using System.Net.Http;

var userAgent = UserAgents.GetFakeUserAgent(UserAgents.Chrome41022280);

var client = new HttpClient();
client.DefaultRequestHeaders.Add("User-Agent", userAgent.UserAgent);

var response = await client.GetAsync("https://example.com");
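
A per-request alternative is to set the header on an HttpRequestMessage. The sketch below uses TryAddWithoutValidation, which skips the strict parsing that Add performs on User-Agent values; the user agent string is a shortened placeholder standing in for FakeUserAgent.UserAgent.

```csharp
using System;
using System.Net.Http;

// Placeholder user agent string; substitute the value from FakeUserAgent.UserAgent.
const string userAgentString = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36";

using var request = new HttpRequestMessage(HttpMethod.Get, "https://example.com");

// TryAddWithoutValidation returns false instead of throwing if the header cannot be added.
bool added = request.Headers.TryAddWithoutValidation("User-Agent", userAgentString);
Console.WriteLine(added); // True

// The request would then be sent with: await client.SendAsync(request);
```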

Key Features

  1. Web Page Fetching: Multiple methods for retrieving web page content
  2. Proxy Support: Full support for HTTP proxies with authentication
  3. HTML Parsing: Integration with HtmlAgilityPack for DOM manipulation
  4. HTML Cleaning: Utilities for cleaning and processing HTML
  5. Node Operations: Methods for finding and extracting HTML nodes
  6. Pagination Support: Helper for constructing paginated URLs
  7. User Agent Management: Predefined user agent strings
  8. URL Processing: Utilities for URL manipulation and validation

Dependencies

  • HtmlAgilityPack (v1.12.3)
  • ScrapySharp (v3.0.0)
  • SMEAppHouse.Core.CodeKits

Notes

  • Uses HtmlAgilityPack for HTML parsing
  • Proxy support includes authentication via credentials
  • All HTML operations are case-sensitive for tag names
  • XPath expressions are supported for node selection
  • User agent strings are predefined for common browsers
  • PageInstruction format: '{padChar}-{padLength}-{direction}' where direction is 0 (left) or 1 (right)
  • Proxy protocols support HTTP, HTTPS, and SOCKS4/5
  • Anonymity levels indicate how well the proxy hides your IP
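
The PageInstruction format string mentioned in the notes can be split into its three fields as sketched below. ParseInstruction is a hypothetical name; how the library itself reads this string is not documented here, and the sketch assumes the pad character is never '-' itself.

```csharp
using System;

// Splits "{padChar}-{padLength}-{direction}" into its three fields.
// Direction 0 means pad left, 1 means pad right.
static (char PadChar, int PadLength, bool PadLeft) ParseInstruction(string format)
{
    var parts = format.Split('-');
    if (parts.Length != 3 || parts[0].Length != 1)
        throw new FormatException($"Unexpected instruction format: {format}");
    return (parts[0][0], int.Parse(parts[1]), parts[2] == "0");
}

var (padChar, padLength, padLeft) = ParseInstruction("0-3-0");
Console.WriteLine($"{padChar} {padLength} {padLeft}"); // 0 3 True
```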

License

Copyright © SME App House 2025

Compatible and additional computed target framework versions

.NET: net8.0 is compatible. The following were computed: net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows, net10.0, net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows.

NuGet packages (2)

Showing the top 2 NuGet packages that depend on SMEAppHouse.Core.ScraperBox:

Package Downloads
SMEAppHouse.Core.ScraperBox.Selenium

Library for handling Selenium functionalities.

SMEAppHouse.Core.FreeIPProxy

Library for generating proxy IPs that can be used to make HTTP requests anonymously.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last Updated
9.0.9 142 11/29/2025
9.0.8 142 11/29/2025
9.0.7 115 11/29/2025
1.4.1906.15 828 6/12/2019
1.4.1906.14 785 6/9/2019
1.4.1811.13 738 6/9/2019
