SMEAppHouse.Core.ScraperBox 9.0.9

Installation

.NET CLI:
dotnet add package SMEAppHouse.Core.ScraperBox --version 9.0.9

Package Manager Console:
NuGet\Install-Package SMEAppHouse.Core.ScraperBox -Version 9.0.9

PackageReference (project file):
<PackageReference Include="SMEAppHouse.Core.ScraperBox" Version="9.0.9" />

Central Package Management:
<PackageVersion Include="SMEAppHouse.Core.ScraperBox" Version="9.0.9" />
<PackageReference Include="SMEAppHouse.Core.ScraperBox" />

Paket CLI:
paket add SMEAppHouse.Core.ScraperBox --version 9.0.9

F# Interactive / Polyglot Notebooks:
#r "nuget: SMEAppHouse.Core.ScraperBox, 9.0.9"

File-based apps:
#:package SMEAppHouse.Core.ScraperBox@9.0.9

Cake:
#addin nuget:?package=SMEAppHouse.Core.ScraperBox&version=9.0.9
#tool nuget:?package=SMEAppHouse.Core.ScraperBox&version=9.0.9
Overview
SMEAppHouse.Core.ScraperBox is a library for web scraping operations. It provides utilities for fetching web pages, parsing HTML, working with proxies, handling cookies, and various helper methods for web scraping tasks.
Target Framework: .NET 8.0
Namespace: SMEAppHouse.Core.ScraperBox
Public Classes and Utilities
1. Helper (Static Class)
Main utility class for web scraping operations.
Namespace: SMEAppHouse.Core.ScraperBox
URL and HTTP Operations
ResolveHttpUrl
public static string ResolveHttpUrl(string url)
Resolves URLs that start with // to http://.
Example:
var url = Helper.ResolveHttpUrl("//example.com/page"); // Returns "http://example.com/page"
ExtractDomainNameFromUrl
public static string ExtractDomainNameFromUrl(string url, bool retainHttPrefix = false)
Extracts the domain name from a URL.
Example:
var domain = Helper.ExtractDomainNameFromUrl("https://www.example.com/path/page");
// Returns "www.example.com" or "https://www.example.com" if retainHttPrefix is true
IsURLValid
public static bool IsURLValid(string url, bool brute = false)
Checks whether the given string is a valid URL.
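A minimal usage sketch (the exact semantics of the `brute` flag are not documented above, so it is left at its default here):

```csharp
using SMEAppHouse.Core.ScraperBox;

// Validate before attempting a fetch
var candidate = "https://example.com/products?page=2";
if (Helper.IsURLValid(candidate))
{
    var html = Helper.GetPageDocument(candidate);
}
```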
Page Document Retrieval
GetPageDocument (Multiple Overloads)
public static string GetPageDocument(string site)
public static string GetPageDocument(string site, IWebProxy webProxy, ref string extraDataOnError)
public static string GetPageDocument(Uri site, ...)
public static string GetPageDocument(string sourceUrl, ref string extraDataOnError, IWebProxy webProxy = null, ...)
Retrieves HTML content from a web page with optional proxy support.
Example:
// Simple fetch
var html = Helper.GetPageDocument("https://example.com");
// With proxy
var proxy = new WebProxy("127.0.0.1", 8080);
string errorData = null;
var htmlViaProxy = Helper.GetPageDocument("https://example.com", proxy, ref errorData);
GetPageDocumentWithCookie
public static string GetPageDocumentWithCookie(string url)
Retrieves page content with cookie support.
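A short sketch, assuming the method manages its cookie container internally across the request:

```csharp
using SMEAppHouse.Core.ScraperBox;

// Fetch a page from a site that requires cookies
// (e.g. session or consent cookies) to be echoed back.
var html = Helper.GetPageDocumentWithCookie("https://example.com/account");
```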
HTML Processing
Resolve
public static string Resolve(string val, bool allTrim = false, params string[] otherElementsToClear)
Cleans and resolves HTML entities and encoded characters.
Example:
var cleaned = Helper.Resolve("&Hello%20World", allTrim: true);
// Returns "&Hello World"
CleanupHtmlStrains
public static string CleanupHtmlStrains(string val, bool allTrim = false)
Removes HTML entities and unwanted characters.
RemoveHtmlComments
public static string RemoveHtmlComments(string sourceHtml)
Removes HTML comments from source.
RemoveUnwantedTags
public static string RemoveUnwantedTags(string data)
public static string RemoveUnwantedTags(string data, string[] acceptableTags)
Removes unwanted HTML tags, optionally keeping only specified tags.
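A sketch of both overloads; the whitelisting behavior of `acceptableTags` is assumed from the signature and description:

```csharp
using SMEAppHouse.Core.ScraperBox;

var html = "<div><script>alert(1);</script><p>Keep <b>bold</b> text</p></div>";

// Remove the library's default set of unwanted tags
var stripped = Helper.RemoveUnwantedTags(html);

// Keep only <p> and <b>; all other tags are stripped
var filtered = Helper.RemoveUnwantedTags(html, new[] { "p", "b" });
```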
HTML Node Operations
GetInnerText (Multiple Overloads)
public static string GetInnerText(HtmlNode sourceNode, ...)
public static string GetInnerText(HtmlNode node, params string[] tagsToRemove)
public static string GetInnerText(HtmlNode node, bool removeCommentTags = true, params string[] tagsToRemove)
public static string GetInnerText(string sourceHtml, params string[] tagsToRemove)
public static string GetInnerText(string sourceHtml, bool removeCommentTags = true, params string[] tagsToRemove)
Extracts inner text from HTML nodes or strings.
Example:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml("<div><p>Hello <b>World</b></p></div>");
var node = htmlDoc.DocumentNode.SelectSingleNode("//div");
var text = Helper.GetInnerText(node, "b"); // Returns "Hello World" (removes <b> tags)
GetNode
public static HtmlNode GetNode(HtmlNode node, ...)
Gets a specific HTML node using XPath or other selectors.
GetNodeByInnerHtml
public static HtmlNode GetNodeByInnerHtml(HtmlNode node, ...)
Finds a node by its inner HTML content.
GetNodeByAttribute
public static HtmlNode GetNodeByAttribute(HtmlNode node, ...)
Finds a node by attribute value.
GetNodeCollection
public static IEnumerable<HtmlNode> GetNodeCollection(HtmlNode node, ...)
public static IEnumerable<HtmlNode> GetNodeCollection(HtmlNode node, params string[] element)
Gets a collection of HTML nodes.
Query String Operations
EncodeQueryStringSegment
public static string EncodeQueryStringSegment(string query)
Encodes query string segments.
Example:
var encoded = Helper.EncodeQueryStringSegment("hello world & test");
// Returns "hello%20world%20%26%20test"
Proxy Operations
FindProxyCountryFromPartial
public static Rules.WorldCountriesEnum FindProxyCountryFromPartial(string countryNamePartial)
Finds a country enum value from a partial country name.
Example:
var country = Helper.FindProxyCountryFromPartial("united"); // Returns WorldCountriesEnum.UNITED_STATES
2. Models
IPProxy
Represents an IP proxy server.
Namespace: SMEAppHouse.Core.ScraperBox.Models
Properties:
- Guid Id - Unique identifier
- string ProviderId - Proxy provider ID
- string IPAddress - Proxy IP address
- int PortNo - Proxy port number
- Rules.WorldCountriesEnum Country - Country of proxy
- IPProxyRules.ProxyAnonymityLevelsEnum AnonymityLevel - Anonymity level
- IPProxyRules.ProxyProtocolsEnum Protocol - Protocol (HTTP, HTTPS, SOCKS)
- DateTime LastChecked - Last validation time
- int ResponseRate - Response rate percentage
- int SpeedRate - Speed in milliseconds
- TimeSpan SpeedTimeSpan - Speed as TimeSpan
- string ISP - Internet Service Provider
- string City - City location
- IPProxyRules.ProxySpeedsEnum Speed - Speed category
- IPProxyRules.ProxyConnectionSpeedsEnum ConnectionTime - Connection speed
- Guid CheckerTokenId - Checker token identifier
- CheckStatusEnum CheckStatus - Current check status
- Tuple<string, string> Credential - Username/password credentials
Methods:
ToWebProxy
public IWebProxy ToWebProxy()
Converts to IWebProxy for use with HTTP clients.
ToNetworkCredential
public NetworkCredential ToNetworkCredential()
Converts credentials to NetworkCredential.
AsTuple
public Tuple<string, string> AsTuple()
Returns IP and port as a tuple.
GetLastValidationElapsedTime
public TimeSpan GetLastValidationElapsedTime()
Gets time elapsed since last validation.
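For example, a proxy pool might use this to decide when a re-check is due (a sketch; the 10-minute threshold is arbitrary):

```csharp
using System;
using SMEAppHouse.Core.ScraperBox.Models;

var proxy = new IPProxy { IPAddress = "10.0.0.1", PortNo = 3128 };

// Re-validate only if the last check is older than 10 minutes
if (proxy.GetLastValidationElapsedTime() > TimeSpan.FromMinutes(10))
{
    // trigger a fresh validation of this proxy here
}
```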
CheckStatusEnum:
- NotChecked
- Checking
- Checked
- CheckedInvalid
Example:
var proxy = new IPProxy
{
IPAddress = "192.168.1.1",
PortNo = 8080,
Country = Rules.WorldCountriesEnum.UNITED_STATES,
AnonymityLevel = IPProxyRules.ProxyAnonymityLevelsEnum.Elite,
Protocol = IPProxyRules.ProxyProtocolsEnum.HTTP,
Credential = new Tuple<string, string>("username", "password")
};
// Use with HTTP client
var webProxy = proxy.ToWebProxy();
var credential = proxy.ToNetworkCredential();
PageInstruction
Represents pagination instructions for URL construction.
Namespace: SMEAppHouse.Core.ScraperBox.Models
Properties:
- char PadCharacter - Character used for padding
- int PadLength - Length of padding
- PaddingDirectionsEnum PaddingDirection - Direction of padding (Left or Right)
PaddingDirectionsEnum:
- ToLeft - Pad to the left
- ToRight - Pad to the right
Extension Method:
PageNo
public static string PageNo(this PageInstruction pgInstruction, int pageNo)
Formats a page number according to the instruction.
Example:
var instruction = new PageInstruction
{
PadCharacter = '0',
PadLength = 3,
PaddingDirection = PageInstruction.PaddingDirectionsEnum.ToLeft
};
var pageNumber = instruction.PageNo(5); // Returns "005"
UserAgents
Type-safe enum pattern for user agent strings.
Namespace: SMEAppHouse.Core.ScraperBox.Models
Static Properties:
- UserAgents Mozilla22
- UserAgents FireFox36
- UserAgents FireFox33
- UserAgents Chrome41022280
- UserAgents InternetExplorer8
Methods:
GetFakeUserAgent
public static FakeUserAgent GetFakeUserAgent(UserAgents userAgent)
Gets the fake user agent string.
Example:
var userAgent = UserAgents.GetFakeUserAgent(UserAgents.Chrome41022280);
// Returns "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36..."
AuthenticationMethod
Type-safe enum for authentication methods.
Namespace: SMEAppHouse.Core.ScraperBox.Models
Static Properties:
- AuthenticationMethod FORMS
- AuthenticationMethod WINDOWSAUTHENTICATION
- AuthenticationMethod SINGLESIGNON
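Since this is a type-safe enum (static instances rather than a C# enum), values can be compared directly; a sketch:

```csharp
using SMEAppHouse.Core.ScraperBox.Models;

// Choose an authentication strategy for a scraping session
var method = AuthenticationMethod.FORMS;

if (method == AuthenticationMethod.FORMS)
{
    // e.g. POST the login form before requesting protected pages
}
```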
3. Rules and Enums
HttpOpsRules
HTTP operation rules and constants.
Namespace: SMEAppHouse.Core.ScraperBox
HttpMethodConsts Enum:
GET, POST, PUT, HEAD, TRACE, DELETE, SEARCH, CONNECT, PROPFIND, PROPPATCH, PATCH, MKCOL, COPY, MOVE, LOCK, UNLOCK, OPTIONS
ContentTypeConsts Enum:
Xml, Json
IPProxyRules
IP proxy rules and enums.
Namespace: SMEAppHouse.Core.ScraperBox
ProxyAnonymityLevelsEnum:
- Elite - Highly anonymous (Level 1)
- Anonymous - Anonymous (Level 2)
- Transparent - Transparent (Level 3)
ProxySpeedsEnum:
Slow, Medium, Fast
ProxyConnectionSpeedsEnum:
Slow, Medium, Fast
ProxyProtocolsEnum:
HTTP, HTTPS, SOCKS4_5
Complete Usage Examples
Example 1: Basic Web Scraping
using SMEAppHouse.Core.ScraperBox;
using HtmlAgilityPack;
// Fetch a web page
var html = Helper.GetPageDocument("https://example.com");
// Parse HTML
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Extract data
var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
var links = doc.DocumentNode.SelectNodes("//a[@href]")
.Select(a => a.GetAttributeValue("href", ""))
.ToList();
Example 2: Scraping with Proxy
using SMEAppHouse.Core.ScraperBox;
using SMEAppHouse.Core.ScraperBox.Models;
// Create proxy
var proxy = new IPProxy
{
IPAddress = "192.168.1.1",
PortNo = 8080,
Protocol = IPProxyRules.ProxyProtocolsEnum.HTTP,
AnonymityLevel = IPProxyRules.ProxyAnonymityLevelsEnum.Elite,
Credential = new Tuple<string, string>("user", "pass")
};
// Fetch with proxy
var webProxy = proxy.ToWebProxy();
string errorData = null;
var html = Helper.GetPageDocument("https://example.com", webProxy, ref errorData);
if (!string.IsNullOrEmpty(errorData))
{
Console.WriteLine($"Error: {errorData}");
}
Example 3: HTML Processing
using SMEAppHouse.Core.ScraperBox;
using HtmlAgilityPack;
var html = "<div><p>Hello <b>World</b> & Friends</p></div>";
// Clean HTML
var cleaned = Helper.CleanupHtmlStrains(html, allTrim: true);
// Remove comments
var noComments = Helper.RemoveHtmlComments(html);
// Remove unwanted tags
var textOnly = Helper.RemoveUnwantedTags(html, new[] { "p" }); // Keep only <p> tags
// Get inner text
var doc = new HtmlDocument();
doc.LoadHtml(html);
var node = doc.DocumentNode.SelectSingleNode("//div");
var innerText = Helper.GetInnerText(node, "b"); // Removes <b> tags
Example 4: Pagination
using SMEAppHouse.Core.ScraperBox.Models;
var instruction = new PageInstruction
{
PadCharacter = '0',
PadLength = 3,
PaddingDirection = PageInstruction.PaddingDirectionsEnum.ToLeft
};
// Generate paginated URLs
for (int i = 1; i <= 10; i++)
{
var pageNumber = instruction.PageNo(i); // "001", "002", ..., "010"
var url = $"https://example.com/page/{pageNumber}";
Console.WriteLine(url);
}
Example 5: Node Collection Extraction
using SMEAppHouse.Core.ScraperBox;
using HtmlAgilityPack;
var html = Helper.GetPageDocument("https://example.com/products");
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Get all product nodes
var products = Helper.GetNodeCollection(doc.DocumentNode, "div", "class", "product");
foreach (var product in products)
{
var name = Helper.GetInnerText(product, "h2");
var price = Helper.GetInnerText(product, "span", "class", "price");
Console.WriteLine($"{name}: {price}");
}
Example 6: URL Processing
using SMEAppHouse.Core.ScraperBox;
// Resolve relative URLs
var url1 = Helper.ResolveHttpUrl("//example.com/page"); // "http://example.com/page"
var url2 = Helper.ResolveHttpUrl("https://example.com/page"); // "https://example.com/page"
// Extract domain
var domain = Helper.ExtractDomainNameFromUrl("https://www.example.com/path/page");
// Returns "www.example.com"
// Validate URL
bool isValid = Helper.IsURLValid("https://example.com");
// Encode query string
var encoded = Helper.EncodeQueryStringSegment("search query & filter");
// Returns "search%20query%20%26%20filter"
Example 7: User Agent Usage
using SMEAppHouse.Core.ScraperBox.Models;
using System.Net.Http;
var userAgent = UserAgents.GetFakeUserAgent(UserAgents.Chrome41022280);
var client = new HttpClient();
client.DefaultRequestHeaders.Add("User-Agent", userAgent.UserAgent);
var response = await client.GetAsync("https://example.com");
Key Features
- Web Page Fetching: Multiple methods for retrieving web page content
- Proxy Support: Full support for HTTP proxies with authentication
- HTML Parsing: Integration with HtmlAgilityPack for DOM manipulation
- HTML Cleaning: Utilities for cleaning and processing HTML
- Node Operations: Methods for finding and extracting HTML nodes
- Pagination Support: Helper for constructing paginated URLs
- User Agent Management: Predefined user agent strings
- URL Processing: Utilities for URL manipulation and validation
Dependencies
- HtmlAgilityPack (v1.12.3)
- ScrapySharp (v3.0.0)
- SMEAppHouse.Core.CodeKits
Notes
- Uses HtmlAgilityPack for HTML parsing
- Proxy support includes authentication via credentials
- All HTML operations are case-sensitive for tag names
- XPath expressions are supported for node selection
- User agent strings are predefined for common browsers
- PageInstruction format string: '{padChar}-{padLength}-{direction}', where direction is 0 (left) or 1 (right)
- Proxy protocols support HTTP, HTTPS, and SOCKS4/5
- Anonymity levels indicate how well the proxy hides your IP
License
Copyright © SME App House 2025
Compatibility: net8.0 is compatible; net9.0, net10.0, and their platform-specific targets (android, browser, ios, maccatalyst, macos, tvos, windows) are computed as compatible.
NuGet packages (2)
The following NuGet packages depend on SMEAppHouse.Core.ScraperBox:
- SMEAppHouse.Core.ScraperBox.Selenium - Library for handling Selenium functionalities.
- SMEAppHouse.Core.FreeIPProxy - Library for generating usable proxy IPs for making HTTP requests anonymously.
| Version | Downloads | Last Updated |
|---|---|---|
| 9.0.9 | 142 | 11/29/2025 |
| 9.0.8 | 142 | 11/29/2025 |
| 9.0.7 | 115 | 11/29/2025 |
| 1.4.1906.15 | 828 | 6/12/2019 |
| 1.4.1906.14 | 785 | 6/9/2019 |
| 1.4.1811.13 | 738 | 6/9/2019 |