SMEAppHouse.Core.ScraperBox 9.0.9

Installation

.NET CLI:
dotnet add package SMEAppHouse.Core.ScraperBox --version 9.0.9

Package Manager Console:
NuGet\Install-Package SMEAppHouse.Core.ScraperBox -Version 9.0.9

PackageReference (project file):
<PackageReference Include="SMEAppHouse.Core.ScraperBox" Version="9.0.9" />

Central Package Management:
<PackageVersion Include="SMEAppHouse.Core.ScraperBox" Version="9.0.9" />
<PackageReference Include="SMEAppHouse.Core.ScraperBox" />

Paket CLI:
paket add SMEAppHouse.Core.ScraperBox --version 9.0.9

F# Interactive / Polyglot Notebooks:
#r "nuget: SMEAppHouse.Core.ScraperBox, 9.0.9"

File-based apps:
#:package SMEAppHouse.Core.ScraperBox@9.0.9

Cake:
#addin nuget:?package=SMEAppHouse.Core.ScraperBox&version=9.0.9
#tool nuget:?package=SMEAppHouse.Core.ScraperBox&version=9.0.9
Overview
SMEAppHouse.Core.ScraperBox is a library for web scraping operations. It provides utilities for fetching web pages, parsing HTML, working with proxies, handling cookies, and various helper methods for web scraping tasks.
Target Framework: .NET 8.0
Namespace: SMEAppHouse.Core.ScraperBox
Public Classes and Utilities
1. Helper (Static Class)
Main utility class for web scraping operations.
Namespace: SMEAppHouse.Core.ScraperBox
URL and HTTP Operations
ResolveHttpUrl
public static string ResolveHttpUrl(string url)
Resolves URLs that start with // to http://.
Example:
var url = Helper.ResolveHttpUrl("//example.com/page"); // Returns "http://example.com/page"
ExtractDomainNameFromUrl
public static string ExtractDomainNameFromUrl(string url, bool retainHttPrefix = false)
Extracts the domain name from a URL.
Example:
var domain = Helper.ExtractDomainNameFromUrl("https://www.example.com/path/page");
// Returns "www.example.com" or "https://www.example.com" if retainHttPrefix is true
IsURLValid
public static bool IsURLValid(string url, bool brute = false)
Checks whether the given string is a valid URL.
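A minimal usage sketch (the exact semantics of the `brute` flag are not documented above, so it is left at its default here):

```csharp
using SMEAppHouse.Core.ScraperBox;

// Validate before attempting a fetch
var candidate = "https://example.com/products?page=2";
if (Helper.IsURLValid(candidate))
{
    var html = Helper.GetPageDocument(candidate);
}
```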
Page Document Retrieval
GetPageDocument (Multiple Overloads)
public static string GetPageDocument(string site)
public static string GetPageDocument(string site, IWebProxy webProxy, ref string extraDataOnError)
public static string GetPageDocument(Uri site, ...)
public static string GetPageDocument(string sourceUrl, ref string extraDataOnError, IWebProxy webProxy = null, ...)
Retrieves HTML content from a web page with optional proxy support.
Example:
// Simple fetch
var html = Helper.GetPageDocument("https://example.com");
// With proxy
var proxy = new WebProxy("127.0.0.1", 8080);
string errorData = null;
var htmlViaProxy = Helper.GetPageDocument("https://example.com", proxy, ref errorData);
GetPageDocumentWithCookie
public static string GetPageDocumentWithCookie(string url)
Retrieves page content with cookie support.
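A short sketch, assuming the method manages its cookie container internally across the request:

```csharp
using SMEAppHouse.Core.ScraperBox;

// Fetch a page from a site that requires cookies
// (e.g. session or consent cookies) to be echoed back.
var html = Helper.GetPageDocumentWithCookie("https://example.com/account");
```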
HTML Processing
Resolve
public static string Resolve(string val, bool allTrim = false, params string[] otherElementsToClear)
Cleans and resolves HTML entities and encoded characters.
Example:
var cleaned = Helper.Resolve("&Hello%20World", allTrim: true);
// Returns "&Hello World"
CleanupHtmlStrains
public static string CleanupHtmlStrains(string val, bool allTrim = false)
Removes HTML entities and unwanted characters.
RemoveHtmlComments
public static string RemoveHtmlComments(string sourceHtml)
Removes HTML comments from source.
RemoveUnwantedTags
public static string RemoveUnwantedTags(string data)
public static string RemoveUnwantedTags(string data, string[] acceptableTags)
Removes unwanted HTML tags, optionally keeping only specified tags.
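A sketch of both overloads; the whitelisting behavior of `acceptableTags` is assumed from the signature and description:

```csharp
using SMEAppHouse.Core.ScraperBox;

var html = "<div><script>alert(1);</script><p>Keep <b>bold</b> text</p></div>";

// Remove the library's default set of unwanted tags
var stripped = Helper.RemoveUnwantedTags(html);

// Keep only <p> and <b>; all other tags are stripped
var filtered = Helper.RemoveUnwantedTags(html, new[] { "p", "b" });
```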
HTML Node Operations
GetInnerText (Multiple Overloads)
public static string GetInnerText(HtmlNode sourceNode, ...)
public static string GetInnerText(HtmlNode node, params string[] tagsToRemove)
public static string GetInnerText(HtmlNode node, bool removeCommentTags = true, params string[] tagsToRemove)
public static string GetInnerText(string sourceHtml, params string[] tagsToRemove)
public static string GetInnerText(string sourceHtml, bool removeCommentTags = true, params string[] tagsToRemove)
Extracts inner text from HTML nodes or strings.
Example:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml("<div><p>Hello <b>World</b></p></div>");
var node = htmlDoc.DocumentNode.SelectSingleNode("//div");
var text = Helper.GetInnerText(node, "b"); // Returns "Hello World" (removes <b> tags)
GetNode
public static HtmlNode GetNode(HtmlNode node, ...)
Gets a specific HTML node using XPath or other selectors.
GetNodeByInnerHtml
public static HtmlNode GetNodeByInnerHtml(HtmlNode node, ...)
Finds a node by its inner HTML content.
GetNodeByAttribute
public static HtmlNode GetNodeByAttribute(HtmlNode node, ...)
Finds a node by attribute value.
GetNodeCollection
public static IEnumerable<HtmlNode> GetNodeCollection(HtmlNode node, ...)
public static IEnumerable<HtmlNode> GetNodeCollection(HtmlNode node, params string[] element)
Gets a collection of HTML nodes.
Query String Operations
EncodeQueryStringSegment
public static string EncodeQueryStringSegment(string query)
Encodes query string segments.
Example:
var encoded = Helper.EncodeQueryStringSegment("hello world & test");
// Returns "hello%20world%20%26%20test"
Proxy Operations
FindProxyCountryFromPartial
public static Rules.WorldCountriesEnum FindProxyCountryFromPartial(string countryNamePartial)
Finds a country enum value from a partial country name.
Example:
var country = Helper.FindProxyCountryFromPartial("united"); // Returns WorldCountriesEnum.UNITED_STATES
2. Models
IPProxy
Represents an IP proxy server.
Namespace: SMEAppHouse.Core.ScraperBox.Models
Properties:
- Guid Id - Unique identifier
- string ProviderId - Proxy provider ID
- string IPAddress - Proxy IP address
- int PortNo - Proxy port number
- Rules.WorldCountriesEnum Country - Country of proxy
- IPProxyRules.ProxyAnonymityLevelsEnum AnonymityLevel - Anonymity level
- IPProxyRules.ProxyProtocolsEnum Protocol - Protocol (HTTP, HTTPS, SOCKS)
- DateTime LastChecked - Last validation time
- int ResponseRate - Response rate percentage
- int SpeedRate - Speed in milliseconds
- TimeSpan SpeedTimeSpan - Speed as TimeSpan
- string ISP - Internet Service Provider
- string City - City location
- IPProxyRules.ProxySpeedsEnum Speed - Speed category
- IPProxyRules.ProxyConnectionSpeedsEnum ConnectionTime - Connection speed
- Guid CheckerTokenId - Checker token identifier
- CheckStatusEnum CheckStatus - Current check status
- Tuple<string, string> Credential - Username/password credentials
Methods:
ToWebProxy
public IWebProxy ToWebProxy()
Converts to IWebProxy for use with HTTP clients.
ToNetworkCredential
public NetworkCredential ToNetworkCredential()
Converts credentials to NetworkCredential.
AsTuple
public Tuple<string, string> AsTuple()
Returns IP and port as a tuple.
GetLastValidationElapsedTime
public TimeSpan GetLastValidationElapsedTime()
Gets time elapsed since last validation.
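For example, a proxy pool might use this to decide when a re-check is due (a sketch; the 10-minute threshold is arbitrary):

```csharp
using System;
using SMEAppHouse.Core.ScraperBox.Models;

var proxy = new IPProxy { IPAddress = "10.0.0.1", PortNo = 3128 };

// Re-validate only if the last check is older than 10 minutes
if (proxy.GetLastValidationElapsedTime() > TimeSpan.FromMinutes(10))
{
    // trigger a fresh validation of this proxy here
}
```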
CheckStatusEnum:
- NotChecked
- Checking
- Checked
- CheckedInvalid
Example:
var proxy = new IPProxy
{
IPAddress = "192.168.1.1",
PortNo = 8080,
Country = Rules.WorldCountriesEnum.UNITED_STATES,
AnonymityLevel = IPProxyRules.ProxyAnonymityLevelsEnum.Elite,
Protocol = IPProxyRules.ProxyProtocolsEnum.HTTP,
Credential = new Tuple<string, string>("username", "password")
};
// Use with HTTP client
var webProxy = proxy.ToWebProxy();
var credential = proxy.ToNetworkCredential();
PageInstruction
Represents pagination instructions for URL construction.
Namespace: SMEAppHouse.Core.ScraperBox.Models
Properties:
- char PadCharacter - Character used for padding
- int PadLength - Length of padding
- PaddingDirectionsEnum PaddingDirection - Direction of padding (Left or Right)
PaddingDirectionsEnum:
- ToLeft - Pad to the left
- ToRight - Pad to the right
Extension Method:
PageNo
public static string PageNo(this PageInstruction pgInstruction, int pageNo)
Formats a page number according to the instruction.
Example:
var instruction = new PageInstruction
{
PadCharacter = '0',
PadLength = 3,
PaddingDirection = PageInstruction.PaddingDirectionsEnum.ToLeft
};
var pageNumber = instruction.PageNo(5); // Returns "005"
UserAgents
Type-safe enum pattern for user agent strings.
Namespace: SMEAppHouse.Core.ScraperBox.Models
Static Properties:
- UserAgents Mozilla22
- UserAgents FireFox36
- UserAgents FireFox33
- UserAgents Chrome41022280
- UserAgents InternetExplorer8
Methods:
GetFakeUserAgent
public static FakeUserAgent GetFakeUserAgent(UserAgents userAgent)
Gets the fake user agent string.
Example:
var userAgent = UserAgents.GetFakeUserAgent(UserAgents.Chrome41022280);
// Returns "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36..."
AuthenticationMethod
Type-safe enum for authentication methods.
Namespace: SMEAppHouse.Core.ScraperBox.Models
Static Properties:
- AuthenticationMethod FORMS
- AuthenticationMethod WINDOWSAUTHENTICATION
- AuthenticationMethod SINGLESIGNON
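Since this is a type-safe enum (static instances rather than a C# enum), values can be compared directly; a sketch:

```csharp
using SMEAppHouse.Core.ScraperBox.Models;

// Choose an authentication strategy for a scraping session
var method = AuthenticationMethod.FORMS;

if (method == AuthenticationMethod.FORMS)
{
    // e.g. POST the login form before requesting protected pages
}
```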
3. Rules and Enums
HttpOpsRules
HTTP operation rules and constants.
Namespace: SMEAppHouse.Core.ScraperBox
HttpMethodConsts Enum:
GET, POST, PUT, HEAD, TRACE, DELETE, SEARCH, CONNECT, PROPFIND, PROPPATCH, PATCH, MKCOL, COPY, MOVE, LOCK, UNLOCK, OPTIONS
ContentTypeConsts Enum:
Xml, Json
IPProxyRules
IP proxy rules and enums.
Namespace: SMEAppHouse.Core.ScraperBox
ProxyAnonymityLevelsEnum:
- Elite - Highly anonymous (Level 1)
- Anonymous - Anonymous (Level 2)
- Transparent - Transparent (Level 3)
ProxySpeedsEnum:
Slow, Medium, Fast
ProxyConnectionSpeedsEnum:
Slow, Medium, Fast
ProxyProtocolsEnum:
HTTP, HTTPS, SOCKS4_5
Complete Usage Examples
Example 1: Basic Web Scraping
using SMEAppHouse.Core.ScraperBox;
using HtmlAgilityPack;
// Fetch a web page
var html = Helper.GetPageDocument("https://example.com");
// Parse HTML
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Extract data
var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
var links = doc.DocumentNode.SelectNodes("//a[@href]")
.Select(a => a.GetAttributeValue("href", ""))
.ToList();
Example 2: Scraping with Proxy
using SMEAppHouse.Core.ScraperBox;
using SMEAppHouse.Core.ScraperBox.Models;
// Create proxy
var proxy = new IPProxy
{
IPAddress = "192.168.1.1",
PortNo = 8080,
Protocol = IPProxyRules.ProxyProtocolsEnum.HTTP,
AnonymityLevel = IPProxyRules.ProxyAnonymityLevelsEnum.Elite,
Credential = new Tuple<string, string>("user", "pass")
};
// Fetch with proxy
var webProxy = proxy.ToWebProxy();
string errorData = null;
var html = Helper.GetPageDocument("https://example.com", webProxy, ref errorData);
if (!string.IsNullOrEmpty(errorData))
{
Console.WriteLine($"Error: {errorData}");
}
Example 3: HTML Processing
using SMEAppHouse.Core.ScraperBox;
using HtmlAgilityPack;
var html = "<div><p>Hello <b>World</b> & Friends</p></div>";
// Clean HTML
var cleaned = Helper.CleanupHtmlStrains(html, allTrim: true);
// Remove comments
var noComments = Helper.RemoveHtmlComments(html);
// Remove unwanted tags
var textOnly = Helper.RemoveUnwantedTags(html, new[] { "p" }); // Keep only <p> tags
// Get inner text
var doc = new HtmlDocument();
doc.LoadHtml(html);
var node = doc.DocumentNode.SelectSingleNode("//div");
var innerText = Helper.GetInnerText(node, "b"); // Removes <b> tags
Example 4: Pagination
using SMEAppHouse.Core.ScraperBox.Models;
var instruction = new PageInstruction
{
PadCharacter = '0',
PadLength = 3,
PaddingDirection = PageInstruction.PaddingDirectionsEnum.ToLeft
};
// Generate paginated URLs
for (int i = 1; i <= 10; i++)
{
var pageNumber = instruction.PageNo(i); // "001", "002", ..., "010"
var url = $"https://example.com/page/{pageNumber}";
Console.WriteLine(url);
}
Example 5: Node Collection Extraction
using SMEAppHouse.Core.ScraperBox;
using HtmlAgilityPack;
var html = Helper.GetPageDocument("https://example.com/products");
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Get all product nodes
var products = Helper.GetNodeCollection(doc.DocumentNode, "div", "class", "product");
foreach (var product in products)
{
var name = Helper.GetInnerText(product, "h2");
var price = Helper.GetInnerText(product, "span", "class", "price");
Console.WriteLine($"{name}: {price}");
}
Example 6: URL Processing
using SMEAppHouse.Core.ScraperBox;
// Resolve relative URLs
var url1 = Helper.ResolveHttpUrl("//example.com/page"); // "http://example.com/page"
var url2 = Helper.ResolveHttpUrl("https://example.com/page"); // "https://example.com/page"
// Extract domain
var domain = Helper.ExtractDomainNameFromUrl("https://www.example.com/path/page");
// Returns "www.example.com"
// Validate URL
bool isValid = Helper.IsURLValid("https://example.com");
// Encode query string
var encoded = Helper.EncodeQueryStringSegment("search query & filter");
// Returns "search%20query%20%26%20filter"
Example 7: User Agent Usage
using SMEAppHouse.Core.ScraperBox.Models;
using System.Net.Http;
var userAgent = UserAgents.GetFakeUserAgent(UserAgents.Chrome41022280);
var client = new HttpClient();
client.DefaultRequestHeaders.Add("User-Agent", userAgent.UserAgent);
var response = await client.GetAsync("https://example.com");
Key Features
- Web Page Fetching: Multiple methods for retrieving web page content
- Proxy Support: Full support for HTTP proxies with authentication
- HTML Parsing: Integration with HtmlAgilityPack for DOM manipulation
- HTML Cleaning: Utilities for cleaning and processing HTML
- Node Operations: Methods for finding and extracting HTML nodes
- Pagination Support: Helper for constructing paginated URLs
- User Agent Management: Predefined user agent strings
- URL Processing: Utilities for URL manipulation and validation
Dependencies
- HtmlAgilityPack (v1.12.3)
- ScrapySharp (v3.0.0)
- SMEAppHouse.Core.CodeKits
Notes
- Uses HtmlAgilityPack for HTML parsing
- Proxy support includes authentication via credentials
- All HTML operations are case-sensitive for tag names
- XPath expressions are supported for node selection
- User agent strings are predefined for common browsers
- PageInstruction format string: '{padChar}-{padLength}-{direction}', where direction is 0 (left) or 1 (right)
- Proxy protocols support HTTP, HTTPS, and SOCKS4/5
- Anonymity levels indicate how well the proxy hides your IP
License
Copyright © SME App House 2025
Compatibility: net8.0 is compatible; net9.0, net10.0, and their platform-specific targets (android, browser, ios, maccatalyst, macos, tvos, windows) are computed as compatible.
NuGet packages (2)
The following NuGet packages depend on SMEAppHouse.Core.ScraperBox:
- SMEAppHouse.Core.ScraperBox.Selenium - Library for handling Selenium functionalities.
- SMEAppHouse.Core.FreeIPProxy - Library for generating usable proxy IPs for making HTTP requests anonymously.
| Version | Downloads | Last Updated |
|---|---|---|
| 9.0.9 | 142 | 11/29/2025 |
| 9.0.8 | 142 | 11/29/2025 |
| 9.0.7 | 115 | 11/29/2025 |
| 1.4.1906.15 | 828 | 6/12/2019 |
| 1.4.1906.14 | 785 | 6/9/2019 |
| 1.4.1811.13 | 738 | 6/9/2019 |