Robbiblubber.Documents.Pdf
1.0.4
dotnet add package Robbiblubber.Documents.Pdf --version 1.0.4
robbiblubber.org Documents PDF Library
The robbiblubber.org Documents PDF Library provides utilities for working with PDF documents. Its primary component is the PdfTextExtractor, which offers general-purpose PDF text extraction with optional OCR support.
Its main features include:
- Native text extraction using PdfPig.
- Automatic OCR fallback using Tesseract for scanned pages.
- Optional table reconstruction in Markdown or TSV format.
- Automatic orientation detection and OCR preprocessing.
- Parallel page processing for improved performance.
The repository can be found at https://bitbucket.org/robbiblubber/robbiblubber.documents.pdf.net/. API documentation is available from https://robbiblubber.lima-city.org/. For general information see http://robbiblubber.org/.
Robbiblubber.Documents.Pdf depends on Tesseract. Applications using OCR should reference the Tesseract NuGet package directly so that the native runtime binaries are copied to the output directory.
The robbiblubber.org Documents PDF Library is provided under the MIT license and may be used under the terms of that license.
PdfTextExtractor
PdfTextExtractor is a class for extracting text from PDF documents. It provides general-purpose text extraction that aims for a good compromise between accuracy and performance, and it is built on available open-source libraries.
The library builds upon the following packages:
- Docnet.Core: https://www.nuget.org/packages/Docnet.Core
- PdfPig: https://www.nuget.org/packages/PdfPig
- Tesseract: https://www.nuget.org/packages/Tesseract
Tesseract data
Tesseract requires language data files to perform OCR. These files can be downloaded from the Tesseract GitHub repository: https://github.com/tesseract-ocr/. By default, the data files are expected in the "tessdata" directory within the application's working directory. The TessDataPath configuration option specifies a custom path for the Tesseract data files if needed.
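Missing language data is a common cause of OCR failures, so it can help to check for the files before starting an extraction. The following is a minimal sketch, not part of the library: the TessDataCheck class and its MissingLanguages method are hypothetical names, and the sketch relies only on the Tesseract convention that the data file for a language code is named "<code>.traineddata".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper; not part of Robbiblubber.Documents.Pdf.
public static class TessDataCheck
{
    // Splits a "+"-separated language string (for example "eng+deu") and
    // returns the codes for which no "<code>.traineddata" file exists
    // in the given tessdata directory.
    public static IReadOnlyList<string> MissingLanguages(string languages, string tessDataPath)
    {
        return languages
            .Split('+', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
            .Where(code => !File.Exists(Path.Combine(tessDataPath, code + ".traineddata")))
            .ToList();
    }
}
```

Calling TessDataCheck.MissingLanguages("eng+deu", "tessdata") before creating the extractor returns an empty list when both data files are present, and otherwise lists the codes whose files still need to be downloaded.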
Configuration definition
The PdfTextExtractor.Options class defines the configuration options for the PdfTextExtractor. The following options are available:
- Languages: The language codes for OCR, joined with "+". The default is "eng+deu" (English and German). Each language code must correspond to a Tesseract data file available in the "tessdata" directory.
- TessDataPath: The path to the Tesseract data files. If not specified, it defaults to the current value of the TESSDATA_PREFIX environment variable or the "tessdata" directory in the application's working directory.
- RenderDpi: The DPI to use when rendering PDF pages for OCR. The default is 300 DPI. Higher DPI can improve OCR accuracy but may increase processing time and memory usage.
- PreprocessForOcr: A boolean option to enable or disable preprocessing of images for OCR. When enabled, it applies image processing techniques to enhance the quality of the images before performing OCR, which can improve accuracy. The default is true.
- AlwaysOcr: A boolean option to force OCR on all pages, regardless of whether text extraction was successful. When enabled, it performs OCR on all pages, which can be useful for documents that contain a mix of text and images or for documents where text extraction may not be reliable. The default is false.
- OcrFallbackMinWords: The minimum number of words required for native text extraction to be considered successful. If the extracted text contains fewer words than this threshold, OCR is performed on the page. The default is 5.
- EnableOcrDiagnostics: A boolean option that allows OCR to write diagnostic information to stderr. The default is false.
- MinNativeLetters: The minimum number of letters required for native text extraction to be considered successful. If the extracted text contains fewer letters than this threshold, OCR is performed on the page. The default is 10.
- MaxDegreeOfParallelism: The maximum number of concurrent threads to use for processing PDF pages. The default corresponds to the number of logical processors on the system. Adjusting this value can help optimize performance and resource utilization.
- Mode: The text extraction mode, which mainly determines how tables are represented in the extracted text. Mode.STRUCTURED_TABLES produces a table representation that is better suited to further processing, while Mode.LAYOUT_TEXT typically produces more human-readable output. The default is STRUCTURED_TABLES.
- TableFormat: The format to use for representing tables in the extracted text. TableFormat.MARKDOWN will render tables in Markdown format, TableFormat.TSV will render tables in TSV format. The default is MARKDOWN.
- FilterGarbageLines: A boolean option to enable or disable filtering of lines that are likely to be garbage. When enabled, it applies heuristics to identify and remove lines that are unlikely to contain meaningful text, which can help improve the quality of the extracted text. The default is true.
- ScoreWords: A list of words used to identify valid text. This is used for a more confident orientation detection. The default contains common words in English and German. ScoreWords should be set if other languages are used.
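As an illustration of how several of these options combine, the following sketch configures the extractor for a low-quality scanned document: it forces OCR on every page, raises the render resolution, and limits parallelism to reduce memory usage. The option names come from the list above; the property types (string, int, bool), the tessdata path, and the file name "scan.pdf" are assumptions, so the API documentation should be consulted before relying on them.

```csharp
using System;
using Robbiblubber.Documents.Pdf;

// Assumption: property types are inferred from the option descriptions above.
PdfTextExtractor.Options options = new()
{
    Languages = "eng+deu",
    TessDataPath = "/usr/share/tessdata",   // hypothetical custom path
    RenderDpi = 400,                        // above the 300 DPI default
    AlwaysOcr = true,                       // skip the native-text heuristics
    MaxDegreeOfParallelism = 2,             // fewer pages rendered at once
    Mode = PdfTextExtractor.Mode.LAYOUT_TEXT
};

using PdfTextExtractor extractor = new(options);
Console.WriteLine(extractor.Extract("scan.pdf").Text);
```

Higher RenderDpi values increase per-page memory usage, which is why this sketch pairs the higher resolution with a lower MaxDegreeOfParallelism.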
Container/Linux distribution
A sample Dockerfile is included in the repository to demonstrate how to create a container image for Linux environments. See: https://bitbucket.org/robbiblubber/robbiblubber.documents.pdf.net/src/master/Robbiblubber.Documents.Pdf.Cli/Dockerfile
Example
```csharp
using System;
using Robbiblubber.Documents.Pdf;

// Configure the OCR languages and the table extraction mode.
PdfTextExtractor.Options options = new()
{
    Languages = "eng+deu",
    Mode = PdfTextExtractor.Mode.STRUCTURED_TABLES
};

// The using declaration disposes the extractor at the end of the scope.
using PdfTextExtractor extractor = new(options);
PdfExtractionResult result = extractor.Extract("example.pdf");
Console.WriteLine(result.Text);

// Report pages whose OCR confidence is low.
foreach(PageExtractionResult page in result.LowConfidencePages())
{
    Console.WriteLine($"Warning: Page {page.PageNumber} has low OCR confidence.");
}
```
| Product | Compatible and computed target framework versions |
|---|---|
| .NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies
- net10.0
  - Docnet.Core (>= 2.6.0)
  - PdfPig (>= 0.1.14)
  - Tesseract (>= 5.2.0)
- net8.0
  - Docnet.Core (>= 2.6.0)
  - PdfPig (>= 0.1.14)
  - Tesseract (>= 5.2.0)
| Version | Downloads | Last Updated |
|---|---|---|
| 1.0.4 | 99 | 6/15/2026 |