Robbiblubber.Documents.Pdf 1.0.4

dotnet add package Robbiblubber.Documents.Pdf --version 1.0.4
                    

robbiblubber.org Documents PDF Library

The robbiblubber.org Documents PDF Library provides utilities for working with PDF documents. Its primary component is the PdfTextExtractor, which offers general-purpose PDF text extraction with optional OCR support.

Its main features include:

  • Native text extraction using PdfPig.
  • Automatic OCR fallback using Tesseract for scanned pages.
  • Optional table reconstruction in Markdown or TSV format.
  • Automatic orientation detection and OCR preprocessing.
  • Parallel page processing for improved performance.

The repository can be found at https://bitbucket.org/robbiblubber/robbiblubber.documents.pdf.net/. API documentation is available from https://robbiblubber.lima-city.org/. For general information see http://robbiblubber.org/.

Robbiblubber.Documents.Pdf depends on Tesseract. Applications using OCR should reference the Tesseract NuGet package directly so that the native runtime binaries are copied to the output directory.

The robbiblubber.org Documents PDF Library is provided under the MIT license and may be used freely under the terms of that license.

PdfTextExtractor

PdfTextExtractor is a class for extracting text from PDF documents. It provides general-purpose text extraction that aims for a good compromise between accuracy and performance, and it is based on available open-source libraries.

The library builds upon open-source packages, including PdfPig for native text extraction and Tesseract for OCR.

Tesseract data

Tesseract requires language data files to perform OCR. These files can be downloaded from the Tesseract project on GitHub (https://github.com/tesseract-ocr/). By default, the data files are expected in the "tessdata" directory within the application's working directory. If they are stored elsewhere, the TessDataPath configuration option specifies a custom path.
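
A missing data file is easiest to catch before extraction starts. The following is a minimal sketch of such a check, not part of the library: the helper name MissingLanguageFiles is hypothetical, and the sketch assumes the usual Tesseract naming scheme of one "<code>.traineddata" file per language code. The lookup order shown (TESSDATA_PREFIX first, then "tessdata" in the working directory) follows the documented default of the TessDataPath option.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of Robbiblubber.Documents.Pdf:
// returns the traineddata file names that a language string such as
// "eng+deu" requires but that are absent from the given directory.
static List<string> MissingLanguageFiles(string tessDataPath, string languages)
{
    return languages
        .Split('+', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
        .Select(code => code + ".traineddata")
        .Where(file => !File.Exists(Path.Combine(tessDataPath, file)))
        .ToList();
}

// Assumed lookup order for this sketch: TESSDATA_PREFIX, then ./tessdata.
string tessData = Environment.GetEnvironmentVariable("TESSDATA_PREFIX")
                  ?? Path.Combine(Directory.GetCurrentDirectory(), "tessdata");

foreach(string file in MissingLanguageFiles(tessData, "eng+deu"))
{
    Console.Error.WriteLine($"Missing Tesseract data file: {Path.Combine(tessData, file)}");
}
```

Running the check at application startup turns a late OCR failure into an early, readable message that names the missing file.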

Configuration definition

The PdfTextExtractor.Options class defines the configuration options for the PdfTextExtractor. The following options are available:

  • Languages: The language codes to use for OCR, joined with "+". The default is "eng+deu" (English and German). Each language code must correspond to a Tesseract data file available in the "tessdata" directory.
  • TessDataPath: The path to the Tesseract data files. If not specified, it defaults to the current value of the TESSDATA_PREFIX environment variable or the "tessdata" directory in the application's working directory.
  • RenderDpi: The DPI to use when rendering PDF pages for OCR. The default is 300 DPI. Higher DPI can improve OCR accuracy but may increase processing time and memory usage.
  • PreprocessForOcr: A boolean option to enable or disable preprocessing of images for OCR. When enabled, it applies image processing techniques to enhance the quality of the images before performing OCR, which can improve accuracy. The default is true.
  • AlwaysOcr: A boolean option to force OCR on all pages, regardless of whether native text extraction was successful. This can be useful for documents that contain a mix of text and images or for documents where native text extraction may not be reliable. The default is false.
  • OcrFallbackMinWords: The minimum number of words required for native text extraction to be considered successful. If the extracted text contains fewer words than this threshold, OCR will be performed on the page. The default is 5.
  • EnableOcrDiagnostics: A boolean option that allows OCR to write diagnostic information to stderr. The default is false.
  • MinNativeLetters: The minimum number of letters required for native text extraction to be considered successful. If the extracted text contains fewer letters than this threshold, OCR will be performed on the page. The default is 10.
  • MaxDegreeOfParallelism: The maximum number of concurrent threads to use for processing PDF pages. The default corresponds to the number of logical processors on the system. Adjusting this value can help optimize performance and resource utilization.
  • Mode: The text extraction mode, which mainly determines how tables are represented in the extracted text. Mode.STRUCTURED_TABLES produces a table representation that is more suitable for further processing, while Mode.LAYOUT_TEXT typically produces more human-readable output. The default is STRUCTURED_TABLES.
  • TableFormat: The format to use for representing tables in the extracted text. TableFormat.MARKDOWN will render tables in Markdown format, TableFormat.TSV will render tables in TSV format. The default is MARKDOWN.
  • FilterGarbageLines: A boolean option to enable or disable filtering of lines that are likely to be garbage. When enabled, it applies heuristics to identify and remove lines that are unlikely to contain meaningful text, which can help improve the quality of the extracted text. The default is true.
  • ScoreWords: A list of words used to identify valid text. This is used for a more confident orientation detection. The default contains common words in English and German. ScoreWords should be set if other languages are used.
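
The interplay of AlwaysOcr, OcrFallbackMinWords, and MinNativeLetters can be illustrated with a short sketch. It is not the library's implementation: the helper name NeedsOcr is hypothetical, and counting whitespace-separated words and combining the two thresholds with a logical OR are assumptions made for illustration. Only the default values 5 and 10 are taken from the option list above.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of Robbiblubber.Documents.Pdf:
// decides whether the natively extracted text of a page is too sparse,
// so that the page should be processed with OCR instead.
static bool NeedsOcr(string nativeText, bool alwaysOcr = false, int minWords = 5, int minLetters = 10)
{
    if(alwaysOcr) { return true; }

    // Assumption: a word is any run of non-whitespace characters.
    int words = nativeText
        .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
        .Length;
    int letters = nativeText.Count(char.IsLetter);

    // Assumption: falling below either threshold triggers OCR.
    return words < minWords || letters < minLetters;
}

Console.WriteLine(NeedsOcr("Page 3"));                                       // True
Console.WriteLine(NeedsOcr("The quick brown fox jumps over the lazy dog"));  // False
```

A scanned page often yields only a page number or a few stray characters as native text, which is the case the two thresholds are meant to catch.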

Container/Linux distribution

A sample Dockerfile is included in the repository to demonstrate how to create a container image for Linux environments. See: https://bitbucket.org/robbiblubber/robbiblubber.documents.pdf.net/src/master/Robbiblubber.Documents.Pdf.Cli/Dockerfile

Example

using Robbiblubber.Documents.Pdf;

// Configure the OCR languages and the table handling mode.
PdfTextExtractor.Options options = new()
{
    Languages = "eng+deu",
    Mode = PdfTextExtractor.Mode.STRUCTURED_TABLES
};

// The using declaration disposes the extractor when it goes out of scope.
using PdfTextExtractor extractor = new(options);

// Extract the text of the document.
PdfExtractionResult result = extractor.Extract("example.pdf");

Console.WriteLine(result.Text);

// Report pages whose OCR confidence was low.
foreach(PageExtractionResult page in result.LowConfidencePages())
{
    Console.WriteLine($"Warning: Page {page.PageNumber} has low OCR confidence.");
}

Target frameworks

net8.0 and net10.0 are compatible. net9.0 and the platform-specific variants (android, browser, ios, maccatalyst, macos, tvos, windows) of net8.0, net9.0, and net10.0 were computed.

Version Downloads Last Updated
1.0.4 99 6/15/2026