OxidizePdf.NET 0.3.0


OxidizePdf.NET


.NET bindings for oxidize-pdf - Fast, memory-safe PDF text extraction optimized for RAG/LLM pipelines with intelligent chunking.

Features

  • 🚀 High Performance - Native Rust speed (3,000-4,000 pages/second)
  • 🧠 AI/RAG Optimized - Intelligent text chunking with sentence boundaries
  • 🛡️ Memory Safe - Zero-copy FFI with automatic resource management
  • 🌍 Cross-Platform - Linux, Windows, macOS (x64)
  • 📦 Zero Dependencies - Self-contained native binaries in NuGet package
  • 🔍 Metadata Rich - Page numbers, confidence scores, bounding boxes

Installation

dotnet add package OxidizePdf.NET

Quick Start

Basic Text Extraction

using OxidizePdf.NET;

// Extract all text from a PDF
using var extractor = new PdfExtractor();
byte[] pdfBytes = await File.ReadAllBytesAsync("document.pdf");

string text = await extractor.ExtractTextAsync(pdfBytes);
Console.WriteLine(text);

AI/RAG Integration with KernelMemory

using OxidizePdf.NET;
using Microsoft.KernelMemory;

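using var extractor = new PdfExtractor(); // PdfExtractor is IDisposable
var memory = new KernelMemoryBuilder().Build();
byte[] pdfBytes = await File.ReadAllBytesAsync("report.pdf"); // PDF to index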

// Extract chunks optimized for embeddings
var chunks = await extractor.ExtractChunksAsync(
    pdfBytes,
    new ChunkOptions
    {
        MaxChunkSize = 512,                // Token limit for embedding model
        Overlap = 50,                      // Context overlap between chunks
        PreserveSentenceBoundaries = true, // No mid-sentence cuts
        IncludeMetadata = true             // Page numbers, confidence scores
    }
);

// Store in vector database
foreach (var chunk in chunks)
{
    await memory.ImportTextAsync(
        text: chunk.Text,
        documentId: $"doc_{chunk.PageNumber}_{chunk.Index}",
        // Kernel Memory tags are a TagCollection of string values
        tags: new TagCollection
        {
            { "source", "SharePoint/Documents/report.pdf" },
            { "page", chunk.PageNumber.ToString() },
            { "confidence", chunk.Confidence.ToString("F2") }
        }
    );
}

SharePoint Crawler Example

using OxidizePdf.NET;
using Microsoft.Graph;

// Uses the Microsoft.Graph SDK v4 fluent style (.Request() was removed in v5)
using var extractor = new PdfExtractor();
var graphClient = new GraphServiceClient(...);

// Crawl SharePoint document library
var driveItems = await graphClient.Sites["root"]
    .Drives["Documents"]
    .Root
    .Children
    .Request()
    .Filter("endsWith(name,'.pdf')")
    .GetAsync();

// Note: this iterates only the first page of results
foreach (var item in driveItems)
{
    await using var stream = await graphClient.Sites["root"]
        .Drives["Documents"]
        .Items[item.Id]
        .Content
        .Request()
        .GetAsync();

    using var ms = new MemoryStream();
    await stream.CopyToAsync(ms);

    var chunks = await extractor.ExtractChunksAsync(ms.ToArray());

    // Process chunks for embeddings...
}

Performance

Based on oxidize-pdf v1.6.4 benchmarks:

  • Text Extraction: 3,000-4,000 pages/second
  • Chunking: 0.62ms for 100 pages
  • Memory Overhead: <1MB per document
  • PDF Parsing: 98.8% success rate on 759 real-world PDFs
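
To put these figures in context, the sketch below converts a pages-per-second rate into an expected time per document and adds a small stopwatch helper for checking throughput on your own files. It is not part of the package: `Throughput`, `MillisecondsFor`, and `MeasurePagesPerSecondAsync` are hypothetical names, and the delegate stands in for whichever extraction call you want to time.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper, not part of OxidizePdf.NET.
public static class Throughput
{
    // Expected wall-clock time in milliseconds for `pages` pages at a given rate.
    public static double MillisecondsFor(int pages, double pagesPerSecond)
        => pages / pagesPerSecond * 1000.0;

    // Times one invocation of `extract` and reports pages per second.
    // `extract` is a stand-in for any extraction call; a single run is noisy,
    // so repeat and average for anything resembling a benchmark.
    public static async Task<double> MeasurePagesPerSecondAsync(int pages, Func<Task> extract)
    {
        var stopwatch = Stopwatch.StartNew();
        await extract();
        stopwatch.Stop();
        return pages / stopwatch.Elapsed.TotalSeconds;
    }
}
```

At the quoted 3,000-4,000 pages/second, a 100-page document works out to roughly 25-33 ms of extraction time, so the 0.62 ms chunking step is a small fraction of the total.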

Supported Platforms

Platform       Runtime Identifier   Status
Linux x64      linux-x64            ✅ Supported
Windows x64    win-x64              ✅ Supported
macOS x64      osx-x64              ✅ Supported

Native binaries are automatically included in the NuGet package.
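
Because only x64 builds are listed, a process running on another architecture (for example Arm64) would presumably not find a matching native binary. A minimal startup guard, using only `System.Runtime.InteropServices` from the base class library, might look like the following. `PlatformGuard` and its methods are hypothetical helpers, not part of OxidizePdf.NET.

```csharp
#nullable enable
using System.Runtime.InteropServices;

// Hypothetical helper, not part of OxidizePdf.NET.
public static class PlatformGuard
{
    // Maps an OS/architecture pair to a runtime identifier from the table,
    // or returns null when the pair is not listed as supported.
    public static string? MatchSupportedRid(OSPlatform os, Architecture arch)
    {
        if (arch != Architecture.X64) return null;
        if (os == OSPlatform.Linux) return "linux-x64";
        if (os == OSPlatform.Windows) return "win-x64";
        if (os == OSPlatform.OSX) return "osx-x64";
        return null;
    }

    // Checks the current process against the supported list.
    public static string? CurrentSupportedRid()
    {
        foreach (var os in new[] { OSPlatform.Linux, OSPlatform.Windows, OSPlatform.OSX })
        {
            if (RuntimeInformation.IsOSPlatform(os))
                return MatchSupportedRid(os, RuntimeInformation.ProcessArchitecture);
        }
        return null;
    }
}
```

Calling `PlatformGuard.CurrentSupportedRid()` before creating a `PdfExtractor` lets an application fail with a clear message instead of a native library load error.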

Architecture

  • native/ - Rust FFI layer (cdylib)
  • dotnet/ - C# wrapper with P/Invoke
  • examples/ - Integration examples (KernelMemory, BasicUsage)

See ARCHITECTURE.md for detailed design decisions.

API Reference

PdfExtractor

public class PdfExtractor : IDisposable
{
    // Extract plain text from PDF
    public Task<string> ExtractTextAsync(byte[] pdfBytes);

    // Extract text chunks optimized for RAG/LLM
    public Task<DocumentChunks> ExtractChunksAsync(
        byte[] pdfBytes,
        ChunkOptions options = null
    );

    // Extract metadata (page count, title, author)
    public Task<PdfMetadata> ExtractMetadataAsync(byte[] pdfBytes);
}

ChunkOptions

public class ChunkOptions
{
    public int MaxChunkSize { get; set; } = 512;          // Max tokens per chunk
    public int Overlap { get; set; } = 50;                // Overlap between chunks
    public bool PreserveSentenceBoundaries { get; set; } = true;
    public bool IncludeMetadata { get; set; } = true;
}
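
To illustrate how a size limit, an overlap, and sentence boundaries interact, here is a simplified chunker in plain C#. It is an illustration only and makes no claim about how oxidize-pdf implements chunking: `SentenceChunker` is a hypothetical name, sizes are counted in whitespace-separated words rather than tokens, and sentences are split with a naive punctuation rule.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative sketch, not the library's algorithm.
public static class SentenceChunker
{
    // Counts whitespace-separated words; a rough stand-in for tokens.
    private static int CountWords(string s) => Regex.Matches(s, @"\S+").Count;

    public static List<string> Chunk(string text, int maxWords, int overlapWords)
    {
        // Split on whitespace that follows sentence-ending punctuation.
        var sentences = Regex.Split(text.Trim(), @"(?<=[.!?])\s+")
                             .Where(s => s.Length > 0);
        var chunks = new List<string>();
        var current = new List<string>(); // sentences in the chunk being built
        int currentWords = 0;

        foreach (var sentence in sentences)
        {
            int words = CountWords(sentence);
            if (current.Count > 0 && currentWords + words > maxWords)
            {
                chunks.Add(string.Join(" ", current));

                // Seed the next chunk with trailing sentences that fit the overlap budget.
                var carry = new List<string>();
                int carryWords = 0;
                for (int i = current.Count - 1; i >= 0; i--)
                {
                    int w = CountWords(current[i]);
                    if (carryWords + w > overlapWords) break;
                    carry.Insert(0, current[i]);
                    carryWords += w;
                }

                // Drop the overlap when it would leave no room for the new sentence.
                if (carryWords + words > maxWords) { carry.Clear(); carryWords = 0; }
                current = carry;
                currentWords = carryWords;
            }
            // A sentence longer than maxWords still becomes its own chunk.
            current.Add(sentence);
            currentWords += words;
        }

        if (current.Count > 0) chunks.Add(string.Join(" ", current));
        return chunks;
    }
}
```

For the text "One two three. Four five six. Seven eight nine. Ten eleven twelve." with `maxWords` 6 and `overlapWords` 3, this yields three chunks, each sharing one full sentence with its neighbour and none cut mid-sentence.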

DocumentChunk

public class DocumentChunk
{
    public int Index { get; set; }             // Chunk index in document
    public int PageNumber { get; set; }        // Source page number
    public string Text { get; set; }           // Chunk text content
    public double Confidence { get; set; }     // Extraction confidence (0.0-1.0)
    public BoundingBox BoundingBox { get; set; } // Optional spatial info
}
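
Since each chunk carries a page number and a confidence score, a consumer can discard unreliable chunks before embedding them. The sketch below shows one way to do that with LINQ. `ChunkInfo` is a hypothetical stand-in that mirrors a few of the documented `DocumentChunk` properties so the example compiles on its own, and the confidence threshold is an assumption to tune for your documents.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in mirroring some documented DocumentChunk properties.
public sealed record ChunkInfo(int Index, int PageNumber, string Text, double Confidence);

// Hypothetical helper, not part of OxidizePdf.NET.
public static class ChunkFilter
{
    // Keeps non-empty chunks at or above minConfidence, grouped by page
    // and ordered by chunk index within each page.
    public static IReadOnlyDictionary<int, List<ChunkInfo>> ReliableByPage(
        IEnumerable<ChunkInfo> chunks, double minConfidence)
        => chunks
            .Where(c => c.Confidence >= minConfidence && !string.IsNullOrWhiteSpace(c.Text))
            .GroupBy(c => c.PageNumber)
            .ToDictionary(g => g.Key, g => g.OrderBy(c => c.Index).ToList());
}
```

With the real types, the same query could be applied to the chunks returned by `ExtractChunksAsync`, assuming the result can be enumerated as it is in the Quick Start loop.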

Requirements

  • .NET 8.0+ (tested on .NET 8, 9)
  • Native Runtime: Automatically included in NuGet package

Note: .NET 6 support was dropped in v0.2.0 as it reached end-of-support in November 2024. Use v0.1.0 if you still require .NET 6 compatibility.

Building from Source

# Clone repository
git clone https://github.com/bzsanti/oxidize-pdf-dotnet.git
cd oxidize-pdf-dotnet

# Build native library
cd native
cargo build --release

# Build .NET wrapper
cd ../dotnet
dotnet build

# Run tests
dotnet test

Examples

See the examples/ directory:

  • BasicUsage/ - Simple text extraction
  • KernelMemory/ - Full SharePoint crawler with RAG pipeline

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0); see the LICENSE file.

This is consistent with the underlying oxidize-pdf library, which is also licensed under AGPL-3.0.

Key Points:

  • ✅ Free for open-source projects
  • ✅ Commercial use allowed (must share modifications)
  • ⚠️ Network use = distribution (must share source)
  • ⚠️ If you use this in a web service, you must make your code public

For commercial licensing or questions, contact: licensing@belowzero.tech

Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.

Acknowledgments

Built on top of oxidize-pdf by Santiago Fernández Muñoz.


Version   Downloads   Last Updated
0.6.0     85          3/21/2026
0.5.0     87          3/18/2026
0.4.0     111         3/15/2026
0.3.1     91          3/9/2026
0.3.0     83          3/6/2026
0.2.2     447         12/10/2025
0.2.1     443         12/8/2025
0.2.0     222         11/4/2025
0.1.0     208         11/4/2025