OxidizePdf.NET 0.6.0

OxidizePdf.NET

.NET bindings for oxidize-pdf - Fast, memory-safe PDF text extraction optimized for RAG/LLM pipelines with intelligent chunking.

Features

  • 🚀 High Performance - Native Rust speed (3,000-4,000 pages/second)
  • 🧠 AI/RAG Optimized - Intelligent text chunking with sentence boundaries
  • 🛡️ Memory Safe - Zero-copy FFI with automatic resource management
  • 🌍 Cross-Platform - Linux, Windows, macOS (x64)
  • 📦 Zero Dependencies - Self-contained native binaries in NuGet package
  • 🔍 Metadata Rich - Page numbers, confidence scores, bounding boxes

Installation

dotnet add package OxidizePdf.NET

Quick Start

Basic Text Extraction

using System;
using System.IO;
using OxidizePdf.NET;

// Extract all text from a PDF
using var extractor = new PdfExtractor();
byte[] pdfBytes = await File.ReadAllBytesAsync("document.pdf");

string text = await extractor.ExtractTextAsync(pdfBytes);
Console.WriteLine(text);

AI/RAG Integration with KernelMemory

using System.IO;
using OxidizePdf.NET;
using Microsoft.KernelMemory;

// PdfExtractor is IDisposable, so release the native handle when done
using var extractor = new PdfExtractor();
var memory = new KernelMemoryBuilder().Build();

byte[] pdfBytes = await File.ReadAllBytesAsync("report.pdf");

// Extract chunks optimized for embeddings
var chunks = await extractor.ExtractChunksAsync(
    pdfBytes,
    new ChunkOptions
    {
        MaxChunkSize = 512,                // Token limit for embedding model
        Overlap = 50,                      // Context overlap between chunks
        PreserveSentenceBoundaries = true, // No mid-sentence cuts
        IncludeMetadata = true             // Page numbers, confidence scores
    }
);

// Store in vector database
foreach (var chunk in chunks)
{
    await memory.ImportTextAsync(
        text: chunk.Text,
        documentId: $"doc_{chunk.PageNumber}_{chunk.Index}",
        // Kernel Memory tags are string-valued (TagCollection)
        tags: new TagCollection
        {
            { "source", "SharePoint/Documents/report.pdf" },
            { "page", chunk.PageNumber.ToString() },
            { "confidence", chunk.Confidence.ToString("F2") }
        }
    );
}

SharePoint Crawler Example

using System.IO;
using OxidizePdf.NET;
using Microsoft.Graph; // Request() builders below are the Microsoft.Graph SDK v4 style

using var extractor = new PdfExtractor();
var graphClient = new GraphServiceClient(...);

// Crawl SharePoint document library
// (first page of results only; follow NextPageRequest for larger libraries)
var driveItems = await graphClient.Sites["root"]
    .Drives["Documents"]
    .Root
    .Children
    .Request()
    .Filter("endsWith(name,'.pdf')")
    .GetAsync();

foreach (var item in driveItems)
{
    // Download the file content and dispose the stream after each item
    await using var stream = await graphClient.Sites["root"]
        .Drives["Documents"]
        .Items[item.Id]
        .Content
        .Request()
        .GetAsync();

    // The extractor takes a byte[], so buffer the stream in memory
    using var ms = new MemoryStream();
    await stream.CopyToAsync(ms);

    var chunks = await extractor.ExtractChunksAsync(ms.ToArray());

    // Process chunks for embeddings...
}

Performance

Based on oxidize-pdf v1.6.4 benchmarks:

  • Text Extraction: 3,000-4,000 pages/second
  • Chunking: 0.62ms for 100 pages
  • Memory Overhead: <1MB per document
  • PDF Parsing: 98.8% success rate on 759 real-world PDFs

Supported Platforms

Platform       Runtime Identifier   Status
Linux x64      linux-x64            ✅ Supported
Windows x64    win-x64              ✅ Supported
macOS x64      osx-x64              ✅ Supported

Native binaries are automatically included in the NuGet package.
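
Because only three runtime identifiers are listed, an application may want to fail fast on anything else before making a native call. The following is a minimal sketch of such a guard, not something the package provides: the PlatformGuard class and its method names are hypothetical, and it relies only on OperatingSystem and RuntimeInformation from the .NET base class library.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of OxidizePdf.NET.
public static class PlatformGuard
{
    // Runtime identifiers listed as supported in this README.
    private static readonly string[] SupportedRids = { "linux-x64", "win-x64", "osx-x64" };

    // Maps the current OS and process architecture to a portable RID such as "linux-x64".
    public static string CurrentPortableRid()
    {
        string os =
            OperatingSystem.IsWindows() ? "win" :
            OperatingSystem.IsLinux() ? "linux" :
            OperatingSystem.IsMacOS() ? "osx" :
            "unknown";
        string arch = RuntimeInformation.ProcessArchitecture.ToString().ToLowerInvariant();
        return $"{os}-{arch}";
    }

    // True when the given RID appears in the supported list (exact match).
    public static bool IsSupported(string rid) =>
        Array.IndexOf(SupportedRids, rid) >= 0;

    // Throws before any native call is attempted on an unlisted platform.
    public static void EnsureSupported()
    {
        string rid = CurrentPortableRid();
        if (!IsSupported(rid))
            throw new PlatformNotSupportedException($"No native binary listed for '{rid}'.");
    }
}
```

Calling PlatformGuard.EnsureSupported() once at startup would turn a late native-library load failure into an explicit error. The check uses the process architecture, so an x64 process running under emulation on an ARM machine reports x64.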

Architecture

  • native/ - Rust FFI layer (cdylib)
  • dotnet/ - C# wrapper with P/Invoke
  • examples/ - Integration examples (KernelMemory, BasicUsage)

See ARCHITECTURE.md for detailed design decisions.

API Reference

PdfExtractor

public class PdfExtractor : IDisposable
{
    // Extract plain text from PDF
    public Task<string> ExtractTextAsync(byte[] pdfBytes);

    // Extract text chunks optimized for RAG/LLM
    public Task<DocumentChunks> ExtractChunksAsync(
        byte[] pdfBytes,
        ChunkOptions options = null
    );

    // Extract metadata (page count, title, author)
    public Task<PdfMetadata> ExtractMetadataAsync(byte[] pdfBytes);
}

ChunkOptions

public class ChunkOptions
{
    public int MaxChunkSize { get; set; } = 512;          // Max tokens per chunk
    public int Overlap { get; set; } = 50;                // Overlap between chunks
    public bool PreserveSentenceBoundaries { get; set; } = true;
    public bool IncludeMetadata { get; set; } = true;
}
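
To make the interaction between MaxChunkSize, Overlap, and PreserveSentenceBoundaries concrete, here is a small illustrative sketch of sentence-aware chunking with overlap. It is not the library's algorithm (that lives in the native Rust layer). The SentenceChunker class is hypothetical, and it approximates tokens as whitespace-separated words, which is an assumption made only for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical illustration, not the OxidizePdf.NET implementation.
public static class SentenceChunker
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Packs whole sentences into chunks of at most maxChunkSize words,
    // seeding each new chunk with the last `overlap` words of the previous one.
    // A single sentence longer than maxChunkSize is kept whole (boundaries win).
    public static List<string> Chunk(string text, int maxChunkSize = 512, int overlap = 50)
    {
        if (maxChunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
        if (overlap < 0 || overlap >= maxChunkSize) throw new ArgumentOutOfRangeException(nameof(overlap));

        var chunks = new List<string>();
        var current = new List<string>(); // words in the chunk being built
        int fresh = 0;                    // words in `current` that are not carried-over overlap

        // Split after '.', '!' or '?' followed by whitespace.
        foreach (string sentence in Regex.Split(text.Trim(), @"(?<=[.!?])\s+"))
        {
            string[] words = sentence.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries);
            if (words.Length == 0) continue;

            // Close the current chunk if this sentence would push it past the limit.
            if (fresh > 0 && current.Count + words.Length > maxChunkSize)
            {
                chunks.Add(string.Join(" ", current));
                current = current.Skip(Math.Max(0, current.Count - overlap)).ToList();
                fresh = 0;
            }

            // Shrink the carried-over overlap if the sentence would not fit next to it.
            int room = Math.Max(0, maxChunkSize - words.Length);
            if (fresh == 0 && current.Count > room)
                current.RemoveRange(0, current.Count - room);

            current.AddRange(words);
            fresh += words.Length;
        }

        if (fresh > 0) chunks.Add(string.Join(" ", current));
        return chunks;
    }
}
```

For example, SentenceChunker.Chunk("One two three. Four five six. Seven eight nine.", maxChunkSize: 6, overlap: 2) yields two chunks: "One two three. Four five six." and "five six. Seven eight nine.". The second chunk repeats the last two words of the first, which is the context overlap the Overlap option describes.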

DocumentChunk

public class DocumentChunk
{
    public int Index { get; set; }             // Chunk index in document
    public int PageNumber { get; set; }        // Source page number
    public string Text { get; set; }           // Chunk text content
    public double Confidence { get; set; }     // Extraction confidence (0.0-1.0)
    public BoundingBox BoundingBox { get; set; } // Optional spatial info
}
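
The Confidence and PageNumber fields lend themselves to filtering before text is sent to an embedding model. A minimal sketch under stated assumptions: ChunkInfo is a local stand-in that mirrors the documented properties so the example compiles on its own, ChunkFilters is a hypothetical helper, and the 0.8 threshold is an arbitrary illustration rather than a recommended value.

```csharp
using System.Collections.Generic;
using System.Linq;

// Stand-in mirroring the documented DocumentChunk properties (assumption for this sketch).
public record ChunkInfo(int Index, int PageNumber, string Text, double Confidence);

// Hypothetical helper, not part of OxidizePdf.NET.
public static class ChunkFilters
{
    // Keeps chunks at or above the threshold and joins their text per page, in chunk order.
    public static Dictionary<int, string> TextByPage(
        IEnumerable<ChunkInfo> chunks, double minConfidence = 0.8)
    {
        return chunks
            .Where(c => c.Confidence >= minConfidence)
            .OrderBy(c => c.Index)
            .GroupBy(c => c.PageNumber)
            .ToDictionary(g => g.Key, g => string.Join(" ", g.Select(c => c.Text)));
    }
}
```

With real extraction results, the same LINQ pipeline could be applied to the returned chunks directly, assuming they expose the properties listed above.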

Requirements

  • .NET 8.0+ (tested on .NET 8 and 9; the 0.6.0 package also ships a net10.0 target)
  • Native Runtime: Automatically included in NuGet package

Note: .NET 6 support was dropped in v0.2.0 as it reached end-of-support in November 2024. Use v0.1.0 if you still require .NET 6 compatibility.

Building from Source

# Clone repository
git clone https://github.com/bzsanti/oxidize-pdf-dotnet.git
cd oxidize-pdf-dotnet

# Build native library
cd native
cargo build --release

# Build .NET wrapper
cd ../dotnet
dotnet build

# Run tests
dotnet test

Examples

See examples/ directory:

  • BasicUsage/ - Simple text extraction
  • KernelMemory/ - Full SharePoint crawler with RAG pipeline

License

This project is licensed under the MIT License - see LICENSE file.

Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.

Acknowledgments

Built on top of oxidize-pdf by Santiago Fernández Muñoz.

Version   Downloads   Last Updated
0.6.0     85          3/21/2026
0.5.0     87          3/18/2026
0.4.0     111         3/15/2026
0.3.1     90          3/9/2026
0.3.0     83          3/6/2026
0.2.2     447         12/10/2025
0.2.1     443         12/8/2025
0.2.0     222         11/4/2025
0.1.0     208         11/4/2025