OxidizePdf.NET 0.3.0

OxidizePdf.NET

.NET bindings for oxidize-pdf - Fast, memory-safe PDF text extraction optimized for RAG/LLM pipelines with intelligent chunking.

Features

🚀 High Performance - Native Rust speed (3,000-4,000 pages/second)
🧠 AI/RAG Optimized - Intelligent text chunking with sentence boundaries
🛡️ Memory Safe - Zero-copy FFI with automatic resource management
🌍 Cross-Platform - Linux, Windows, macOS (x64)
📦 Zero Native Dependencies - Self-contained native binaries in the NuGet package (the only managed dependency is System.Text.Json)
🔍 Metadata Rich - Page numbers, confidence scores, bounding boxes

Installation

dotnet add package OxidizePdf.NET

Quick Start

Basic Text Extraction

using OxidizePdf.NET;

// Extract all text from PDF
using var extractor = new PdfExtractor();
byte[] pdfBytes = File.ReadAllBytes("document.pdf");

string text = await extractor.ExtractTextAsync(pdfBytes);
Console.WriteLine(text);
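
The same call extends naturally to a folder of files. The following is a minimal sketch, assuming a hypothetical ./pdfs input folder and that one PdfExtractor instance can be reused across documents (as the SharePoint example in this README does). The package's exception types are not documented here, so the sketch catches broadly.

```csharp
using System;
using System.IO;
using OxidizePdf.NET;

// Hypothetical input folder; pass a path as the first argument to override it.
string inputDir = args.Length > 0 ? args[0] : "./pdfs";

using var extractor = new PdfExtractor();

foreach (string pdfPath in Directory.EnumerateFiles(inputDir, "*.pdf"))
{
    try
    {
        byte[] pdfBytes = await File.ReadAllBytesAsync(pdfPath);
        string text = await extractor.ExtractTextAsync(pdfBytes);

        // Write the extracted text next to the source file as <name>.txt
        string txtPath = Path.ChangeExtension(pdfPath, ".txt");
        await File.WriteAllTextAsync(txtPath, text);
        Console.WriteLine($"{Path.GetFileName(pdfPath)}: {text.Length} characters");
    }
    catch (Exception ex)
    {
        // Exception types are an unknown here, so report the failure and continue.
        Console.Error.WriteLine($"{Path.GetFileName(pdfPath)}: failed ({ex.Message})");
    }
}
```

One failing PDF does not stop the run; the remaining files are still processed.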

AI/RAG Integration with KernelMemory

using OxidizePdf.NET;
using Microsoft.KernelMemory;

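using var extractor = new PdfExtractor();
var memory = new KernelMemoryBuilder().Build(); // Configure embedding and storage providers for your setup

byte[] pdfBytes = File.ReadAllBytes("report.pdf");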

// Extract chunks optimized for embeddings
var chunks = await extractor.ExtractChunksAsync(
    pdfBytes,
    new ChunkOptions
    {
        MaxChunkSize = 512,                // Token limit for embedding model
        Overlap = 50,                      // Context overlap between chunks
        PreserveSentenceBoundaries = true, // No mid-sentence cuts
        IncludeMetadata = true             // Page numbers, confidence scores
    }
);

// Store in vector database
foreach (var chunk in chunks)
{
    await memory.ImportTextAsync(
        text: chunk.Text,
        documentId: $"doc_{chunk.PageNumber}_{chunk.Index}",
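        // Kernel Memory tags are string key/value pairs (TagCollection)
        tags: new TagCollection
        {
            { "source", "SharePoint/Documents/report.pdf" },
            { "page", chunk.PageNumber.ToString() },
            { "confidence", chunk.Confidence.ToString("F2", System.Globalization.CultureInfo.InvariantCulture) }
        }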
    );
}

SharePoint Crawler Example

using OxidizePdf.NET;
using Microsoft.Graph;

using var extractor = new PdfExtractor();
var graphClient = new GraphServiceClient(...); // Microsoft.Graph SDK v4 request style (.Request())

// Crawl SharePoint document library
var driveItems = await graphClient.Sites["root"]
    .Drives["Documents"]
    .Root
    .Children
    .Request()
    .Filter("endsWith(name,'.pdf')")
    .GetAsync();

foreach (var item in driveItems)
{
    using var stream = await graphClient.Sites["root"]
        .Drives["Documents"]
        .Items[item.Id]
        .Content
        .Request()
        .GetAsync();

    using var ms = new MemoryStream();
    await stream.CopyToAsync(ms);

    var chunks = await extractor.ExtractChunksAsync(ms.ToArray());

    // Process chunks for embeddings...
}

Performance

Based on oxidize-pdf v1.6.4 benchmarks:

Text Extraction: 3,000-4,000 pages/second
Chunking: 0.62ms for 100 pages
Memory Overhead: <1MB per document
PDF Parsing: 98.8% success rate on 759 real-world PDFs

Supported Platforms

Platform	Runtime Identifier	Status
Linux x64	`linux-x64`	✅ Supported
Windows x64	`win-x64`	✅ Supported
macOS x64	`osx-x64`	✅ Supported

Native binaries are automatically included in the NuGet package.
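
Because only the three x64 runtime identifiers above are listed, an application may want to fail fast on other platforms before the native library is first loaded. A minimal sketch using only standard .NET APIs follows; PlatformCheck is a hypothetical helper and not part of OxidizePdf.NET.

```csharp
using System;
using System.Runtime.InteropServices;

bool supported = PlatformCheck.IsSupported(
    OperatingSystem.IsLinux(),
    OperatingSystem.IsWindows(),
    OperatingSystem.IsMacOS(),
    RuntimeInformation.ProcessArchitecture);

Console.WriteLine(supported
    ? "Platform matches one of the listed runtime identifiers."
    : $"Platform not listed as supported: {RuntimeInformation.RuntimeIdentifier}");

// Hypothetical helper mirroring the table: linux-x64, win-x64, osx-x64.
static class PlatformCheck
{
    public static bool IsSupported(bool isLinux, bool isWindows, bool isMacOS, Architecture arch) =>
        arch == Architecture.X64 && (isLinux || isWindows || isMacOS);
}
```

The check uses the process architecture rather than the OS architecture, since the native binary is loaded into the running process.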

Architecture

native/ - Rust FFI layer (cdylib)
dotnet/ - C# wrapper with P/Invoke
examples/ - Integration examples (KernelMemory, BasicUsage)

See ARCHITECTURE.md for detailed design decisions.

API Reference

PdfExtractor

public class PdfExtractor : IDisposable
{
    // Extract plain text from PDF
    public Task<string> ExtractTextAsync(byte[] pdfBytes);

    // Extract text chunks optimized for RAG/LLM
    public Task<DocumentChunks> ExtractChunksAsync(
        byte[] pdfBytes,
        ChunkOptions options = null
    );

    // Extract metadata (page count, title, author)
    public Task<PdfMetadata> ExtractMetadataAsync(byte[] pdfBytes);
}

ChunkOptions

public class ChunkOptions
{
    public int MaxChunkSize { get; set; } = 512;          // Max tokens per chunk
    public int Overlap { get; set; } = 50;                // Overlap between chunks
    public bool PreserveSentenceBoundaries { get; set; } = true;
    public bool IncludeMetadata { get; set; } = true;
}
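
To make the interplay of MaxChunkSize, Overlap, and PreserveSentenceBoundaries concrete, here is an illustrative sketch of sentence-aware chunking with overlap. It is not the library's implementation: SketchChunker is a hypothetical name, "tokens" are approximated by whitespace-separated words, and sentences are split naively on ., !, and ?.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative only: NOT how oxidize-pdf chunks text internally.
var text = "One two three. Four five six. Seven eight nine. Ten eleven twelve.";
foreach (var chunk in SketchChunker.Chunk(text, maxChunkSize: 6, overlap: 3))
    Console.WriteLine(chunk);

static class SketchChunker
{
    public static List<string> Chunk(string text, int maxChunkSize, int overlap)
    {
        // Split after sentence-ending punctuation so chunks never cut mid-sentence.
        var sentences = Regex.Split(text.Trim(), @"(?<=[.!?])\s+")
                             .Where(s => s.Length > 0)
                             .ToList();

        var chunks = new List<string>();
        var current = new List<string>(); // sentences in the chunk being built
        var currentWords = 0;

        foreach (var sentence in sentences)
        {
            var words = WordCount(sentence);
            if (current.Count > 0 && currentWords + words > maxChunkSize)
            {
                chunks.Add(string.Join(" ", current));

                // Carry trailing sentences forward while they fit the overlap budget.
                var carried = new List<string>();
                var carriedWords = 0;
                for (var i = current.Count - 1; i >= 0; i--)
                {
                    var w = WordCount(current[i]);
                    if (carriedWords + w > overlap) break;
                    carried.Insert(0, current[i]);
                    carriedWords += w;
                }
                current = carried;
                currentWords = carriedWords;
            }
            current.Add(sentence);
            currentWords += words;
        }

        if (current.Count > 0) chunks.Add(string.Join(" ", current));
        return chunks;
    }

    // Approximates a token count by counting whitespace-separated words.
    private static int WordCount(string s) =>
        s.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries).Length;
}
```

With the sample input this prints three chunks of two sentences each, and each chunk repeats the last sentence of the previous one. A single sentence longer than the limit still becomes its own oversized chunk in this sketch; a real chunker has to decide how to handle that case.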

DocumentChunk

public class DocumentChunk
{
    public int Index { get; set; }             // Chunk index in document
    public int PageNumber { get; set; }        // Source page number
    public string Text { get; set; }           // Chunk text content
    public double Confidence { get; set; }     // Extraction confidence (0.0-1.0)
    public BoundingBox BoundingBox { get; set; } // Optional spatial info
}
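
The per-chunk metadata makes it straightforward to drop low-quality text before embedding. The sketch below builds DocumentChunk objects by hand as stand-in data, which assumes the class has a public parameterless constructor as the listing above suggests; the 0.8 threshold is an arbitrary example value, not a recommendation from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using OxidizePdf.NET;

// Stand-in data; in practice these would come from ExtractChunksAsync.
var chunks = new List<DocumentChunk>
{
    new() { Index = 0, PageNumber = 1, Text = "Intro paragraph.", Confidence = 0.97 },
    new() { Index = 1, PageNumber = 1, Text = "Gar8led t3xt", Confidence = 0.41 },
    new() { Index = 2, PageNumber = 2, Text = "Results section.", Confidence = 0.88 },
};

const double minConfidence = 0.8; // arbitrary threshold for illustration

// Keep confident chunks only, then group them by source page.
var byPage = chunks
    .Where(c => c.Confidence >= minConfidence)
    .GroupBy(c => c.PageNumber)
    .OrderBy(g => g.Key);

foreach (var page in byPage)
{
    Console.WriteLine($"Page {page.Key}: {page.Count()} chunk(s)");
    foreach (var c in page.OrderBy(c => c.Index))
        Console.WriteLine($"  [{c.Index}] {c.Text}");
}
```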

Requirements

.NET 8.0+ (tested on .NET 8, 9)
Native Runtime: Automatically included in NuGet package

Note: .NET 6 support was dropped in v0.2.0 as it reached end-of-support in November 2024. Use v0.1.0 if you still require .NET 6 compatibility.

Building from Source

# Clone repository
git clone https://github.com/bzsanti/oxidize-pdf-dotnet.git
cd oxidize-pdf-dotnet

# Build native library
cd native
cargo build --release

# Build .NET wrapper
cd ../dotnet
dotnet build

# Run tests
dotnet test

Examples

See the examples/ directory:

BasicUsage/ - Simple text extraction
KernelMemory/ - Full SharePoint crawler with RAG pipeline

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0); see the LICENSE file.

This is consistent with the underlying oxidize-pdf library, which is also licensed under AGPL-3.0.

Key Points:

✅ Free for open-source projects
✅ Commercial use allowed (must share modifications)
⚠️ Network use = distribution (must share source)
⚠️ If you use this in a web service, you must make your code public

For commercial licensing or questions, contact: licensing@belowzero.tech

Contributing

Contributions are welcome! Please read CONTRIBUTING.md for guidelines.

Acknowledgments

Built on top of oxidize-pdf by Santiago Fernández Muñoz.

Target frameworks

Compatible (included in package): net8.0, net9.0
Computed (additional): net10.0, plus the android, browser, ios, maccatalyst, macos, tvos, and windows variants of net8.0, net9.0, and net10.0

Dependencies

net8.0
- System.Text.Json (>= 8.0.5)
net9.0
- System.Text.Json (>= 8.0.5)

Version	Downloads	Last Updated
0.6.0	85	3/21/2026
0.5.0	87	3/18/2026
0.4.0	111	3/15/2026
0.3.1	91	3/9/2026
0.3.0	83	3/6/2026
0.2.2	447	12/10/2025
0.2.1	443	12/8/2025
0.2.0	222	11/4/2025
0.1.0	208	11/4/2025