OxidizePdf.NET
0.3.0
.NET bindings for oxidize-pdf: fast, memory-safe PDF text extraction optimized for RAG/LLM pipelines, with intelligent chunking.
Features
- 🚀 High Performance - Native Rust speed (3,000-4,000 pages/second)
- 🧠 AI/RAG Optimized - Intelligent text chunking with sentence boundaries
- 🛡️ Memory Safe - Zero-copy FFI with automatic resource management
- 🌍 Cross-Platform - Linux, Windows, macOS (x64)
- 📦 Zero Dependencies - Self-contained native binaries in NuGet package
- 🔍 Metadata Rich - Page numbers, confidence scores, bounding boxes
Installation
dotnet add package OxidizePdf.NET
Quick Start
Basic Text Extraction
using OxidizePdf.NET;
// Extract all text from PDF
using var extractor = new PdfExtractor();
byte[] pdfBytes = File.ReadAllBytes("document.pdf");
string text = await extractor.ExtractTextAsync(pdfBytes);
Console.WriteLine(text);
AI/RAG Integration with KernelMemory
using OxidizePdf.NET;
using Microsoft.KernelMemory;

using var extractor = new PdfExtractor();
var memory = new KernelMemoryBuilder().Build();
byte[] pdfBytes = await File.ReadAllBytesAsync("report.pdf");

// Extract chunks optimized for embeddings
var chunks = await extractor.ExtractChunksAsync(
    pdfBytes,
    new ChunkOptions
    {
        MaxChunkSize = 512,                // Token limit for embedding model
        Overlap = 50,                      // Context overlap between chunks
        PreserveSentenceBoundaries = true, // No mid-sentence cuts
        IncludeMetadata = true             // Page numbers, confidence scores
    }
);

// Store in vector database
foreach (var chunk in chunks)
{
    await memory.ImportTextAsync(
        text: chunk.Text,
        documentId: $"doc_{chunk.PageNumber}_{chunk.Index}",
        // Kernel Memory tags are string-valued (TagCollection), so numbers are converted
        tags: new TagCollection
        {
            { "source", "SharePoint/Documents/report.pdf" },
            { "page", chunk.PageNumber.ToString() },
            { "confidence", chunk.Confidence.ToString("F2") }
        }
    );
}
SharePoint Crawler Example
using OxidizePdf.NET;
using Microsoft.Graph;

// Note: the .Request() calls below are Microsoft Graph SDK v4 syntax;
// SDK v5 and later removed .Request() from the request builders.
using var extractor = new PdfExtractor();
var graphClient = new GraphServiceClient(...);

// Crawl SharePoint document library
var driveItems = await graphClient.Sites["root"]
    .Drives["Documents"]
    .Root
    .Children
    .Request()
    .Filter("endsWith(name,'.pdf')")
    .GetAsync();

foreach (var item in driveItems)
{
    // Dispose the download stream once it has been copied
    using var stream = await graphClient.Sites["root"]
        .Drives["Documents"]
        .Items[item.Id]
        .Content
        .Request()
        .GetAsync();

    using var ms = new MemoryStream();
    await stream.CopyToAsync(ms);

    var chunks = await extractor.ExtractChunksAsync(ms.ToArray());
    // Process chunks for embeddings...
}
Performance
Based on oxidize-pdf v1.6.4 benchmarks:
- Text Extraction: 3,000-4,000 pages/second
- Chunking: 0.62 ms for 100 pages
- Memory Overhead: < 1 MB per document
- PDF Parsing: 98.8% success rate on 759 real-world PDFs
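To put these figures in context, here is a back-of-the-envelope estimate for a larger corpus. It is only a sketch: the corpus size is a made-up value, and it assumes throughput scales linearly with page count, which real workloads (I/O, large scanned files, concurrency) may not honor.

```csharp
using System;

// Rates taken from the benchmark figures quoted above.
const double PagesPerSecondLow = 3000;      // lower bound of extraction rate
const double PagesPerSecondHigh = 4000;     // upper bound of extraction rate
const double ChunkingMsPer100Pages = 0.62;  // chunking cost per 100 pages

int corpusPages = 1_000_000; // hypothetical corpus size

// time = pages / rate
double slowestSeconds = corpusPages / PagesPerSecondLow;
double fastestSeconds = corpusPages / PagesPerSecondHigh;

// (pages / 100) batches * 0.62 ms per batch, converted to seconds
double chunkingSeconds = corpusPages / 100.0 * ChunkingMsPer100Pages / 1000.0;

Console.WriteLine($"Extraction: {fastestSeconds:F0} to {slowestSeconds:F0} s");
Console.WriteLine($"Chunking: {chunkingSeconds:F1} s");
```

Under these assumptions, a million pages would take roughly 250 to 333 seconds to extract and about 6.2 seconds to chunk, so extraction, not chunking, dominates the budget.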
Supported Platforms
| Platform | Runtime Identifier | Status |
|---|---|---|
| Linux x64 | linux-x64 | ✅ Supported |
| Windows x64 | win-x64 | ✅ Supported |
| macOS x64 | osx-x64 | ✅ Supported |
Native binaries are automatically included in the NuGet package.
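Since only x64 builds are listed, an application may want to check the current process before calling into the library. The following is a minimal sketch using standard `System.Runtime.InteropServices` APIs; the helper name `IsSupportedPlatform` is hypothetical and not part of OxidizePdf.NET.

```csharp
using System;
using System.Runtime.InteropServices;

// Returns true when the current process matches one of the three
// platform/architecture pairs listed in the table above.
bool IsSupportedPlatform()
{
    bool knownOs =
        RuntimeInformation.IsOSPlatform(OSPlatform.Linux) ||
        RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ||
        RuntimeInformation.IsOSPlatform(OSPlatform.OSX);

    // ProcessArchitecture (not OSArchitecture) decides which native binary
    // can be loaded: an x64 process on Arm64 hardware still needs x64 binaries.
    return knownOs && RuntimeInformation.ProcessArchitecture == Architecture.X64;
}

if (!IsSupportedPlatform())
{
    Console.Error.WriteLine(
        $"No bundled native binary is listed for {RuntimeInformation.RuntimeIdentifier}.");
}
```

Failing early with a clear message is usually easier to diagnose than a `DllNotFoundException` raised from the first extraction call.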
Architecture
- native/ - Rust FFI layer (cdylib)
- dotnet/ - C# wrapper with P/Invoke
- examples/ - Integration examples (KernelMemory, BasicUsage)
See ARCHITECTURE.md for detailed design decisions.
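As an illustration of what a P/Invoke layer of this kind typically does, the sketch below hands a managed `byte[]` to "native" code as a pointer plus length, without copying, by pinning the array. This is a general .NET technique and not a description of the library's internals: the exported symbol names of the Rust layer are not given here, so a managed stand-in (`CountPdfHeaderBytes`, a hypothetical name) plays the role of the native function.

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-in for a native export shaped like (const uint8_t* data, size_t len).
// A real wrapper would declare such a function with [DllImport] instead.
int CountPdfHeaderBytes(IntPtr data, int length)
{
    byte[] magic = { 0x25, 0x50, 0x44, 0x46 }; // "%PDF"
    int matched = 0;
    while (matched < magic.Length && matched < length &&
           Marshal.ReadByte(data, matched) == magic[matched])
    {
        matched++;
    }
    return matched;
}

int CallWithPinnedBuffer(byte[] buffer)
{
    // Pinning stops the GC from moving the array while native code reads it,
    // so the pointer can be passed without copying the bytes first.
    GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
    try
    {
        return CountPdfHeaderBytes(handle.AddrOfPinnedObject(), buffer.Length);
    }
    finally
    {
        handle.Free(); // always unpin, even if the call throws
    }
}

byte[] sample = { 0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E, 0x37 }; // "%PDF-1.7"
Console.WriteLine(CallWithPinnedBuffer(sample)); // number of leading bytes matching "%PDF"
```

The `try`/`finally` around the call mirrors the resource-management concern the wrapper has to handle: whatever is pinned or allocated for the native side must be released on every path.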
API Reference
PdfExtractor
public class PdfExtractor : IDisposable
{
    // Extract plain text from PDF
    public Task<string> ExtractTextAsync(byte[] pdfBytes);

    // Extract text chunks optimized for RAG/LLM
    public Task<DocumentChunks> ExtractChunksAsync(
        byte[] pdfBytes,
        ChunkOptions options = null
    );

    // Extract metadata (page count, title, author)
    public Task<PdfMetadata> ExtractMetadataAsync(byte[] pdfBytes);
}
ChunkOptions
public class ChunkOptions
{
    public int MaxChunkSize { get; set; } = 512; // Max tokens per chunk
    public int Overlap { get; set; } = 50;       // Overlap between chunks
    public bool PreserveSentenceBoundaries { get; set; } = true;
    public bool IncludeMetadata { get; set; } = true;
}
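To make the interaction between `MaxChunkSize`, `Overlap`, and `PreserveSentenceBoundaries` concrete, here is a small self-contained sketch of sentence-preserving chunking with overlap. It is not the library's algorithm: it counts whitespace-separated words instead of tokens, uses a crude regex to find sentence ends, and the function names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Whitespace-separated words stand in for tokens in this sketch.
int WordCount(string s) =>
    s.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries).Length;

List<string> ChunkBySentence(string text, int maxChunkSize, int overlap)
{
    // Split after '.', '!' or '?' followed by whitespace (crude sentence detection).
    string[] sentences = Regex.Split(text.Trim(), @"(?<=[.!?])\s+");
    var chunks = new List<string>();
    var current = new List<string>();
    int currentWords = 0;

    foreach (string sentence in sentences)
    {
        int words = WordCount(sentence);
        if (words == 0) continue;

        if (current.Count > 0 && currentWords + words > maxChunkSize)
        {
            chunks.Add(string.Join(" ", current));

            // Seed the next chunk with trailing sentences worth at most `overlap` words.
            var carried = new List<string>();
            int carriedWords = 0;
            for (int i = current.Count - 1; i >= 0; i--)
            {
                int w = WordCount(current[i]);
                if (carriedWords + w > overlap) break;
                carried.Insert(0, current[i]);
                carriedWords += w;
            }
            // Drop the overlap if it would push the new chunk over the limit.
            if (carriedWords + words > maxChunkSize) { carried.Clear(); carriedWords = 0; }
            current = carried;
            currentWords = carriedWords;
        }

        // A single sentence longer than maxChunkSize is kept whole, never cut.
        current.Add(sentence);
        currentWords += words;
    }

    if (current.Count > 0) chunks.Add(string.Join(" ", current));
    return chunks;
}

string sampleText = "One two three. Four five six. Seven eight nine. Ten eleven twelve.";
List<string> sampleChunks = ChunkBySentence(sampleText, maxChunkSize: 6, overlap: 3);
foreach (string chunkText in sampleChunks)
{
    Console.WriteLine(chunkText);
}
```

With these inputs the sketch produces three chunks of two sentences each, and every consecutive pair shares one sentence. That shared sentence is the overlap that keeps context across chunk boundaries.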
DocumentChunk
public class DocumentChunk
{
    public int Index { get; set; }               // Chunk index in document
    public int PageNumber { get; set; }          // Source page number
    public string Text { get; set; }             // Chunk text content
    public double Confidence { get; set; }       // Extraction confidence (0.0-1.0)
    public BoundingBox BoundingBox { get; set; } // Optional spatial info
}
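One possible use of the `Confidence` and `PageNumber` fields is to drop low-confidence chunks before embedding them. The sketch below uses tuples that mirror the `DocumentChunk` fields so it runs without the package; the sample data and the `MinConfidence` threshold are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Tuples stand in for DocumentChunk so the snippet runs without the package.
var sampleChunks = new List<(int Index, int PageNumber, string Text, double Confidence)>
{
    (0, 1, "Quarterly revenue rose.", 0.98),
    (1, 1, "Il1egib1e sc@n text", 0.41),
    (2, 2, "Costs were flat.", 0.93),
};

const double MinConfidence = 0.8; // hypothetical threshold; tune per corpus

// Keep only confident chunks, then count how many survive on each page.
Dictionary<int, int> keptPerPage = sampleChunks
    .Where(c => c.Confidence >= MinConfidence)
    .GroupBy(c => c.PageNumber)
    .ToDictionary(g => g.Key, g => g.Count());

foreach (var (page, count) in keptPerPage.OrderBy(kv => kv.Key))
{
    Console.WriteLine($"page {page}: {count} chunk(s) kept");
}
```

Pages whose kept count is unexpectedly low may be scanned images or damaged text, and could be routed to OCR or manual review instead of the embedding step.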
Requirements
- .NET 8.0+ (tested on .NET 8, 9)
- Native Runtime: Automatically included in NuGet package
Note: .NET 6 support was dropped in v0.2.0 as it reached end-of-support in November 2024. Use v0.1.0 if you still require .NET 6 compatibility.
Building from Source
# Clone repository
git clone https://github.com/bzsanti/oxidize-pdf-dotnet.git
cd oxidize-pdf-dotnet
# Build native library
cd native
cargo build --release
# Build .NET wrapper
cd ../dotnet
dotnet build
# Run tests
dotnet test
Examples
See examples/ directory:
- BasicUsage/ - Simple text extraction
- KernelMemory/ - Full SharePoint crawler with RAG pipeline
License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) - see LICENSE file.
This is consistent with the underlying oxidize-pdf library which is also licensed under AGPL-3.0.
Key Points:
- ✅ Free for open-source projects
- ✅ Commercial use allowed (must share modifications)
- ⚠️ Network use = distribution (must share source)
- ⚠️ If you use this in a web service, you must make your code public
For commercial licensing or questions, contact: licensing@belowzero.tech
Contributing
Contributions are welcome! Please read CONTRIBUTING.md for guidelines.
Acknowledgments
Built on top of oxidize-pdf by Santiago Fernández Muñoz.
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net8.0
  - System.Text.Json (>= 8.0.5)
- net9.0
  - System.Text.Json (>= 8.0.5)