HeroParser 2.4.1

dotnet add package HeroParser --version 2.4.1
                    
NuGet\Install-Package HeroParser -Version 2.4.1
                    
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="HeroParser" Version="2.4.1" />
                    
For projects that support PackageReference, copy this XML node into the project file to reference the package.
<PackageVersion Include="HeroParser" Version="2.4.1" />
                    
Directory.Packages.props
<PackageReference Include="HeroParser" />
                    
Project file
For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.
paket add HeroParser --version 2.4.1
                    
#r "nuget: HeroParser, 2.4.1"
                    
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
#:package HeroParser@2.4.1
                    
#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.
#addin nuget:?package=HeroParser&version=2.4.1
                    
Install as a Cake Addin
#tool nuget:?package=HeroParser&version=2.4.1
                    
Install as a Cake Tool

HeroParser - High-Performance, AI-Native CSV, Fixed-Width, Excel (.xlsx), JSONL & HTB Parser

Build and Test NuGet License: MIT

HeroParser is a zero-allocation, SIMD-accelerated tabular data parser and writer for .NET 8, 9, and 10. Designed for extreme speed, memory efficiency, and Native AOT compatibility, it also offers first-class integrations for AI agents, vector embeddings, and LLM pipelines.

Why Choose HeroParser?

  • Extreme Performance: Engineered with AVX-512, AVX2, and ARM NEON SIMD optimizations to deliver ultra-high-throughput reading and writing.
  • AI-Native integrations: Built-in support for token-budgeted chunking, LLM output structured repair, vector embedding pipelines, and agent tool mapping.
  • Zero Dependencies & Low Footprint: Operates with zero external packages. Employs a fixed 112-byte heap memory footprint on the reading hot-path regardless of file size.
  • Unified Attributes: Annotate your C# classes once, and use them across CSV, Excel, Fixed-Width, and HTB APIs.

Performance & Memory

Tested under .NET 10.0 on an AMD Ryzen AI 9 HX PRO 370 CPU:

  • Read Throughput: SIMD-accelerated UTF-8 (byte[]) read paths on both quoted and unquoted data.
  • Write Throughput: Highly optimized CSV/JSONL serialization achieving massive throughput.
  • GC Allocations: Fixed 112-byte allocation throughout parsing, representing a 97% memory reduction compared to traditional reflection-based parsers.
  • String Generation: Up to 64% speedup on synchronous text generation via pre-allocated capacities.

View live performance graphs and history on the HeroParser Performance Portal.


Install

dotnet add package HeroParser

Simple Quick Starts

Define your record type. Decorate it with [GenerateBinder] to enable source-generated, reflection-free, and Native AOT-safe binding:

using HeroParser;

[GenerateBinder]
public class Product
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
    
    [Validate(RangeMin = 0)]
    public decimal Price { get; set; }
    
    [Parse(Format = "yyyy-MM-dd")]
    public DateTime ReleaseDate { get; set; }
}

CSV

// Read a file in one line (zero-allocation binding)
List<Product> products = Csv.Read<Product>().FromFile("products.csv").ToList();

// Async stream millions of rows without buffering
await foreach (Product p in Csv.Read<Product>().FromFileAsync("products.csv"))
{
    Console.WriteLine($"{p.Name}: {p.Price:C}");
}

// Write collection to a file
Csv.Write<Product>().ToFile("out.csv", products);

Excel (.xlsx)

Reads and writes Excel workbooks with zero external dependencies (utilizes only standard .NET compression and XML packages).

// Read Excel files
List<Product> products = Excel.Read<Product>().FromFile("products.xlsx");

// Read specific sheet
var sheetData = Excel.Read<Product>().FromSheet("Sales").FromFile("workbook.xlsx");

// Write Excel workbook
Excel.Write<Product>().ToFile("out.xlsx", products);

Fixed-Width

Map properties to specific character boundaries using the [PositionalMap] attribute:

[GenerateBinder]
public class Employee
{
    [PositionalMap(Start = 0, Length = 10)]
    public string Id { get; set; } = "";

    [PositionalMap(Start = 10, Length = 30)]
    public string Name { get; set; } = "";

    [PositionalMap(Start = 40, Length = 10, Alignment = FieldAlignment.Right)]
    public decimal Salary { get; set; }
}
// Read positional records
var employees = FixedWidth.Read<Employee>().FromFile("employees.dat").Records;

// Write positional records
FixedWidth.Write<Employee>().ToFile("out.dat", employees);

JSON Lines (JSONL)

Perfect for AI fine-tuning datasets, streamed LLM responses, and database bulk loading.

// Read JSONL files (AOT-safe)
var records = Jsonl.Read<Product>().FromFile("products.jsonl").ToList();

// Write JSONL files
Jsonl.Write<Product>().ToFile("out.jsonl", records);

High-Throughput Tabular Binary (HTB)

A custom, high-speed binary serialization format optimized for zero allocations, platform independence, and vector embedding storage (supporting float[] arrays).

Why HTB?
  • Zero Heap Allocations: Utilizes Roslyn source generators ([GenerateBinder]) to map properties directly with zero-boxing and zero-reflection overhead.
  • Vector Embedding Native: Natively supports floating-point arrays (float[]), enabling ultra-fast vector embedding serialization without string parsing overhead.
  • Platform-Independent Endianness: Automatically handles big-endian byte-order reversal for floats, doubles, and ints for cross-architecture safety.
  • Allocation-Free CSV ↔ HTB Conversion: Stream-convert CSV directly to HTB (and vice-versa) with zero heap allocation overhead.
  • AOT & Trim Ready: Fully compatible with Native AOT compilation out-of-the-box.
// Read HTB binary files (AOT-safe)
List<Product> products = Htb.Read<Product>().FromFile("products.htb").ToList();

// Async stream HTB records
await foreach (Product p in Htb.Read<Product>().FromFileAsync("products.htb"))
{
    Console.WriteLine($"{p.Name}: {p.Price:C}");
}

// Write records to an HTB file
Htb.Write<Product>().ToFile("out.htb", products);

// Direct, allocation-free CSV ↔ HTB conversions
Htb.ConvertFromCsv("products.csv", "products.htb", HtbSchema.FromType<Product>());

Unified Attribute System

Annotate a single record class once, and read or write it across multiple formats:

Attribute Purpose CSV Excel Fixed-Width HTB
[GenerateBinder] Emits Roslyn source-generated, reflection-free mapping binder Yes Yes Yes Yes
[TabularMap(Name, Index)] Maps property to column header or index Yes Yes No Yes
[PositionalMap(Start, Length...)] Declares character position, alignment, and pad characters No No Yes No
[Parse(Format)] Converts raw values to custom types (e.g. DateTime format) Yes Yes Yes No
[Format(WriteFormat...)] Customizes output formatting during serialization Yes Yes Yes No
[Validate(Range, Pattern...)] Validates properties bidirectionally (Strict/Lenient modes) Yes Yes Yes Yes

AI-Native Tabular Capabilities

HeroParser includes first-class support for LLM, vector search, and RAG pipelines:

1. LLM Structured Output Repair (LlmRepair)

Repairs truncated final rows (unclosed quotes/escapes) and strips markdown code blocks from raw LLM text streams.

using HeroParser.AI;

// Repaired on-the-fly and parsed directly into strongly-typed records
await foreach (var dev in LlmRepair.ReadFromTextAsync<Developer>(rawLlmResponse))
{
    Console.WriteLine($"{dev.Name} is a {dev.Role}");
}

2. Tabular Embedding Pipeline (ToLlmEmbeddingsAsync)

Batches streamed records and pairs them with vector embeddings with a zero-allocation, token-budgeted streaming wrapper.

using HeroParser.AI;

await foreach (var chunk in developers.ToLlmEmbeddingsAsync(
    async (texts, ct) => await GetEmbeddingsFromApiAsync(texts, ct),
    options: new LlmChunkOptions { MaxTokensPerChunk = 250 },
    batchSize: 16))
{
    Console.WriteLine($"Chunk of {chunk.Chunk.EndRow - chunk.Chunk.StartRow + 1} rows embedded.");
}

3. Agent Tool Argument Mapper (SchemaMetadata.MapFromToolCall)

Maps flat dictionaries of case-insensitive arguments returned by tool calling models into typed record models, executing validation constraints and raising rich validation feedback.

using HeroParser.AI;

Developer dev = SchemaMetadata.MapFromToolCall<Developer>(arguments);

4. Tabular Context Profiler (TabularContextProfiler)

Generates structured statistical profile cards in markdown directly from datasets to inject into LLM system prompts.

string contextCard = developers.GenerateContextCard(datasetName: "Engineering Team");

5. Token-Bounded JSON Chunker (JsonLlmChunker)

Chunks datasets into valid, token-bounded JSON array blocks, optimized for ingestion into RAG context windows.

await foreach (var jsonChunk in developers.ToJsonLlmChunksAsync(options)) { ... }

Key Features

  • SIMD-accelerated CSV parsing — AVX-512, AVX2, and ARM NEON instruction sets; PCLMULQDQ-based branchless quote tracking
  • Zero allocations — fixed 4 KB stack footprint regardless of column count or file size; ArrayPool for buffers
  • AOT/trimming ready — source generators emit reflection-free binders; annotated with [RequiresUnreferencedCode] where reflection is unavoidable
  • Async streamingIAsyncEnumerable<T> for CSV, Fixed-Width, Excel, JSONL, and HTB; true non-blocking I/O with sync fast paths
  • Excel without extra dependencies — reads and writes .xlsx using only System.IO.Compression and System.Xml
  • JSONL for AI/ML pipelinesJsonl.Read<T>() / Jsonl.Write<T>() mirror the CSV builder pattern; AOT-safe via JsonTypeInfo<T>; CsvToJsonlConverter projects tabular data into OpenAI/Anthropic fine-tuning shapes
  • HTB binary format — custom, high-speed, zero-allocation binary format featuring float-array embedding support, big-endian byte-order reversal, and direct CSV ↔ HTB conversion
  • Embedding-API batchingIAsyncEnumerable<T>.BatchAsync(size) groups streamed records into fixed-size batches for OpenAI/Voyage/Cohere/Anthropic embedding calls
  • Inline vector parserVectorParser.ParseFloats(span) handles pre-computed embeddings ("[0.1,0.2,…]", comma/semicolon/whitespace separators, culture-aware)
  • DataReader supportCsv.CreateDataReader(), FixedWidth.CreateDataReader(), Excel.CreateDataReader(), Jsonl.CreateDataReader() for database bulk loading via SqlBulkCopy
  • PipeReader integrationCsv.ReadFromPipeReaderAsync(pipe) for network streaming without buffering the entire payload
  • Multi-schema CSV — discriminator-based row routing to different record types; source-generated dispatch for ~2.85x faster throughput
  • Delimiter detection — auto-detect comma, semicolon, pipe, or tab from sample rows with a confidence score
  • CSV validation — pre-flight structural checks with detailed per-row error reporting
  • Field validation[Validate] constraints (NotNull, NotEmpty, Range, Pattern) collected lazily; result.ThrowIfAnyError() for fail-fast
  • CSV injection protection — configurable sanitization modes for user-data exports
  • Progress reporting — row/byte callbacks for large-file UX
  • Custom type converters — register converters for domain types on any reader or writer
  • Write capacity pre-allocation — backing buffer capacity is pre-allocated via estimated record counts when writing collections to strings, yielding up to 64% speedup and completely eliminating buffer resize copy overhead
  • Multi-framework — .NET 8, 9, 10; CI validates all three on Windows, Linux, and macOS

Detailed Documentation

For advanced features and full API guides, see the files under the docs folder:

  • CSV Guide — Fluent readers/writers, validation, PipeReader, and multi-schema dispatching.
  • Excel Guide — Multi-sheet workbooks, custom formatting, and progress tracking.
  • Fixed-Width Guide — Positional mapping, alignment, padding, and custom type converters.
  • JSONL Guide — Fine-tuning templates, vector parsing, and Native AOT setups.
  • HTB Guide — High-Throughput Tabular Binary format, fluent APIs, CSV ↔ HTB conversion, and Native AOT support.
  • Benchmarks Guide — Execution environments, detailed CPU metrics, and comparisons.

Building & Testing

# Build all projects
dotnet build

# Run unit tests
dotnet test --filter Category=Unit

# Run integration tests
dotnet test --filter Category=Integration

# Run all tests
dotnet test

# Check code formatting
dotnet format --verify-no-changes

# Run benchmarks
dotnet run -c Release --project benchmarks/HeroParser.Benchmarks

License

MIT

Product Compatible and additional computed target framework versions.
.NET net8.0 is compatible.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed.  net9.0 is compatible.  net9.0-android was computed.  net9.0-browser was computed.  net9.0-ios was computed.  net9.0-maccatalyst was computed.  net9.0-macos was computed.  net9.0-tvos was computed.  net9.0-windows was computed.  net10.0 is compatible.  net10.0-android was computed.  net10.0-browser was computed.  net10.0-ios was computed.  net10.0-maccatalyst was computed.  net10.0-macos was computed.  net10.0-tvos was computed.  net10.0-windows was computed. 
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last Updated
2.4.1 0 6/6/2026
2.4.0 87 5/29/2026
2.3.0 104 5/28/2026
2.2.1 94 5/26/2026
2.2.0 98 5/23/2026
2.1.3 124 4/24/2026
2.1.2 156 3/17/2026
2.1.1 121 3/16/2026
2.1.0 125 3/16/2026
2.0.0 121 3/14/2026
1.7.1 125 3/13/2026
1.7.0 112 3/12/2026
1.6.4 127 3/11/2026
1.6.3 130 1/13/2026
1.6.2 128 1/12/2026
1.6.1 127 1/10/2026
1.6.0 126 1/9/2026
1.5.4 127 12/29/2025
1.5.3 137 12/29/2025
1.5.2 121 12/27/2025
Loading failed

v2.4.1 delivers comprehensive security hardening, streaming reliability improvements, and zero-allocation parser optimizations: global XML entity expansion / Zip-bomb mitigations (XXE prevention) in Excel sheet reading, thread-safe cached Regex and ReDoS timeout protections in LLM validation, token-aligned PipeReader CRLF boundary splits safeguards, HTB float array endianness fixes, and AOT-safe progressive JSONL stream enhancements. Full changelog: https://github.com/KoalaFacts/HeroParser/blob/main/CHANGELOG.md