TurboQuant 0.2.0

.NET 8.0

dotnet add package TurboQuant --version 0.2.0

NuGet\Install-Package TurboQuant -Version 0.2.0

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="TurboQuant" Version="0.2.0" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

<PackageVersion Include="TurboQuant" Version="0.2.0" />
                    

                            Directory.Packages.props

<PackageReference Include="TurboQuant" />
                    

                            Project file

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.

paket add TurboQuant --version 0.2.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: TurboQuant, 0.2.0"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package TurboQuant@0.2.0

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

#addin nuget:?package=TurboQuant&version=0.2.0
                    

                            Install as a Cake Addin

#tool nuget:?package=TurboQuant&version=0.2.0
                    

                            Install as a Cake Tool

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

TurboQuant

Near-optimal vector quantization for .NET 8+

Compress embeddings to 2-4 bits with <0.5% quality loss. Zero dependencies. AOT-ready.

Implements Algorithm 1 from TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Zandieh et al., ICLR 2026). Lloyd-Max quantization on the exact Beta(d/2, 1/2) distribution with randomized rotation.

768-dim float32 embedding  ──>  TurboQuant 4-bit  ──>  512 bytes
       3072 bytes                                       6x smaller
                                                        cosine > 0.995

Install

dotnet add package TurboQuant

30-second Example

using TurboQuant;

var quantizer = TurboQuantBuilder.Create(dim: 768).WithBits(4).BuildMSE();

var packed = quantizer.Quantize(embedding);        // float[768] -> PackedVector
var restored = quantizer.Dequantize(packed);       // PackedVector -> float[768]
var score = quantizer.ApproxSimilarity(a, b);      // LUT-based, zero allocation

// Zero-alloc dequantize into existing buffer
Span<float> buffer = stackalloc float[768];
quantizer.Dequantize(packed, buffer);

// Serialize for storage
var bytes = packed.ToBytes();
var loaded = PackedVector.FromBytes(bytes);

Why TurboQuant?

	TurboQuant	Product Quantization	Scalar Quantization
Quality (4-bit)	cosine > 0.995	cosine ~ 0.96	cosine ~ 0.98
Quantization speed	~44 us/vec	~500 ms/vec	~1 us/vec
Similarity search	LUT, zero alloc	Asymmetric dist	Full decompress
Needs training data?	No	Yes (codebook training)	No
Online / streaming?	Yes	No (batch only)	Yes
Theory-backed MSE?	Yes (proven optimal)	No	No

Performance

Benchmarked on Apple Silicon (ARM64), .NET 10, Release mode:

Operation	dim=256	dim=768	Heap Alloc
`Quantize`	8 us	57 us	152 B
`Dequantize`	42 us	42 us	1 KB
`ApproxSimilarity`	0.2 us	0.8 us	0 B
`DotProduct` (SIMD)	20 ns	66 ns	0 B
`L2Norm` (SIMD)	13 ns	51 ns	0 B
`CosineSimilarity` (SIMD)	35 ns	126 ns	0 B

SIMD operations use Vector<T> for automatic ISA selection (NEON on ARM, SSE/AVX2/AVX-512 on x86) with dual accumulators to hide data-dependency latency.

ApproxSimilarity uses a precomputed 16x16 lookup table for 4-bit. No unpacking, no centroid lookup. Direct byte-level scan.

QuantizeBatch processes vectors sequentially with zero overhead beyond the per-vector allocation.

Compression Quality

Validated against the paper's theoretical bounds:

Bits	Ratio	D_mse	Cosine	Paper Upper Bound
4-bit	~6-8x	⇐ 0.0095	> 0.995	0.011
3-bit	~8-10x	⇐ 0.035	> 0.98	0.043
2-bit	~12-16x	⇐ 0.118	> 0.93	0.170

Ratios depend on whether the dimension is a power of 2 (no padding needed) or not.

Use Cases

Embedding Compression (OpenAI, Ollama, Azure AI)

var embedding = await embeddingModel.GenerateAsync(["your text"]);
var vector = embedding[0].Vector.ToArray();

var quantizer = TurboQuantBuilder.Create(dim: 768).WithBits(4).BuildMSE();
var packed = quantizer.Quantize(vector);

// Store packed vectors in your DB, cache, or memory
// Search directly on compressed data (zero allocation):
var score = quantizer.ApproxSimilarity(queryPacked, docPacked);

100K embeddings (768d): 292 MB → 37 MB

RAG Pipeline: Compressed Vector Store

var quantizer = TurboQuantBuilder.Create(dim: 768).WithBits(4).BuildMSE();
var store = new Dictionary<string, PackedVector>();

// Index
foreach (var doc in documents)
{
    var emb = await generator.GenerateAsync([doc.Text]);
    store[doc.Id] = quantizer.Quantize(emb[0].Vector.ToArray());
}

// Search (LUT-based, zero allocation)
var queryPacked = quantizer.Quantize(queryEmbedding);
var topK = store
    .Select(kv => (kv.Key, Score: quantizer.ApproxSimilarity(queryPacked, kv.Value)))
    .OrderByDescending(x => x.Score)
    .Take(10);

LLM KV Cache: Extend Context Length

var (cache, window) = TurboQuantBuilder
    .Create(dim: 128)             // head dimension
    .WithKeyBits(4)               // keys: 4-bit (attention precision)
    .WithValueBits(2)             // values: 2-bit (more aggressive)
    .WithResidualWindow(tokens: 128)
    .BuildKVCache();

// Append during generation. Recent tokens stay in float32,
// older tokens are automatically compressed
window.Append(keyVector, valueVector);

// Retrieve for attention computation
var allKeys = window.GetAllKeys();     // recent: exact, older: decompressed
var allValues = window.GetAllValues();

32 layers x 32 heads x 4K tokens x 128d: 32 GB → ~6 GB

Semantic Kernel: Compressed Memory

public class CompressedMemoryStore : IMemoryStore
{
    private readonly TurboQuantMSE _quantizer;
    private readonly ConcurrentDictionary<string, (PackedVector Vec, MemoryRecord Rec)> _data = new();

    public CompressedMemoryStore(int dim) =>
        _quantizer = TurboQuantBuilder.Create(dim).WithBits(4).BuildMSE();

    public Task<string> UpsertAsync(string collection, MemoryRecord record, CancellationToken ct)
    {
        var packed = _quantizer.Quantize(record.Embedding.Vector.ToArray());
        _data[record.Metadata.Id] = (packed, record);
        return Task.FromResult(record.Metadata.Id);
    }

    public Task<MemoryRecord?> GetNearestMatchAsync(
        string collection, ReadOnlyMemory<float> query, double minScore, CancellationToken ct)
    {
        var qPacked = _quantizer.Quantize(query.Span);
        return Task.FromResult(_data.Values
            .Select(d => (d.Rec, Score: (double)_quantizer.ApproxSimilarity(qPacked, d.Vec)))
            .Where(x => x.Score >= minScore)
            .MaxBy(x => x.Score).Rec);
    }
}

ML.NET Pipeline

var quantizer = TurboQuantBuilder.Create(dim: 384).WithBits(4).BuildMSE();

var pipeline = mlContext.Transforms.CustomMapping<Input, Output>(
    (input, output) =>
    {
        var packed = quantizer.Quantize(input.Features);
        output.Compressed = quantizer.Dequantize(packed);
    },
    contractName: "TurboQuant");

ONNX Runtime: Post-Inference Compression

using var session = new InferenceSession("embedding_model.onnx");
var quantizer = TurboQuantBuilder.Create(dim: 768).WithBits(4).BuildMSE();

using var results = session.Run(inputs);
var outputTensor = results.First().AsEnumerable<float>().ToArray();

var compressed = new PackedVector[batchSize];
quantizer.QuantizeBatch(outputTensor, compressed);

Serialization

// Save
var bytes = packed.ToBytes();
File.WriteAllBytes("vector.bin", bytes);

// Load
var loaded = PackedVector.FromBytes(File.ReadAllBytes("vector.bin"));
var restored = quantizer.Dequantize(loaded);

Rotation Strategies

TurboQuant defaults to random orthogonal rotation (QR decomposition of a Gaussian matrix), which matches the paper exactly and works with any dimension without overhead.

For power-of-2 dimensions (128, 256, 512, 1024), the Hadamard rotation is an excellent alternative: identical quality, O(d log d) time, and O(d) memory instead of O(d^2).

For non-power-of-2 dimensions (384, 768, 1536), Hadamard pads internally to the next power of 2, which increases packed data size by up to 33%. A Trace.TraceWarning is emitted when this happens.

// Default: random orthogonal (paper-correct, any dimension)
var quantizer = TurboQuantBuilder.Create(dim: 768).WithBits(4).BuildMSE();

// Fast path for power-of-2 dimensions (recommended for KV cache with head dim 64/128)
var fast = TurboQuantBuilder.Create(dim: 1024).WithBits(4).WithHadamardRotation().BuildMSE();

	Random Orthogonal (default)	Hadamard
Time	O(d^2)	O(d log d)
Memory	O(d^2) per quantizer	O(d) per quantizer
Packed size	Exact (d coordinates)	Exact if power-of-2, padded otherwise
dim=768 quantize	~57 us	~1.3 us (but 33% more storage)
dim=1024 quantize	~100 us	~1.6 us (no padding)
D_mse	= theoretical	= theoretical (power-of-2), better (non-power-of-2)

Best practice: if your embedding model supports configurable dimensions (e.g. OpenAI, nomic), choose a power-of-2 dimension (512, 1024) to get the best of both strategies.

Note on seed correlation: the rotation matrix is generated from a random seed. If your test vectors are generated from the same seed (e.g. new Random(42) for both), the rotation and data become correlated, causing degraded quality. This does not happen in practice (embeddings come from a model, not from Random), but use distinct seeds in synthetic benchmarks.

How It Works

                    ┌──────────────┐
   float[768]  ──>  │  Normalize   │ store ||x||
                    └──────┬───────┘
                           v
                    ┌──────────────┐
                    │   Rotate     │ random orthogonal (default) or Hadamard
                    └──────┬───────┘
                           v
                    ┌──────────────┐
                    │  Quantize    │ per-coordinate Lloyd-Max on Beta(d/2, 1/2)
                    └──────┬───────┘
                           v
                    ┌──────────────┐
                    │  Bit-Pack    │ 4-bit: 2 values/byte, 3-bit: 8 values/3 bytes
                    └──────┬───────┘
                           v
                    PackedVector (data + norm)

The rotation ensures each coordinate is approximately Beta-distributed regardless of input, enabling provably near-optimal scalar quantization. D_mse is within factor sqrt(3)pi/2 ~ 2.7x of the information-theoretic lower bound (Theorem 3 in the paper).

ApproxSimilarity computes cosine similarity directly on packed bytes via a precomputed 16x16 centroid dot-product lookup table (4-bit). No unpacking, no centroid lookup, zero heap allocation.

Project Layout

src/TurboQuant/
  Core/
    Rotation/       RandomRotation (default, paper-correct), HadamardRotation (fast, power-of-2)
    Codebook/       BetaCodebook (Lloyd-Max on exact Beta distribution, with LUT)
    Packing/        BitPacker (2/3/4-bit), PackedVector (serializable, immutable)
    Quantizers/     TurboQuantMSE, TurboQuantAsymmetric
    Simd/           DotProductSimd (Vector<T> auto-ISA: NEON/SSE/AVX2/AVX-512)
  Cache/            KVCache, ResidualWindow
  Diagnostics/      CompressionStats (thread-safe), QuantizationBenchmark
  TurboQuantBuilder.cs

Build & Test

dotnet build TurboQuantNet.slnx
dotnet test
dotnet run --project samples/TurboQuant.Demo
dotnet run -c Release --project benchmarks/TurboQuant.Benchmarks

License

MIT. Based on the paper by Zandieh, Daliri, Hadian & Mirrokni (Google Research / NYU / Google DeepMind).

Product	Compatible and additional computed target framework versions.
.NET	net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

Product

.NET

Compatible target framework(s)

Included target framework(s) (in package)

Learn more about Target Frameworks and .NET Standard.

net10.0
- No dependencies.
net8.0
- No dependencies.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
0.2.0	100	4/12/2026
0.1.1	105	4/3/2026