Llmdot.Cli
0.1.0
dotnet tool install --global Llmdot.Cli --version 0.1.0
dotnet new tool-manifest
dotnet tool install --local Llmdot.Cli --version 0.1.0
llmdot
Run local GGUF language models from .NET — one package, one format, one programming model.
A .NET-native local inference runtime for GGUF language models. CPU-first. Managed-by-default. Idiomatic. Trimming- and NativeAOT-friendly.
Vision • Architecture • Roadmap • Platform Strategy • Model Reference
What is llmdot
llmdot is a native .NET runtime for local language model inference built around the GGUF model format. It executes major decoder-only transformer and hybrid architectures in the 1–8B parameter range — including multimodal variants — through architecture-agnostic execution templates resolved from GGUF metadata at load time.
The project is designed around a single opinionated goal: make local LLM execution in .NET as simple as adding a NuGet package, loading a GGUF file, and streaming tokens. The default path is pure managed code with zero native runtime dependencies, focused on CPU-first execution. Optional packages provide GPU acceleration through thin backend adapters.
```csharp
using System;
using Llmdot;

// Load a GGUF model, open a chat session, and stream tokens as they are generated.
await using var model = await LlmModel.LoadAsync("phi-3-mini-q4_k_m.gguf");
await using var session = model.CreateChatSession();

await foreach (var token in session.StreamAsync("Explain GGUF in one paragraph."))
    Console.Write(token);
```
The code sample above reflects the target API shape. See roadmap for the current state of the implementation.
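Since the target API streams through `IAsyncEnumerable<T>`, cancellation can follow the standard .NET pattern. The sketch below does not use llmdot at all: `DemoStream.FakeTokensAsync` is a hypothetical stand-in for a model's token stream, used only to show how a `CancellationToken` would reach the producer through `WithCancellation`.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Stop generation after 250 ms, the way a request timeout or a "stop" button would.
using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(250));

try
{
    await foreach (var token in DemoStream.FakeTokensAsync().WithCancellation(cts.Token))
        Console.Write(token);
}
catch (OperationCanceledException)
{
    Console.WriteLine();
    Console.WriteLine("[generation cancelled]");
}

public static class DemoStream
{
    // Hypothetical stand-in for a model's token stream; not part of llmdot.
    public static async IAsyncEnumerable<string> FakeTokensAsync(
        int delayMs = 50,
        int maxTokens = int.MaxValue,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        string[] tokens = { "GGUF", " is", " a", " binary", " model", " format", "." };
        for (var i = 0; i < maxTokens; i++)
        {
            // Task.Delay throws an OperationCanceledException once the token is cancelled.
            await Task.Delay(delayMs, cancellationToken);
            yield return tokens[i % tokens.Length];
        }
    }
}
```

The `[EnumeratorCancellation]` attribute is what lets the token passed to `WithCancellation` flow into the iterator, so the consumer never has to thread it through by hand.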
Why llmdot
The .NET inference landscape today forces developers into one of two uncomfortable tradeoffs:
| Option | Strength | Tradeoff |
|---|---|---|
| llama.cpp bindings | Broad model support | Native binaries, per-platform packaging, upstream integration debt |
| ONNX-based stacks | Strong hardware acceleration | Model conversion, large native dependencies, toolchain friction |
llmdot takes a third position:
- GGUF-native execution with no conversion pipeline
- Pure managed core — trimming-friendly, NativeAOT-friendly, single-file publish-friendly
- Idiomatic .NET APIs built for IAsyncEnumerable<T> streaming, DI, and Microsoft.Extensions.Hosting
- Config-driven architectures: new model families plug into existing execution templates with zero engine code
- Focus on the common case — small-to-mid quantized models where developer experience beats peak throughput
Who llmdot is for
.NET developers who want to ship local, private, or offline AI features without fighting the inference stack.
Software architects evaluating local LLM runtimes for desktop, edge, and server workloads where packaging simplicity, deployment predictability, and platform portability matter as much as raw throughput.
Teams building on Microsoft.Extensions.AI who need an IChatClient-compatible backend that runs fully in-process, with no sidecar services and no native toolchain.
If you have ever thought "I just want to load a GGUF file in my ASP.NET Core app and stream tokens" — llmdot is built for you.
Design Principles
- Zero native dependencies in the core path. The default install is pure managed .NET. Native acceleration is always additive.
- GGUF is the ingestion format. No ONNX conversion. No proprietary packaging. Community models work out of the box.
- Architecture support is declarative, not hard-coded. New model families are resolved through TransformerConfig from GGUF metadata.
- Model compatibility is decoupled from hardware backend. CPU, Vulkan, or Metal: same model, same code.
- Optimize for the common case. 1–8B quantized models on consumer hardware. Small enough to fit, big enough to matter.
- Incremental acceleration. Backends offload individual operations, not entire graphs. No all-or-nothing rewrites.
Supported Architectures
All supported architectures collapse into four execution templates. Within each template, all variation is expressed through configuration — no per-model code paths.
| Template | Architectures | Example Models |
|---|---|---|
| LLaMA-like (sequential pre-norm) | llama, phi3, qwen2, stablelm, mistral | LLaMA-3.2, Qwen-2, Phi-3, Mistral-7B, StableLM-2 |
| GPT-NeoX-like (parallel residual) | gptneox, phi2 | Pythia, Phi-2 |
| Gemma-like (embedding scaling + post-norm) | gemma, gemma2 | Gemma 2B, Gemma-2 2B/9B |
| LFM2-like (hybrid convolution-attention) | lfm2, lfm2_moe | LFM2 350M–2.6B, LFM2-VL, LFM2-8B-A1B |
Multimodal variants (vision-language via SigLIP2, speech via FastConformer/Mimi) plug in as modality encoders on top of the base LLM backbone — the core runtime is unchanged.
See doc/model-architectures.md for the full reference.
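To make the idea of resolving a template from metadata concrete, here is a minimal sketch. It assumes the metadata has already been parsed into a dictionary. The type names (`TransformerConfigSketch`, `ConfigResolver`, `ExecutionTemplate`) and the handful of fields are illustrative assumptions, not the project's actual types. The key names follow the GGUF convention of prefixing per-model keys with the value of `general.architecture`, and the architecture-to-template mapping is taken from the table above.

```csharp
using System;
using System.Collections.Generic;

// Metadata as it might look after parsing a LLaMA-family GGUF file.
var metadata = new Dictionary<string, object>
{
    ["general.architecture"] = "llama",
    ["llama.block_count"] = 32u,
    ["llama.embedding_length"] = 4096u,
    ["llama.attention.head_count"] = 32u,
    ["llama.attention.head_count_kv"] = 8u,
};

Console.WriteLine(ConfigResolver.Resolve(metadata));

public enum ExecutionTemplate { LlamaLike, GptNeoXLike, GemmaLike, Lfm2Like }

// Illustrative subset of what a resolved config could carry.
public sealed record TransformerConfigSketch(
    string Architecture, ExecutionTemplate Template,
    int LayerCount, int EmbeddingLength, int HeadCount, int KvHeadCount);

public static class ConfigResolver
{
    // Architecture names and their templates, as listed in the table above.
    private static readonly Dictionary<string, ExecutionTemplate> Templates = new()
    {
        ["llama"] = ExecutionTemplate.LlamaLike,
        ["phi3"] = ExecutionTemplate.LlamaLike,
        ["qwen2"] = ExecutionTemplate.LlamaLike,
        ["stablelm"] = ExecutionTemplate.LlamaLike,
        ["mistral"] = ExecutionTemplate.LlamaLike,
        ["gptneox"] = ExecutionTemplate.GptNeoXLike,
        ["phi2"] = ExecutionTemplate.GptNeoXLike,
        ["gemma"] = ExecutionTemplate.GemmaLike,
        ["gemma2"] = ExecutionTemplate.GemmaLike,
        ["lfm2"] = ExecutionTemplate.Lfm2Like,
        ["lfm2_moe"] = ExecutionTemplate.Lfm2Like,
    };

    public static TransformerConfigSketch Resolve(IReadOnlyDictionary<string, object> metadata)
    {
        var arch = (string)metadata["general.architecture"];
        if (!Templates.TryGetValue(arch, out var template))
            throw new NotSupportedException($"Unsupported architecture '{arch}'.");

        // Per-model keys are prefixed with the architecture name.
        int Get(string suffix) => Convert.ToInt32(metadata[$"{arch}.{suffix}"]);

        var heads = Get("attention.head_count");
        // Treat the KV head count as optional and default to plain multi-head attention.
        var kvHeads = metadata.TryGetValue($"{arch}.attention.head_count_kv", out var kv)
            ? Convert.ToInt32(kv)
            : heads;

        return new TransformerConfigSketch(
            arch, template, Get("block_count"), Get("embedding_length"), heads, kvHeads);
    }
}
```

In a design like this, adding a model family that fits an existing template is a one-line change to the lookup table, which is the property the section describes.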
Architecture at a Glance
┌────────────────────────────────────────────────────────────────┐
│ Application (ASP.NET Core, desktop, CLI, worker service) │
└──────────────┬─────────────────────────────────────────────────┘
│ IChatClient / IAsyncEnumerable<string>
┌──────────────▼─────────────────────────────────────────────────┐
│ Llmdot.Extensions.AI (Microsoft.Extensions.AI integration) │
├────────────────────────────────────────────────────────────────┤
│ Llmdot.Core │
│ ┌───────────┐ ┌─────────────────┐ ┌──────────────────────┐ │
│ │ GGUF │ │ Architecture │ │ Sampling & Tokenizer │ │
│ │ Loader │─▶│ Resolver │─▶│ │ │
│ └───────────┘ └────────┬────────┘ └──────────────────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Model Graph │ 4 execution templates │
│ │ + KV / Conv │ resolved from config │
│ │ State │ │
│ └────────┬────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Tensor Runtime │ managed kernels, │
│ │ (IComputeBackend)│ Span<T>, intrinsics │
│ └────────┬────────┘ │
└──────────────────────────┼─────────────────────────────────────┘
▼
┌────────────┴────────────┐
│ │
┌─────▼─────┐ ┌──────▼──────┐
│ CPU │ │ Optional │
│ (default, │ │ GPU: Vulkan │
│ managed) │ │ / Metal / │
│ │ │ CUDA │
└───────────┘ └─────────────┘
The model graph reads only from a resolved TransformerConfig — never from raw GGUF keys directly. This is the central abstraction that eliminates per-architecture code paths. See doc/architecture.md.
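One way to picture per-operation offload is a backend that routes each call to an accelerator when it supports the operation and to the CPU otherwise. The sketch below is an assumption about how that could look. Only the name `IComputeBackend` appears in this README, so the interface here is deliberately named `IComputeBackendSketch`, and every member, along with `RoutingBackend` and `MatMulOnlyBackend`, is hypothetical.

```csharp
using System;

var accelerated = new MatMulOnlyBackend();
var backend = new RoutingBackend(accelerated, new CpuBackend());

float[] a = { 1, 2, 3, 4 };   // 2x2, row-major
float[] b = { 5, 6, 7, 8 };   // 2x2, row-major
var c = new float[4];
backend.MatMul(a, b, c, 2, 2, 2);   // handled by the accelerated backend
backend.Softmax(c);                 // falls back to the CPU backend
Console.WriteLine($"Accelerated MatMul calls: {accelerated.MatMulCalls}");

public enum TensorOp { MatMul, Softmax }

// Hypothetical shape; the real IComputeBackend members are not shown in this README.
public interface IComputeBackendSketch
{
    bool Supports(TensorOp op);
    void MatMul(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result, int m, int k, int n);
    void Softmax(Span<float> values);
}

public sealed class CpuBackend : IComputeBackendSketch
{
    public bool Supports(TensorOp op) => true;

    // Naive row-major (m x k) * (k x n) product.
    public void MatMul(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result, int m, int k, int n)
    {
        for (var i = 0; i < m; i++)
            for (var j = 0; j < n; j++)
            {
                var sum = 0f;
                for (var p = 0; p < k; p++) sum += a[i * k + p] * b[p * n + j];
                result[i * n + j] = sum;
            }
    }

    // Numerically stable softmax, computed in place.
    public void Softmax(Span<float> values)
    {
        var max = float.NegativeInfinity;
        foreach (var v in values) max = MathF.Max(max, v);
        var total = 0f;
        for (var i = 0; i < values.Length; i++)
        {
            values[i] = MathF.Exp(values[i] - max);
            total += values[i];
        }
        for (var i = 0; i < values.Length; i++) values[i] /= total;
    }
}

// Stand-in for a GPU adapter that implements only one operation.
public sealed class MatMulOnlyBackend : IComputeBackendSketch
{
    private readonly CpuBackend _device = new();   // pretend this is a GPU dispatch
    public int MatMulCalls { get; private set; }
    public bool Supports(TensorOp op) => op == TensorOp.MatMul;

    public void MatMul(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result, int m, int k, int n)
    {
        MatMulCalls++;
        _device.MatMul(a, b, result, m, k, n);
    }

    public void Softmax(Span<float> values) => throw new NotSupportedException();
}

// Sends each operation to the preferred backend if it supports it, else to the fallback.
public sealed class RoutingBackend : IComputeBackendSketch
{
    private readonly IComputeBackendSketch _preferred;
    private readonly IComputeBackendSketch _fallback;

    public RoutingBackend(IComputeBackendSketch preferred, IComputeBackendSketch fallback)
        => (_preferred, _fallback) = (preferred, fallback);

    private IComputeBackendSketch Pick(TensorOp op) => _preferred.Supports(op) ? _preferred : _fallback;
    public bool Supports(TensorOp op) => _preferred.Supports(op) || _fallback.Supports(op);

    public void MatMul(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result, int m, int k, int n)
        => Pick(TensorOp.MatMul).MatMul(a, b, result, m, k, n);

    public void Softmax(Span<float> values) => Pick(TensorOp.Softmax).Softmax(values);
}
```

Because the model graph would only ever see the routing backend, a partially implemented accelerator can ship without blocking any model from running.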
Project Goals
- Load and execute supported GGUF models directly from .NET
- Cover all major 1–8B decoder architectures via the four execution templates
- Support small multimodal models (vision-language, audio) through pluggable modality encoders
- Provide a clean chat and text-generation API with async streaming and cancellation
- Integrate naturally with Microsoft.Extensions.AI abstractions
- Deliver strong CPU performance for quantized small-to-mid-sized models
- Offer optional GPU compute backends (Vulkan, Metal) without coupling model support to a vendor format
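As a rough illustration of what a managed CPU kernel can look like, the sketch below computes a dot product with `System.Numerics.Vector<T>`, which the JIT maps to the widest SIMD registers available. `Kernels.Dot` is a hypothetical helper written for this example, not llmdot code, and it makes no claim about how the project's kernels are actually structured.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

var x = new float[10];
var y = new float[10];
for (var i = 0; i < x.Length; i++) { x[i] = i + 1; y[i] = 2f; }
Console.WriteLine(Kernels.Dot(x, y));

public static class Kernels
{
    // Hypothetical helper: SIMD dot product over two equal-length spans.
    public static float Dot(ReadOnlySpan<float> x, ReadOnlySpan<float> y)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("Spans must have the same length.");

        // Reinterpret the leading part of each span as whole SIMD vectors.
        var vx = MemoryMarshal.Cast<float, Vector<float>>(x);
        var vy = MemoryMarshal.Cast<float, Vector<float>>(y);

        var acc = Vector<float>.Zero;
        for (var i = 0; i < vx.Length; i++)
            acc += vx[i] * vy[i];

        var sum = Vector.Sum(acc);

        // Scalar tail for elements that did not fill a whole vector.
        for (var i = vx.Length * Vector<float>.Count; i < x.Length; i++)
            sum += x[i] * y[i];

        return sum;
    }
}
```

The same code runs unchanged on hardware with different vector widths, because `Vector<float>.Count` is resolved at JIT time rather than hard-coded.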
Non-Goals
- Be the fastest inference engine on every hardware target
- Replace vendor-optimized GPU runtimes for large-scale serving
- Require ONNX conversion or proprietary model packaging
- Target frontier-scale (70B+) models as an early milestone
- Accelerate via NPU. NPUs are graph compilers, not programmable compute. See doc/architecture.md for the reasoning.
Packaging
| Package | Purpose | Dependencies |
|---|---|---|
| Llmdot.Core | GGUF loader, model graph, CPU backend, sampling, tokenizer | Pure managed .NET |
| Llmdot.Extensions.AI | IChatClient + Microsoft.Extensions.AI integration | Llmdot.Core |
| Llmdot.Backends.Vulkan (planned) | Vulkan compute acceleration | Native Vulkan loader |
| Llmdot.Backends.Metal (planned) | Metal compute acceleration (Apple Silicon) | Native Metal |
| Llmdot.Multimodal.Vision (planned) | SigLIP2 vision encoder + connector | Llmdot.Core |
The core runtime is the single required dependency. Everything else is additive and opt-in.
Repository Layout
llmdot/
├── src/
│ ├── Llmdot.Core/ Core runtime: GGUF loader, graph, CPU backend
│ └── Llmdot.Extensions.AI/ Microsoft.Extensions.AI integration
├── samples/
│ └── Llmdot.Sample/ Minimal end-to-end example
├── tests/
│ └── Llmdot.Core.Tests/ Unit and integration tests
├── benches/
│ └── Llmdot.Benchmarks/ BenchmarkDotNet performance suite
└── doc/ Vision, architecture, roadmap, platform strategy
Target frameworks: net8.0, net9.0, net10.0. Nullable enabled, warnings-as-errors, LangVersion=13.0.
Status
Pre-alpha. The specification, architecture, and execution template design are stable. Implementation is in active development. Do not use in production yet.
Initial release targets:
- Architecture and execution template design
- GGUF loader (header, metadata, tensors, tokenizer assets)
- TransformerConfig resolver across all four templates
- CPU reference backend with quantized kernels
- LLaMA-like template end-to-end
- Token streaming via IAsyncEnumerable<T>
- IChatClient integration
- Remaining three execution templates
- Optional GPU backends
Track progress in doc/roadmap.md.
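For readers unfamiliar with the format, the first step of a GGUF loader is small enough to sketch. The layout assumed here is the fixed header of GGUF version 2 and later: a 4-byte `GGUF` magic, a `uint32` version, then `uint64` tensor and metadata key-value counts, all little-endian. `GgufHeader` and `GgufHeaderReader` are hypothetical names for this example and are not taken from the llmdot source.

```csharp
using System;
using System.IO;
using System.Text;

// Build a tiny in-memory header and read it back.
using var stream = new MemoryStream();
using (var writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true))
{
    writer.Write(Encoding.ASCII.GetBytes("GGUF"));
    writer.Write(3u);      // version
    writer.Write(291UL);   // tensor count
    writer.Write(24UL);    // metadata key-value count
}
stream.Position = 0;
Console.WriteLine(GgufHeaderReader.Read(stream));

public readonly record struct GgufHeader(uint Version, ulong TensorCount, ulong MetadataKvCount);

public static class GgufHeaderReader
{
    // The ASCII bytes "GGUF" interpreted as a little-endian uint32.
    private const uint Magic = 0x46554747;

    public static GgufHeader Read(Stream stream)
    {
        // BinaryReader always reads little-endian, which matches the assumed layout.
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);

        if (reader.ReadUInt32() != Magic)
            throw new InvalidDataException("Not a GGUF file: bad magic.");

        var version = reader.ReadUInt32();
        if (version < 2)
            throw new NotSupportedException($"GGUF version {version} is not handled by this sketch.");

        var tensorCount = reader.ReadUInt64();
        var metadataKvCount = reader.ReadUInt64();
        return new GgufHeader(version, tensorCount, metadataKvCount);
    }
}
```

The metadata key-value pairs and tensor descriptors follow immediately after these fields; parsing them is omitted here to keep the sketch short.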
Contributing
This is an early-stage project and design feedback is welcome. Please read the vision and architecture documents before opening an issue — most "why not X?" questions have explicit answers there (especially around ONNX, NPU, and native wrapping).
Contribution areas most valuable right now:
- GGUF quantization format coverage
- Managed kernel optimization (intrinsics, vectorization)
- Tokenizer correctness across BPE variants
- Test fixtures for additional model families
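To give a sense of what quantization format coverage involves, here is a sketch of dequantizing one of the simpler GGUF tensor types, Q8_0. The layout assumed is the one used by ggml: blocks of 32 weights, each block stored as a little-endian fp16 scale followed by 32 signed 8-bit values, with each weight recovered as `scale * q`. `Q8_0Dequantizer` is a hypothetical name for this example; the K-quant formats are considerably more involved.

```csharp
using System;
using System.Buffers.Binary;

// Build one block by hand: scale 0.5 and quantized values -16..15.
var block = new byte[Q8_0Dequantizer.BytesPerBlock];
BinaryPrimitives.WriteHalfLittleEndian(block, (Half)0.5f);
for (var i = 0; i < Q8_0Dequantizer.BlockSize; i++)
    block[2 + i] = unchecked((byte)(sbyte)(i - 16));

var weights = new float[Q8_0Dequantizer.BlockSize];
Q8_0Dequantizer.Dequantize(block, weights);
Console.WriteLine($"{weights[0]} .. {weights[^1]}");

public static class Q8_0Dequantizer
{
    public const int BlockSize = 32;                  // weights per block
    public const int BytesPerBlock = 2 + BlockSize;   // fp16 scale + 32 int8 values

    public static void Dequantize(ReadOnlySpan<byte> blocks, Span<float> destination)
    {
        if (blocks.Length % BytesPerBlock != 0)
            throw new ArgumentException("Input is not a whole number of Q8_0 blocks.", nameof(blocks));

        var blockCount = blocks.Length / BytesPerBlock;
        if (destination.Length < blockCount * BlockSize)
            throw new ArgumentException("Destination is too small.", nameof(destination));

        for (var b = 0; b < blockCount; b++)
        {
            var source = blocks.Slice(b * BytesPerBlock, BytesPerBlock);
            var scale = (float)BinaryPrimitives.ReadHalfLittleEndian(source);
            var output = destination.Slice(b * BlockSize, BlockSize);

            // Each stored byte is a signed 8-bit integer; reinterpret it and apply the scale.
            for (var i = 0; i < BlockSize; i++)
                output[i] = scale * unchecked((sbyte)source[2 + i]);
        }
    }
}
```

A production kernel would normally fuse this step into the matrix multiplication instead of materializing floats, but a standalone reference like this is handy for building test fixtures.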
Guiding Principle
llmdot should aim to become the easiest way to run community GGUF models from .NET:
- one core package to get started
- one model format
- one programming model
Performance still matters, but friction reduction is the primary product advantage.
License
MIT. See LICENSE for details.
This package has no dependencies.
| Version | Downloads | Last Updated |
|---|---|---|
| 0.1.0 | 105 | 4/18/2026 |