ElBruno.LocalLLMs
Run local LLMs in .NET through IChatClient, the same interface you'd use for Azure OpenAI, Ollama, or any other provider. Powered by ONNX Runtime GenAI.
Features
- `IChatClient` implementation – seamless integration with Microsoft.Extensions.AI
- Automatic model download – models are fetched from HuggingFace on first use
- Zero friction – works out of the box with sensible defaults (Phi-3.5 mini)
- Multi-hardware – CPU, CUDA, and DirectML execution providers
- DI-friendly – register with `AddLocalLLMs()` in ASP.NET Core
- Streaming – token-by-token streaming via `GetStreamingResponseAsync`
- Multi-model – switch between Phi-3.5, Phi-4, Qwen2.5, Llama 3.2, and more
Installation
dotnet add package ElBruno.LocalLLMs
This works everywhere (CPU). To enable GPU acceleration, add one extra package:
# NVIDIA GPU (CUDA):
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.Cuda

# Any Windows GPU - AMD, Intel, NVIDIA (DirectML):
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML
The library defaults to `ExecutionProvider.Auto`: it tries GPU first and falls back to CPU automatically. No code changes needed.
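If you want to pin a specific execution provider instead of relying on the automatic fallback, you can set it on the options object. A minimal sketch, assuming the same `ExecutionProvider` option shown in the Dependency Injection section below is also honored when creating a client directly:

```csharp
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

// Explicitly request DirectML instead of ExecutionProvider.Auto.
// Only Auto and DirectML appear elsewhere in this README; other enum
// members (e.g. a CUDA value) are assumptions and may be named differently.
using var client = await LocalChatClient.CreateAsync(new LocalLLMsOptions
{
    Model = KnownModels.Phi35MiniInstruct,
    ExecutionProvider = ExecutionProvider.DirectML
});

var response = await client.GetResponseAsync([
    new(ChatRole.User, "Say hello from a GPU-accelerated local model!")
]);
Console.WriteLine(response.Text);
```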
Quick Start
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;
// Create a local chat client (downloads Phi-3.5 mini on first run)
using var client = await LocalChatClient.CreateAsync();
var response = await client.GetResponseAsync([
    new(ChatRole.User, "What is the capital of France?")
]);
Console.WriteLine(response.Text);
Streaming
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;
using var client = await LocalChatClient.CreateAsync(new LocalLLMsOptions
{
    Model = KnownModels.Phi35MiniInstruct
});

await foreach (var update in client.GetStreamingResponseAsync([
    new(ChatRole.System, "You are a helpful assistant."),
    new(ChatRole.User, "Explain quantum computing in simple terms.")
]))
{
    Console.Write(update.Text);
}
Dependency Injection
builder.Services.AddLocalLLMs(options =>
{
    options.Model = KnownModels.Phi35MiniInstruct;
    options.ExecutionProvider = ExecutionProvider.DirectML;
});

// Inject IChatClient anywhere
public class MyService(IChatClient chatClient) { ... }
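Once registered, the injected `IChatClient` works like any other Microsoft.Extensions.AI chat client. A minimal sketch of a hypothetical consuming service (the `SummaryService` class and its method are illustrative, not part of the library):

```csharp
using Microsoft.Extensions.AI;

// Hypothetical service that consumes the IChatClient registered by AddLocalLLMs().
public class SummaryService(IChatClient chatClient)
{
    public async Task<string> SummarizeAsync(string text, CancellationToken ct = default)
    {
        var response = await chatClient.GetResponseAsync(
        [
            new(ChatRole.System, "Summarize the user's text in one sentence."),
            new(ChatRole.User, text)
        ], cancellationToken: ct);

        return response.Text;
    }
}
```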
Supported Models
| Tier | Model | Parameters | ONNX | ID |
|---|---|---|---|---|
| Tiny | TinyLlama-1.1B-Chat | 1.1B | Native | tinyllama-1.1b-chat |
| Tiny | SmolLM2-1.7B-Instruct | 1.7B | Native | smollm2-1.7b-instruct |
| Tiny | Qwen2.5-0.5B-Instruct | 0.5B | Native | qwen2.5-0.5b-instruct |
| Tiny | Qwen2.5-1.5B-Instruct | 1.5B | Native | qwen2.5-1.5b-instruct |
| Tiny | Gemma-2B-IT | 2B | Native | gemma-2b-it |
| Tiny | StableLM-2-1.6B-Chat | 1.6B | Convert | stablelm-2-1.6b-chat |
| Small | Phi-3.5 mini instruct | 3.8B | Native | phi-3.5-mini-instruct |
| Small | Qwen2.5-3B-Instruct | 3B | Native | qwen2.5-3b-instruct |
| Small | Llama-3.2-3B-Instruct | 3B | Native | llama-3.2-3b-instruct |
| Small | Gemma-2-2B-IT | 2B | Native | gemma-2-2b-it |
| Medium | Qwen2.5-7B-Instruct | 7B | Native | qwen2.5-7b-instruct |
| Medium | Llama-3.1-8B-Instruct | 8B | Native | llama-3.1-8b-instruct |
| Medium | Mistral-7B-Instruct-v0.3 | 7B | Native | mistral-7b-instruct-v0.3 |
| Medium | Gemma-2-9B-IT | 9B | Native | gemma-2-9b-it |
| Medium | Phi-4 | 14B | Native | phi-4 |
| Medium | DeepSeek-R1-Distill-Qwen-14B | 14B | Native | deepseek-r1-distill-qwen-14b |
| Medium | Mistral-Small-24B-Instruct | 24B | Native | mistral-small-24b-instruct |
| Large | Qwen2.5-14B-Instruct | 14B | Native | qwen2.5-14b-instruct |
| Large | Qwen2.5-32B-Instruct | 32B | Native | qwen2.5-32b-instruct |
| Large | Llama-3.3-70B-Instruct | 70B | ONNX | llama-3.3-70b-instruct |
| Large | Mixtral-8x7B-Instruct-v0.1 | 8x7B | Convert | mixtral-8x7b-instruct-v0.1 |
| Large | DeepSeek-R1-Distill-Llama-70B | 70B | Convert | deepseek-r1-distill-llama-70b |
| Large | Command-R (35B) | 35B | Convert | command-r-35b |
See the Supported Models Guide for detailed model cards, performance benchmarks, and selection guidance.
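To run one of the other models from the table, pass a different model when creating the client (or in `AddLocalLLMs`). A minimal sketch, assuming `KnownModels` exposes a member per table entry; only `Phi35MiniInstruct` appears elsewhere in this README, so the `Qwen25_3BInstruct` name below is an assumption and may differ from the library's actual identifier:

```csharp
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

// Assumed KnownModels member for the qwen2.5-3b-instruct row; check the
// Supported Models Guide for the exact name exposed by the library.
using var client = await LocalChatClient.CreateAsync(new LocalLLMsOptions
{
    Model = KnownModels.Qwen25_3BInstruct
});

var response = await client.GetResponseAsync([
    new(ChatRole.User, "Give me three facts about ONNX Runtime.")
]);
Console.WriteLine(response.Text);
```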
Samples
| Sample | Description |
|---|---|
| HelloChat | Minimal console chat |
| StreamingChat | Token-by-token streaming |
| MultiModelChat | Switch models at runtime |
| DependencyInjection | ASP.NET Core DI registration |
Requirements
- .NET 8.0 or .NET 10.0
- CPU (default), NVIDIA GPU (CUDA), or Windows GPU (DirectML)
- ~2-8 GB disk space per model (depending on size and quantization)
Documentation
- Getting Started – installation, first steps, configuration
- Supported Models – full model reference with tiers, specs, decision tree
- Architecture – design decisions and internal structure
- Samples Guide – walkthrough of each sample application
- Benchmarks – how to run and interpret performance benchmarks
- ONNX Conversion – converting HuggingFace models to ONNX format
- Publishing – NuGet package publishing with OIDC
- Contributing – how to contribute
- Changelog – version history
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
License
This project is licensed under the MIT License; see the LICENSE file for details.
About the Author
Hi! I'm ElBruno 🧡, a passionate developer and content creator exploring AI, .NET, and modern development practices.
Made with ❤️ by ElBruno
If you like this project, consider following my work across platforms:
- Podcast: No Tienen Nombre – Spanish-language episodes on AI, development, and tech culture
- Blog: ElBruno.com – Deep dives on embeddings, RAG, .NET, and local AI
- YouTube: youtube.com/elbruno – Demos, tutorials, and live coding
- LinkedIn: @elbruno – Professional updates and insights
- Twitter: @elbruno – Quick tips, releases, and tech news
Frameworks

| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies

net10.0
- ElBruno.HuggingFace.Downloader (>= 0.6.0)
- Microsoft.Extensions.AI.Abstractions (>= 10.4.0)
- Microsoft.Extensions.DependencyInjection.Abstractions (>= 10.0.5)
- Microsoft.ML.OnnxRuntimeGenAI (>= 0.12.2)

net8.0
- ElBruno.HuggingFace.Downloader (>= 0.6.0)
- Microsoft.Extensions.AI.Abstractions (>= 10.4.0)
- Microsoft.Extensions.DependencyInjection.Abstractions (>= 10.0.5)
- Microsoft.ML.OnnxRuntimeGenAI (>= 0.12.2)