KernelSharp 1.0.1
See the version list below for details.
dotnet add package KernelSharp --version 1.0.1
NuGet\Install-Package KernelSharp -Version 1.0.1
<PackageReference Include="KernelSharp" Version="1.0.1" />
<PackageVersion Include="KernelSharp" Version="1.0.1" />
<PackageReference Include="KernelSharp" />
paket add KernelSharp --version 1.0.1
#r "nuget: KernelSharp, 1.0.1"
#:package KernelSharp@1.0.1
#addin nuget:?package=KernelSharp&version=1.0.1
#tool nuget:?package=KernelSharp&version=1.0.1
KernelSharp ⚡
Write CUDA kernels in C#. Compile at build time. Ship as a NuGet package.
KernelSharp is a .NET library that lets you write CUDA C/C++ kernels
directly inside your C# source files — no cmake, no separate .cu build system, no runtime
JIT overhead, no CUDA Runtime API boilerplate. Annotate a partial method with
[GpuKernel], write the kernel inline with a raw string literal, and by the time your
project finishes building the kernel is compiled, multi-arch, gzip-optionally-compressed,
and embedded directly in your assembly. Runtime dispatch happens through the CUDA Driver
API with zero managed allocations on the hot path.
Why KernelSharp?
| Feature | KernelSharp | Typical CUDA .NET wrapper |
|---|---|---|
| Kernel source lives next to its C# caller | ✅ inline raw string | ❌ separate .cu / .ptx file |
| Build-time nvcc compilation | ✅ MSBuild task, parallel | ❌ manual CMake / MSBuild targets |
| Multi-arch fatbin (Ampere, Ada, Hopper …) | ✅ automatic | ❌ per-arch manual flags |
| Strongly-typed device buffers | ✅ CudaBuffer<T> |
❌ raw IntPtr |
| Zero-config compiler auto-detection | ✅ nvcc + MSVC auto-discovered | ❌ path config required |
| NuGet-installable, no CUDA SDK at runtime | ✅ fatbin embedded in DLL | ❌ SDK / driver headers required |
| Parallel kernel compilation | ✅ all cores | ❌ N/A |
| Single NuGet package | ✅ runtime + build task in one | ❌ separate packages |
Quick Start
1 — Add the NuGet package
<PackageReference Include="KernelSharp" Version="1.0.0" />
A single package provides both the runtime (CudaBuffer<T>, CudaContext, …) and the
MSBuild task that invokes nvcc at build time.
2 — Write your first kernel
using KernelSharp;
public partial class MyKernels
{
[GpuKernel("""
extern "C" __global__ void AddVectors(
const float* __restrict__ a,
const float* __restrict__ b,
float* __restrict__ c,
int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) c[i] = a[i] + b[i];
}
""")]
public partial void AddVectors(CudaBuffer<float> a, CudaBuffer<float> b, CudaBuffer<float> c);
}
3 — Call it like any C# method
using var ctx = CudaContext.Initialize(); // initialises the CUDA Driver API
const int N = 1 << 20; // 1M elements
using var dA = CudaBuffer<float>.Allocate(N);
using var dB = CudaBuffer<float>.Allocate(N);
using var dC = CudaBuffer<float>.Allocate(N);
float[] hA = Enumerable.Range(0, N).Select(i => (float)i).ToArray();
float[] hB = Enumerable.Range(0, N).Select(i => (float)i * 2f).ToArray();
dA.CopyFromHost(hA);
dB.CopyFromHost(hB);
var kernels = new MyKernels();
kernels.AddVectors(dA, dB, dC); // generated launch wrapper
float[] result = new float[N];
dC.CopyToHost(result);
Console.WriteLine(result[0]); // 0.0 + 0.0 = 0.0 ✓
Multithreaded / parallel usage — CUDA Driver API contexts are thread-local. If you share a single
CudaContextacross threads (e.g. in a test suite with parallel test execution), callctx.MakeCurrent()at the start of each thread before issuing any GPU operations.CudaContext.Initialize()only makes the context current on the thread that called it.
No cuModuleLoad, no cuLaunchKernel, no kernel argument marshalling — the MSBuild
task writes all of that code for you.
The [GpuKernel] Attribute
Inline source (recommended)
Use a C# 11 raw string literal to embed CUDA C/C++ directly. No escaping needed:
[GpuKernel("""
extern "C" __global__ void ReLU(const float* x, float* y, int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) y[i] = fmaxf(x[i], 0.f);
}
""")]
public partial void ReLU(CudaBuffer<float> x, CudaBuffer<float> y);
External .cu file
Point to a file on disk relative to the declaring C# source file:
[GpuKernel(SourceFile = "Kernels/flash_attn.cu")]
public partial void FlashAttn(
CudaBuffer<float> q, CudaBuffer<float> k,
CudaBuffer<float> v, CudaBuffer<float> o,
int seqLen, int headDim);
Per-kernel overrides
[GpuKernel("""...""",
Arch = "compute_89", // single arch — faster debug builds
ExtraFlags = "-lineinfo -G", // add device debug info
IncludePath = "vendor/cutlass/include",
ThreadsPerBlock = 128, // override default 256-thread blocks
BlocksPerGrid = 4)] // or fix the block count entirely
public partial void MyKernel(CudaBuffer<float> a, CudaBuffer<float> b);
The ThreadsPerBlock / BlocksPerGrid properties control the cuLaunchKernel grid:
ThreadsPerBlock |
BlocksPerGrid |
Generated launch |
|---|---|---|
| 0 (default) | 0 (default) | threads=256, blocks=ceil(n/256) |
| T | 0 | threads=T, blocks=ceil(n/T) |
| 0 | B | threads=256, blocks=B |
| T | B | threads=T, blocks=B (fully fixed — e.g. single-block scans) |
Stub during development
[GpuKernel("""...""", NotImplemented = true)]
public partial void ExperimentalKernel(CudaBuffer<float> x);
// → throws NotImplementedException at runtime; nvcc is never invoked at build time
Strongly-Typed Device Buffers
CudaBuffer<T> is a typed wrapper around a CUDA device pointer. The element type is
fixed at declaration time so the compiler catches host/device type mismatches early:
// Allocation — element count, not byte count
using var weights = CudaBuffer<float>.Allocate(hiddenDim);
using var tokens = CudaBuffer<int>.Allocate(seqLen);
using var packed = CudaBuffer<byte>.Allocate(quantBytes);
// Host ↔ Device transfers accept arrays or Span<T>
weights.CopyFromHost(floatArray);
weights.CopyFromHost(spanOfFloat);
weights.CopyToHost(destination);
// Introspect without touching the GPU
long byteSize = weights.ByteSize; // elementCount * sizeof(float)
int count = weights.Length;
IntPtr ptr = weights.DevicePointer;
Non-float example — int4 dequantisation kernel:
[GpuKernel("""
extern "C" __global__ void DequantInt4(
const uint8_t* packed, const float* scales, float* output, int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= n) return;
uint8_t b = packed[i >> 1];
float v = (i & 1) ? (b >> 4) : (b & 0xF);
output[i] = (v - 8.f) * scales[i >> 128];
}
""")]
public partial void DequantInt4(
CudaBuffer<byte> packed,
CudaBuffer<float> scales,
CudaBuffer<float> output,
int n);
How Build-Time Compilation Works
dotnet build
│
├─ Roslyn compiles your C# code (including [GpuKernel] declarations)
│
└─ KernelSharp MSBuild task runs (BeforeTargets="CoreCompile")
│
├─ Scans all .cs files for [GpuKernel] on partial methods
├─ Extracts inline source or reads the referenced .cu file
├─ Classifies each parameter:
│ CudaBuffer<T> → Buffer → extract .DevicePointer
│ int/float/... → Scalar → pass value directly
├─ Spawns nvcc processes in parallel (all CPU cores by default)
│ one process per [GpuKernel] method
├─ Collects resulting fatbin bytes
├─ Optionally gzip-compresses the fatbin
└─ Emits MyClass.MyMethod.g.cs containing:
• static readonly byte[] _fatbin = { … };
• static IntPtr _module, _func;
• public partial void MyMethod(…) { … cuLaunchKernel(…) }
└─ Roslyn compiles the generated .g.cs files alongside your code
→ single assembly, zero external resources
The generated file includes a build-metadata comment showing the exact nvcc command line that produced the fatbin, compiler versions, and the date — making the build fully reproducible and auditable.
Incremental builds
The MSBuild task uses timestamp-based incremental compilation. If a kernel's source file hasn't changed since the last build, nvcc is not re-invoked. Cold builds (all kernels new) compile in parallel; warm builds (no changes) add essentially zero overhead.
Checking In Generated Files (CI without nvcc)
By default the generated .g.cs launcher files are written to $(IntermediateOutputPath)
and are not committed to source control. If you want build machines that don't have CUDA
installed to be able to compile your project, set KernelSharpGeneratedOutputPath to a
committed folder:
<PropertyGroup>
<KernelSharpGeneratedOutputPath>Generated\</KernelSharpGeneratedOutputPath>
</PropertyGroup>
When this property is set:
- Generated
.g.csfiles are written to (and read from) that folder instead ofobj/. - nvcc is still skipped when the generated file is newer than the source
.csfile. - Machines without nvcc can compile using the checked-in launchers.
Compiler Auto-Detection
KernelSharp finds your compilers automatically — no path configuration required for most setups. Configuration properties are available for unusual installations.
nvcc detection order
CUDA_PATHenvironment variable →$CUDA_PATH/bin/nvccCUDA_TOOLKIT_ROOT_DIRenvironment variable →$CUDA_TOOLKIT_ROOT_DIR/bin/nvccPATH— each entry is checked fornvcc/nvcc.exe- Windows —
%ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v*\bin\nvcc.exe
(all installed versions, newest first) - Linux —
/usr/local/cuda/bin/nvcc, then/usr/bin/nvcc
MSVC cl.exe detection order (Windows only)
nvcc requires a compatible host C++ compiler on Windows. KernelSharp finds it without needing Visual Studio to be open or any environment pre-activation:
KernelSharpMsvcClPathMSBuild property — explicit full path or directoryVCToolsInstallDirenvironment variable (set byvcvarsall.bat)- vswhere —
%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe
Queries the latest pre-release or stable VS installation, enumerates MSVC toolchain versions inside it (newest first) - Directory scan — walks
%ProgramFiles%\Microsoft Visual Studio\and%ProgramFiles(x86)%\Microsoft Visual Studio\, year directories newest-first, edition directories newest-first (Enterprise → Preview → Community …), MSVC toolchain versions newest-first PATH— last resort, checks each entry forcl.exe
On Linux, GCC is picked up by nvcc automatically; no host compiler detection is needed.
MSBuild Configuration
All settings have sensible defaults. Override only what you need:
<PropertyGroup>
<KernelSharpIncludePath>C:\libs\cuda\include</KernelSharpIncludePath>
<KernelSharpNvccStd>c++20</KernelSharpNvccStd>
<KernelSharpNvccExtraFlags>-lineinfo</KernelSharpNvccExtraFlags>
<KernelSharpMsvcClPath></KernelSharpMsvcClPath>
<KernelSharpTargetArchs>compute_75,compute_80,compute_89,compute_90,compute_100</KernelSharpTargetArchs>
<KernelSharpMaxParallelism>4</KernelSharpMaxParallelism>
<KernelSharpFatbinCompression>gzip</KernelSharpFatbinCompression>
<KernelSharpGeneratedOutputPath>Generated\</KernelSharpGeneratedOutputPath>
</PropertyGroup>
When installed as a NuGet package, build/KernelSharp.props is auto-imported and sets
all these defaults — no manual setup required.
Build Diagnostics
| Code | Severity | Meaning |
|---|---|---|
KERNELSHARP001 |
Warning | nvcc not found — kernel compilation skipped, kernels will fail at runtime |
KERNELSHARP002 |
Error | nvcc exited with a non-zero code — build fails with the nvcc error output |
KERNELSHARP003 |
Warning | Mismatch between the __global__ function name or parameter count in the CUDA source and the C# method declaration. The actual CUDA function name is still used for cuModuleGetFunction; this warning just flags the inconsistency so it can be fixed before it causes a runtime error. |
Real-World Kernel Examples
Inclusive prefix scan (single-block Hillis-Steele)
[GpuKernel("""
extern "C" __global__ void PrefixScan(
const float* __restrict__ x,
float* __restrict__ y,
int n)
{
// Single-block, shared-memory Hillis-Steele inclusive scan (≤256 elements).
__shared__ float smem[256];
int tid = threadIdx.x;
smem[tid] = (tid < n) ? x[tid] : 0.f;
__syncthreads();
for (int d = 1; d < blockDim.x; d <<= 1) {
float v = (tid >= d) ? smem[tid - d] : 0.f;
__syncthreads();
smem[tid] += v;
__syncthreads();
}
if (tid < n) y[tid] = smem[tid];
}
""", ThreadsPerBlock = 256, BlocksPerGrid = 1)] // ← fixed single-block launch
public partial void PrefixScan(CudaBuffer<float> x, CudaBuffer<float> y);
ThreadsPerBlock = 256, BlocksPerGrid = 1 tells the generator to emit a literal
_threads=256; _blocks=1; rather than the default auto-compute. This is required for
any single-block cooperative algorithm (scans, reductions that use __syncthreads
across the whole grid, etc.).
Attention scores (transformer self-attention)
[GpuKernel("""
extern "C" __global__ void AttnScores(
const float* __restrict__ q,
const float* __restrict__ k,
float* __restrict__ scores,
int seqLen, int headDim)
{
int row = blockIdx.x, col = threadIdx.x;
if (col >= seqLen) return;
float dot = 0.f;
for (int d = 0; d < headDim; d++)
dot += q[row * headDim + d] * k[col * headDim + d];
scores[row * seqLen + col] = dot * rsqrtf((float)headDim);
}
""")]
public partial void AttnScores(
CudaBuffer<float> q,
CudaBuffer<float> k,
CudaBuffer<float> scores,
int seqLen, int headDim);
RMS Normalisation (LLaMA / Mistral)
[GpuKernel("""
extern "C" __global__ void RMSNorm(
const float* x, const float* weight, float* y, int n, float eps)
{
float sum = 0.f;
for (int i = threadIdx.x; i < n; i += blockDim.x)
sum += x[i] * x[i];
__shared__ float shared;
if (threadIdx.x == 0) shared = rsqrtf(sum / n + eps);
__syncthreads();
for (int i = threadIdx.x; i < n; i += blockDim.x)
y[i] = x[i] * shared * weight[i];
}
""")]
public partial void RMSNorm(
CudaBuffer<float> x,
CudaBuffer<float> weight,
CudaBuffer<float> y,
int n, float eps);
Embedding lookup (token → hidden state)
[GpuKernel("""
extern "C" __global__ void EmbedLookup(
const int* tokenIds, const float* table, float* output,
int hiddenDim)
{
int tok = blockIdx.x, d = threadIdx.x;
if (d < hiddenDim)
output[tok * hiddenDim + d] = table[tokenIds[tok] * hiddenDim + d];
}
""")]
public partial void EmbedLookup(
CudaBuffer<int> tokenIds,
CudaBuffer<float> table,
CudaBuffer<float> output,
int hiddenDim);
Supported Platforms & Requirements
| Windows | Linux | |
|---|---|---|
| Host compiler | MSVC (VS 2019+, auto-detected) | GCC (picked up by nvcc) |
| nvcc version | CUDA 11.0+ | CUDA 11.0+ |
| .NET target | net8.0, net9.0, net10.0 | net8.0, net9.0, net10.0 |
| GPU architectures | sm_70 and newer | sm_70 and newer |
The CUDA Runtime API is not required at runtime. KernelSharp uses only the
CUDA Driver API (nvcuda.dll / libcuda.so), which ships with the display driver —
no CUDA SDK installation needed on end-user machines.
Package
| Package | Purpose |
|---|---|
KernelSharp |
Runtime + build task: CudaBuffer<T>, CudaContext, CudaStream, Driver API P/Invokes, and the MSBuild task that compiles CUDA kernels at build time |
A single package covers everything. No separate generator package is needed.
Implementation note: inside the package, the MSBuild task lives in
build/KernelSharp.Build.dll(a separate assembly from the runtimeKernelSharp.dll). This prevents the MSBuild host from lockingKernelSharp.dllduring builds, allowing incremental rebuilds of the library itself without DLL-lock errors.
Building from Source
Prerequisites
| Tool | Notes |
|---|---|
| .NET 10 SDK | Required |
| CUDA Toolkit (11.0+) | Required to compile .cu kernels; optional if you only work with NotImplemented stubs |
| MSVC (Visual Studio 2019+) | Windows only — required by nvcc as host compiler; auto-detected via vswhere |
Repository layout
KernelSharp.Build/ ← MSBuild task (CompileCudaKernelsTask)
packed into build/KernelSharp.Build.dll in the NuGet package
KernelSharp/ ← Runtime library (CudaBuffer<T>, CudaContext, …)
build/
KernelSharp.props ← auto-imported MSBuild properties
KernelSharp.targets ← UsingTask + KernelSharp_CompileCudaKernels target
KernelSharp.Samples/ ← example kernels
KernelSharp.Tests/ ← unit + integration tests (TUnit)
Build
dotnet build
Build order is KernelSharp.Build → KernelSharp → KernelSharp.Samples /
KernelSharp.Tests. MSBuild respects the ProjectReference dependency chain, so
KernelSharp.Build.dll is always ready before the task is needed.
Run tests
dotnet test
Tests that require a physical GPU are skipped automatically when no CUDA-capable
device is detected. The code-generation tests (SourceGeneratorTests) run without a
GPU.
Pack
dotnet pack KernelSharp/KernelSharp.csproj -c Release
The produced .nupkg contains:
lib/net10.0/KernelSharp.dll— runtime assemblybuild/KernelSharp.targets— MSBuild targetbuild/KernelSharp.props— MSBuild propertiesbuild/KernelSharp.Build.dll— MSBuild task assembly (loaded by MSBuild, never by end-user code)
License
MIT — see LICENSE.
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
-
net10.0
- No dependencies.
-
net8.0
- No dependencies.
-
net9.0
- No dependencies.
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.