KernelSharp 1.0.1

There is a newer version of this package available.
See the version list below for details.

dotnet add package KernelSharp --version 1.0.1

NuGet\Install-Package KernelSharp -Version 1.0.1

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="KernelSharp" Version="1.0.1" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

<PackageVersion Include="KernelSharp" Version="1.0.1" />
                    

                            Directory.Packages.props

<PackageReference Include="KernelSharp" />
                    

                            Project file

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.

paket add KernelSharp --version 1.0.1

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: KernelSharp, 1.0.1"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package KernelSharp@1.0.1

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

#addin nuget:?package=KernelSharp&version=1.0.1
                    

                            Install as a Cake Addin

#tool nuget:?package=KernelSharp&version=1.0.1
                    

                            Install as a Cake Tool

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

KernelSharp ⚡

Write CUDA kernels in C#. Compile at build time. Ship as a NuGet package.

KernelSharp is a .NET library that lets you write CUDA C/C++ kernels directly inside your C# source files — no cmake, no separate .cu build system, no runtime JIT overhead, no CUDA Runtime API boilerplate. Annotate a partial method with [GpuKernel], write the kernel inline with a raw string literal, and by the time your project finishes building the kernel is compiled, multi-arch, gzip-optionally-compressed, and embedded directly in your assembly. Runtime dispatch happens through the CUDA Driver API with zero managed allocations on the hot path.

Why KernelSharp?

Feature	KernelSharp	Typical CUDA .NET wrapper
Kernel source lives next to its C# caller	✅ inline raw string	❌ separate .cu / .ptx file
Build-time nvcc compilation	✅ MSBuild task, parallel	❌ manual CMake / MSBuild targets
Multi-arch fatbin (Ampere, Ada, Hopper …)	✅ automatic	❌ per-arch manual flags
Strongly-typed device buffers	✅ `CudaBuffer<T>`	❌ raw `IntPtr`
Zero-config compiler auto-detection	✅ nvcc + MSVC auto-discovered	❌ path config required
NuGet-installable, no CUDA SDK at runtime	✅ fatbin embedded in DLL	❌ SDK / driver headers required
Parallel kernel compilation	✅ all cores	❌ N/A
Single NuGet package	✅ runtime + build task in one	❌ separate packages

Quick Start

1 — Add the NuGet package

<PackageReference Include="KernelSharp" Version="1.0.0" />

A single package provides both the runtime (CudaBuffer<T>, CudaContext, …) and the MSBuild task that invokes nvcc at build time.

2 — Write your first kernel

using KernelSharp;

public partial class MyKernels
{
    [GpuKernel("""
        extern "C" __global__ void AddVectors(
            const float* __restrict__ a,
            const float* __restrict__ b,
            float*       __restrict__ c,
            int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) c[i] = a[i] + b[i];
        }
        """)]
    public partial void AddVectors(CudaBuffer<float> a, CudaBuffer<float> b, CudaBuffer<float> c);
}

3 — Call it like any C# method

using var ctx = CudaContext.Initialize();  // initialises the CUDA Driver API

const int N = 1 << 20;                    // 1M elements

using var dA = CudaBuffer<float>.Allocate(N);
using var dB = CudaBuffer<float>.Allocate(N);
using var dC = CudaBuffer<float>.Allocate(N);

float[] hA = Enumerable.Range(0, N).Select(i => (float)i).ToArray();
float[] hB = Enumerable.Range(0, N).Select(i => (float)i * 2f).ToArray();

dA.CopyFromHost(hA);
dB.CopyFromHost(hB);

var kernels = new MyKernels();
kernels.AddVectors(dA, dB, dC);   // generated launch wrapper

float[] result = new float[N];
dC.CopyToHost(result);
Console.WriteLine(result[0]);   // 0.0 + 0.0 = 0.0 ✓

Multithreaded / parallel usage — CUDA Driver API contexts are thread-local. If you share a single CudaContext across threads (e.g. in a test suite with parallel test execution), call ctx.MakeCurrent() at the start of each thread before issuing any GPU operations. CudaContext.Initialize() only makes the context current on the thread that called it.

No cuModuleLoad, no cuLaunchKernel, no kernel argument marshalling — the MSBuild task writes all of that code for you.

The `[GpuKernel]` Attribute

Inline source (recommended)

Use a C# 11 raw string literal to embed CUDA C/C++ directly. No escaping needed:

[GpuKernel("""
    extern "C" __global__ void ReLU(const float* x, float* y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = fmaxf(x[i], 0.f);
    }
    """)]
public partial void ReLU(CudaBuffer<float> x, CudaBuffer<float> y);

External `.cu` file

Point to a file on disk relative to the declaring C# source file:

[GpuKernel(SourceFile = "Kernels/flash_attn.cu")]
public partial void FlashAttn(
    CudaBuffer<float> q, CudaBuffer<float> k,
    CudaBuffer<float> v, CudaBuffer<float> o,
    int seqLen, int headDim);

Per-kernel overrides

[GpuKernel("""...""",
    Arch           = "compute_89",          // single arch — faster debug builds
    ExtraFlags     = "-lineinfo -G",        // add device debug info
    IncludePath    = "vendor/cutlass/include",
    ThreadsPerBlock = 128,                  // override default 256-thread blocks
    BlocksPerGrid  = 4)]                   // or fix the block count entirely
public partial void MyKernel(CudaBuffer<float> a, CudaBuffer<float> b);

The ThreadsPerBlock / BlocksPerGrid properties control the cuLaunchKernel grid:

`ThreadsPerBlock`	`BlocksPerGrid`	Generated launch
0 (default)	0 (default)	`threads=256, blocks=ceil(n/256)`
T	0	`threads=T, blocks=ceil(n/T)`
0	B	`threads=256, blocks=B`
T	B	`threads=T, blocks=B` (fully fixed — e.g. single-block scans)

Stub during development

[GpuKernel("""...""", NotImplemented = true)]
public partial void ExperimentalKernel(CudaBuffer<float> x);
// → throws NotImplementedException at runtime; nvcc is never invoked at build time

Strongly-Typed Device Buffers

CudaBuffer<T> is a typed wrapper around a CUDA device pointer. The element type is fixed at declaration time so the compiler catches host/device type mismatches early:

// Allocation — element count, not byte count
using var weights = CudaBuffer<float>.Allocate(hiddenDim);
using var tokens  = CudaBuffer<int>.Allocate(seqLen);
using var packed  = CudaBuffer<byte>.Allocate(quantBytes);

// Host ↔ Device transfers accept arrays or Span<T>
weights.CopyFromHost(floatArray);
weights.CopyFromHost(spanOfFloat);
weights.CopyToHost(destination);

// Introspect without touching the GPU
long byteSize = weights.ByteSize;   // elementCount * sizeof(float)
int  count    = weights.Length;
IntPtr ptr    = weights.DevicePointer;

Non-float example — int4 dequantisation kernel:

[GpuKernel("""
    extern "C" __global__ void DequantInt4(
        const uint8_t* packed, const float* scales, float* output, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        uint8_t b = packed[i >> 1];
        float   v = (i & 1) ? (b >> 4) : (b & 0xF);
        output[i] = (v - 8.f) * scales[i >> 128];
    }
    """)]
public partial void DequantInt4(
    CudaBuffer<byte>  packed,
    CudaBuffer<float> scales,
    CudaBuffer<float> output,
    int n);

How Build-Time Compilation Works

dotnet build
    │
    ├─ Roslyn compiles your C# code (including [GpuKernel] declarations)
    │
    └─ KernelSharp MSBuild task runs (BeforeTargets="CoreCompile")
           │
           ├─ Scans all .cs files for [GpuKernel] on partial methods
           ├─ Extracts inline source or reads the referenced .cu file
           ├─ Classifies each parameter:
           │     CudaBuffer<T>  → Buffer  → extract .DevicePointer
           │     int/float/...  → Scalar  → pass value directly
           ├─ Spawns nvcc processes in parallel (all CPU cores by default)
           │     one process per [GpuKernel] method
           ├─ Collects resulting fatbin bytes
           ├─ Optionally gzip-compresses the fatbin
           └─ Emits  MyClass.MyMethod.g.cs  containing:
                  • static readonly byte[] _fatbin = { … };
                  • static IntPtr _module, _func;
                  • public partial void MyMethod(…) { … cuLaunchKernel(…) }

    └─ Roslyn compiles the generated .g.cs files alongside your code
           → single assembly, zero external resources

The generated file includes a build-metadata comment showing the exact nvcc command line that produced the fatbin, compiler versions, and the date — making the build fully reproducible and auditable.

Incremental builds

The MSBuild task uses timestamp-based incremental compilation. If a kernel's source file hasn't changed since the last build, nvcc is not re-invoked. Cold builds (all kernels new) compile in parallel; warm builds (no changes) add essentially zero overhead.

Checking In Generated Files (CI without nvcc)

By default the generated .g.cs launcher files are written to $(IntermediateOutputPath) and are not committed to source control. If you want build machines that don't have CUDA installed to be able to compile your project, set KernelSharpGeneratedOutputPath to a committed folder:

<PropertyGroup>
  
  <KernelSharpGeneratedOutputPath>Generated\</KernelSharpGeneratedOutputPath>
</PropertyGroup>

When this property is set:

Generated .g.cs files are written to (and read from) that folder instead of obj/.
nvcc is still skipped when the generated file is newer than the source .cs file.
Machines without nvcc can compile using the checked-in launchers.

Compiler Auto-Detection

KernelSharp finds your compilers automatically — no path configuration required for most setups. Configuration properties are available for unusual installations.

nvcc detection order

CUDA_PATH environment variable → $CUDA_PATH/bin/nvcc
CUDA_TOOLKIT_ROOT_DIR environment variable → $CUDA_TOOLKIT_ROOT_DIR/bin/nvcc
PATH — each entry is checked for nvcc / nvcc.exe
Windows — %ProgramFiles%\NVIDIA GPU Computing Toolkit\CUDA\v*\bin\nvcc.exe
(all installed versions, newest first)
Linux — /usr/local/cuda/bin/nvcc, then /usr/bin/nvcc

MSVC `cl.exe` detection order (Windows only)

nvcc requires a compatible host C++ compiler on Windows. KernelSharp finds it without needing Visual Studio to be open or any environment pre-activation:

KernelSharpMsvcClPath MSBuild property — explicit full path or directory
VCToolsInstallDir environment variable (set by vcvarsall.bat)
vswhere — %ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe
Queries the latest pre-release or stable VS installation, enumerates MSVC toolchain versions inside it (newest first)
Directory scan — walks %ProgramFiles%\Microsoft Visual Studio\ and %ProgramFiles(x86)%\Microsoft Visual Studio\, year directories newest-first, edition directories newest-first (Enterprise → Preview → Community …), MSVC toolchain versions newest-first
PATH — last resort, checks each entry for cl.exe

On Linux, GCC is picked up by nvcc automatically; no host compiler detection is needed.

MSBuild Configuration

All settings have sensible defaults. Override only what you need:


<PropertyGroup>
  
  <KernelSharpIncludePath>C:\libs\cuda\include</KernelSharpIncludePath>

  
  <KernelSharpNvccStd>c++20</KernelSharpNvccStd>

  
  <KernelSharpNvccExtraFlags>-lineinfo</KernelSharpNvccExtraFlags>

  
  <KernelSharpMsvcClPath></KernelSharpMsvcClPath>

  
  <KernelSharpTargetArchs>compute_75,compute_80,compute_89,compute_90,compute_100</KernelSharpTargetArchs>

  
  <KernelSharpMaxParallelism>4</KernelSharpMaxParallelism>

  
  <KernelSharpFatbinCompression>gzip</KernelSharpFatbinCompression>

  
  <KernelSharpGeneratedOutputPath>Generated\</KernelSharpGeneratedOutputPath>
</PropertyGroup>

When installed as a NuGet package, build/KernelSharp.props is auto-imported and sets all these defaults — no manual setup required.

Build Diagnostics

Code	Severity	Meaning
`KERNELSHARP001`	Warning	nvcc not found — kernel compilation skipped, kernels will fail at runtime
`KERNELSHARP002`	Error	nvcc exited with a non-zero code — build fails with the nvcc error output
`KERNELSHARP003`	Warning	Mismatch between the `__global__` function name or parameter count in the CUDA source and the C# method declaration. The actual CUDA function name is still used for `cuModuleGetFunction`; this warning just flags the inconsistency so it can be fixed before it causes a runtime error.

Real-World Kernel Examples

Inclusive prefix scan (single-block Hillis-Steele)

[GpuKernel("""
    extern "C" __global__ void PrefixScan(
        const float* __restrict__ x,
        float*       __restrict__ y,
        int n)
    {
        // Single-block, shared-memory Hillis-Steele inclusive scan (≤256 elements).
        __shared__ float smem[256];
        int tid = threadIdx.x;
        smem[tid] = (tid < n) ? x[tid] : 0.f;
        __syncthreads();
        for (int d = 1; d < blockDim.x; d <<= 1) {
            float v = (tid >= d) ? smem[tid - d] : 0.f;
            __syncthreads();
            smem[tid] += v;
            __syncthreads();
        }
        if (tid < n) y[tid] = smem[tid];
    }
    """, ThreadsPerBlock = 256, BlocksPerGrid = 1)]   // ← fixed single-block launch
public partial void PrefixScan(CudaBuffer<float> x, CudaBuffer<float> y);

ThreadsPerBlock = 256, BlocksPerGrid = 1 tells the generator to emit a literal _threads=256; _blocks=1; rather than the default auto-compute. This is required for any single-block cooperative algorithm (scans, reductions that use __syncthreads across the whole grid, etc.).

Attention scores (transformer self-attention)

[GpuKernel("""
    extern "C" __global__ void AttnScores(
        const float* __restrict__ q,
        const float* __restrict__ k,
        float*       __restrict__ scores,
        int seqLen, int headDim)
    {
        int row = blockIdx.x, col = threadIdx.x;
        if (col >= seqLen) return;
        float dot = 0.f;
        for (int d = 0; d < headDim; d++)
            dot += q[row * headDim + d] * k[col * headDim + d];
        scores[row * seqLen + col] = dot * rsqrtf((float)headDim);
    }
    """)]
public partial void AttnScores(
    CudaBuffer<float> q,
    CudaBuffer<float> k,
    CudaBuffer<float> scores,
    int seqLen, int headDim);

RMS Normalisation (LLaMA / Mistral)

[GpuKernel("""
    extern "C" __global__ void RMSNorm(
        const float* x, const float* weight, float* y, int n, float eps)
    {
        float sum = 0.f;
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            sum += x[i] * x[i];
        __shared__ float shared;
        if (threadIdx.x == 0) shared = rsqrtf(sum / n + eps);
        __syncthreads();
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            y[i] = x[i] * shared * weight[i];
    }
    """)]
public partial void RMSNorm(
    CudaBuffer<float> x,
    CudaBuffer<float> weight,
    CudaBuffer<float> y,
    int n, float eps);

Embedding lookup (token → hidden state)

[GpuKernel("""
    extern "C" __global__ void EmbedLookup(
        const int* tokenIds, const float* table, float* output,
        int hiddenDim)
    {
        int tok = blockIdx.x, d = threadIdx.x;
        if (d < hiddenDim)
            output[tok * hiddenDim + d] = table[tokenIds[tok] * hiddenDim + d];
    }
    """)]
public partial void EmbedLookup(
    CudaBuffer<int>   tokenIds,
    CudaBuffer<float> table,
    CudaBuffer<float> output,
    int hiddenDim);

Supported Platforms & Requirements

	Windows	Linux
Host compiler	MSVC (VS 2019+, auto-detected)	GCC (picked up by nvcc)
nvcc version	CUDA 11.0+	CUDA 11.0+
.NET target	net8.0, net9.0, net10.0	net8.0, net9.0, net10.0
GPU architectures	sm_70 and newer	sm_70 and newer

The CUDA Runtime API is not required at runtime. KernelSharp uses only the CUDA Driver API (nvcuda.dll / libcuda.so), which ships with the display driver — no CUDA SDK installation needed on end-user machines.

Package

Package	Purpose
`KernelSharp`	Runtime + build task: `CudaBuffer<T>`, `CudaContext`, `CudaStream`, Driver API P/Invokes, and the MSBuild task that compiles CUDA kernels at build time

A single package covers everything. No separate generator package is needed.

Implementation note: inside the package, the MSBuild task lives in build/KernelSharp.Build.dll (a separate assembly from the runtime KernelSharp.dll). This prevents the MSBuild host from locking KernelSharp.dll during builds, allowing incremental rebuilds of the library itself without DLL-lock errors.

Building from Source

Prerequisites

Tool	Notes
.NET 10 SDK	Required
CUDA Toolkit (11.0+)	Required to compile `.cu` kernels; optional if you only work with `NotImplemented` stubs
MSVC (Visual Studio 2019+)	Windows only — required by nvcc as host compiler; auto-detected via `vswhere`

Repository layout

KernelSharp.Build/        ← MSBuild task (CompileCudaKernelsTask)
                            packed into build/KernelSharp.Build.dll in the NuGet package
KernelSharp/              ← Runtime library (CudaBuffer<T>, CudaContext, …)
  build/
    KernelSharp.props     ← auto-imported MSBuild properties
    KernelSharp.targets   ← UsingTask + KernelSharp_CompileCudaKernels target
KernelSharp.Samples/      ← example kernels
KernelSharp.Tests/        ← unit + integration tests (TUnit)

Build

dotnet build

Build order is KernelSharp.Build → KernelSharp → KernelSharp.Samples / KernelSharp.Tests. MSBuild respects the ProjectReference dependency chain, so KernelSharp.Build.dll is always ready before the task is needed.

Run tests

dotnet test

Tests that require a physical GPU are skipped automatically when no CUDA-capable device is detected. The code-generation tests (SourceGeneratorTests) run without a GPU.

Pack

dotnet pack KernelSharp/KernelSharp.csproj -c Release

The produced .nupkg contains:

lib/net10.0/KernelSharp.dll — runtime assembly
build/KernelSharp.targets — MSBuild target
build/KernelSharp.props — MSBuild properties
build/KernelSharp.Build.dll — MSBuild task assembly (loaded by MSBuild, never by end-user code)

License

MIT — see LICENSE.

Product	Compatible and additional computed target framework versions.
.NET	net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

Product

.NET

Compatible target framework(s)

Included target framework(s) (in package)

Learn more about Target Frameworks and .NET Standard.

net10.0
- No dependencies.
net8.0
- No dependencies.
net9.0
- No dependencies.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
1.1.0	93	5/14/2026
1.0.1	94	5/12/2026
1.0.0	90	5/12/2026