Caveman 1.0.3

.NET 8.0

🦴 Caveman — Prompt Compressor for LLMs

Caveman is a self-contained C# library that drastically reduces the number of tokens in your LLM prompts (Gemma 3, Llama, GPT-4, …). It strips grammatical "noise" (articles, prepositions, conjunctions, auxiliaries) and normalises inflected words to their base form, keeping the semantic payload intact.

"Why use many tokens when few tokens do trick?" — A caveman (and your wallet).

It is inspired by the token-saving idea behind the Caveman plugin for Claude, but it is an independent implementation written from scratch — no porting and no runtime NLP-model dependency.

✨ Highlights

Up to 70% token reduction — slash API costs and speed up local inference.
50+ languages out of the box — language data is embedded in the assembly; nothing to download at runtime.
No heavy NLP runtime — pure lookup + heuristics over per-language word data. The only package dependency is Microsoft.SemanticKernel (for the optional plugins).
Three compression levels — Light, Semantic, Aggressive.
Fast language detection — a streaming parser reads only the stop-word section of each language to identify the input.
Batch & custom filters — CompressBatchAsync() and user-defined POS-style filters.
Semantic Kernel plugins + a suite of developer services (commit/review/stats/safety/wiki).

🛠️ Installation

dotnet add package Caveman

That's it — all language data ships inside the package. There are no models to install.

Quick start

using caveman.core;

var compressor = new CavemanCompressionService();
string input = "I would like to know if it is possible to receive information about cheap restaurants in Rome.";

var result = await compressor.CompressAsync(input, CavemanCompressionLevel.Semantic);

Console.WriteLine($"Compressed: {result.CompressedText}");
Console.WriteLine($"Efficiency: {result.EfficiencyPercentage:F1}%");
Console.WriteLine($"🌿 Energy saved: {result.EstimatedEnergySavedMWh:F3} mWh");

The input language is detected automatically; you can also call ApplyCompression(text, iso3, level) to force a specific language (ISO 639-3 code).

🌐 Language detection (standalone)

You don't need to compress anything to use Caveman's language detector — it works on its own across all 50+ supported languages, with no model download:

var caveman = new CavemanCompressionService();

string iso3 = caveman.DetectLanguage("Vorrei un tavolo per due persone, per favore.");
// -> "ita"

// or get confidence scores per language (ISO 639-3 -> ratio of matched stop words)
var scores = caveman.DetectLanguageScores("Where is the nearest train station?");
// -> { "eng": 0.42, ... }

The detector is also usable directly via CavemanLanguageDetector if you don't want the compression service:

var detector = new CavemanLanguageDetector();
string iso3 = detector.Detect("Ich hätte gerne einen Kaffee.");   // -> "deu"

Detection is backed by a tiny embedded stop-word index, so it stays fast even though it scores every supported language.
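To make the scoring idea concrete, here is a minimal, self-contained sketch of stop-word-ratio detection. It is not Caveman's actual implementation: the three truncated stop-word lists and the helper names ScoreLanguages and DetectIso3 are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, heavily truncated stop-word lists (illustration only).
var stopWords = new Dictionary<string, HashSet<string>>
{
    ["eng"] = new(StringComparer.OrdinalIgnoreCase) { "the", "is", "where", "a", "of", "to" },
    ["ita"] = new(StringComparer.OrdinalIgnoreCase) { "un", "per", "il", "di", "e", "la" },
    ["deu"] = new(StringComparer.OrdinalIgnoreCase) { "ich", "einen", "der", "die", "und", "das" },
};

// Score each language as (tokens that are stop words) / (total tokens).
Dictionary<string, double> ScoreLanguages(string text)
{
    var tokens = text.Split(new[] { ' ', ',', '.', '?', '!', ';', ':' },
                            StringSplitOptions.RemoveEmptyEntries);
    return stopWords.ToDictionary(
        kv => kv.Key,
        kv => tokens.Length == 0
            ? 0.0
            : (double)tokens.Count(t => kv.Value.Contains(t)) / tokens.Length);
}

// Pick the language with the highest stop-word ratio.
string DetectIso3(string text) =>
    ScoreLanguages(text).OrderByDescending(kv => kv.Value).First().Key;

Console.WriteLine(DetectIso3("Where is the nearest train station?"));            // eng
Console.WriteLine(DetectIso3("Vorrei un tavolo per due persone, per favore."));  // ita
```

Because only short stop-word sets are consulted, every language can be scored without touching any larger per-language data.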

📊 Compression levels

Level	Applied logic	What is kept	Typical savings
Light	Stop-word removal	Everything except function words & punctuation	~25–30%
Semantic	Content selection + lemmatization	Content words, normalised to their base form	~50%
Aggressive	Lemmatization + generic/descriptive pruning	Core nouns/verbs in base form	~70%

Example

State	Prompt	Size
Original	"I would like to know if it is possible to have a margherita pizza immediately."	100%
Light	"like know possible have margherita pizza immediately"	~70%
Semantic	"know possible have margherita pizza immediately"	~55%
Aggressive	"know possible margherita pizza"	~40%

🌍 How it works

Caveman does not load any NLP model at runtime. Each language is described by a worddata/<iso3>.yaml source file with four sections:

function_words — stop words, used both for compression and for language detection.
lemmas — inflected form → base form map (e.g. studying → study, gatti → gatto).
verbs — base verb → [conjugated forms]; folded into the lemma map at load time so every conjugation collapses to its base.
proper_nouns — a name gazetteer; capitalized tokens in it are kept verbatim (so names like Termini or München are never compressed).
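The way the verbs section is folded into the lemma map could look roughly like the sketch below. The sample entries, the Lemmatize helper, and the choice to let an explicit lemma entry win on conflict are all assumptions made for illustration, not details confirmed by the library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sample entries mirroring the lemmas and verbs sections.
var lemmas = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    ["studying"] = "study",
    ["gatti"] = "gatto",
};
var verbs = new Dictionary<string, string[]>
{
    ["go"] = new[] { "goes", "went", "gone", "going" },
};

// Fold "base verb -> [conjugated forms]" into the "form -> base" map.
foreach (var (baseForm, forms) in verbs)
    foreach (var form in forms)
        lemmas.TryAdd(form, baseForm); // assumption: an existing lemma entry wins on conflict

// Unknown tokens pass through unchanged.
string Lemmatize(string token) =>
    lemmas.TryGetValue(token, out var lemma) ? lemma : token;

Console.WriteLine(Lemmatize("went"));     // go
Console.WriteLine(Lemmatize("studying")); // study
Console.WriteLine(Lemmatize("pizza"));    // pizza
```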

For shipping, these YAML sources are compiled (by scripts/compile-worddata) into compact embedded artifacts, and a custom streaming parser keeps loading fast:

Detection reads a tiny brotli-compressed index (_index.br) holding only the stop words of every language, and scores the input by stop-word frequency — the large per-language data is never touched.
Compression then loads the one detected language from its brotli blob (<iso3>.yaml.br), decompresses + parses it once, caches it, and applies the selected level.

This keeps the assembly small (~13 MB instead of ~68 MB of raw text) while loading only the language actually used.
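The "decompress once, then cache" pattern described above can be sketched with the standard BrotliStream class. This is an illustration under stated assumptions: the in-memory blobs dictionary stands in for the embedded resources, and BrotliCompress, LoadLanguage, and the payload text are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

// Helper that produces a brotli blob, standing in for a compiled <iso3>.yaml.br artifact.
byte[] BrotliCompress(string text)
{
    using var output = new MemoryStream();
    using (var brotli = new BrotliStream(output, CompressionLevel.Optimal))
    {
        var bytes = Encoding.UTF8.GetBytes(text);
        brotli.Write(bytes, 0, bytes.Length);
    }
    return output.ToArray();
}

// Hypothetical stand-in for the embedded resources.
var blobs = new Dictionary<string, byte[]>
{
    ["ita"] = BrotliCompress("function_words: [un, per, il]"),
};

var cache = new ConcurrentDictionary<string, Lazy<string>>();
int decompressions = 0;

// Decompress a language blob on first use; later calls reuse the cached result.
string LoadLanguage(string iso3) =>
    cache.GetOrAdd(iso3, key => new Lazy<string>(() =>
    {
        decompressions++;
        using var input = new MemoryStream(blobs[key]);
        using var brotli = new BrotliStream(input, CompressionMode.Decompress);
        using var reader = new StreamReader(brotli, Encoding.UTF8);
        return reader.ReadToEnd();
    })).Value;

Console.WriteLine(LoadLanguage("ita")); // function_words: [un, per, il]
LoadLanguage("ita");                    // served from the cache
Console.WriteLine(decompressions);      // 1
```

Wrapping the value in Lazy<string> means that even concurrent first requests for the same language trigger a single decompression.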

Function words are dropped by their surface form before lemmatization, so a noisy lemma can never reinject a stop word.
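The ordering of the steps (gazetteer check, stop-word removal by surface form, then lemma lookup) can be sketched as follows. The word lists are tiny hypothetical samples and CompressSketch is an illustrative helper, not the library's compression routine.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sample data; the shipped word lists are far larger.
var functionWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { "i", "would", "to", "if", "it", "is", "the", "in", "about", "are", "there" };
var lemmas = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
    { ["restaurants"] = "restaurant", ["is"] = "be", ["are"] = "be" };
var properNouns = new HashSet<string>(StringComparer.Ordinal) { "Rome", "Termini" };

string CompressSketch(string text)
{
    var kept = new List<string>();
    foreach (var raw in text.Split(' ', StringSplitOptions.RemoveEmptyEntries))
    {
        var token = raw.Trim('.', ',', '?', '!', ';', ':');
        if (token.Length == 0) continue;

        // 1. Capitalized tokens found in the gazetteer are kept verbatim.
        if (char.IsUpper(token[0]) && properNouns.Contains(token))
        {
            kept.Add(token);
            continue;
        }

        // 2. Function words are dropped by surface form, before any lemma lookup.
        if (functionWords.Contains(token)) continue;

        // 3. Remaining tokens are normalised to their base form when a lemma is known.
        kept.Add(lemmas.TryGetValue(token, out var lemma) ? lemma : token.ToLowerInvariant());
    }
    return string.Join(" ", kept);
}

Console.WriteLine(CompressSketch("Are there cheap restaurants in Rome?")); // cheap restaurant Rome
```

In this sample, "Are" is removed at step 2, so its lemma entry ("be") is never consulted and cannot appear in the output.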

Language data & provenance

The lemmas and verbs data are generated from the Universal Dependencies treebanks via scripts/import-ud-lemmas. Languages with little inflection (Chinese, Vietnamese, Thai, …) intentionally carry few or no lemma entries. See NOTICE for per-language attribution.

🚀 Batch compression & custom filters

Batch — compress many prompts in one call:

string[] prompts =
{
    "I would like to know about cheap restaurants in Rome.",
    "Tell me how to get to the Colosseum from Termini station."
};

var results = await compressor.CompressBatchAsync(prompts, CavemanCompressionLevel.Semantic);
foreach (var r in results)
    Console.WriteLine($"{r.CompressedText}  (error: {r.ErrorMessage ?? "none"})");

Custom filters — override the default rules:

var filter = new CompressionFilter
{
    KeepOnly = new HashSet<string> { "CONTENT", "PROPN" },        // keep content words & proper nouns
    CustomPredicate = token => token.Length > 2                    // skip very short tokens
};

var result = await compressor.CompressAsync(input, CavemanCompressionLevel.None, filter);

You can also blacklist categories with Remove (e.g. "FUNC", "PUNCT").

🌿 Sustainability

Every token processed by an LLM has an energy cost. Caveman exposes a built-in estimator:

Energy saved: ~0.005 mWh (5 µWh) per saved token.
CO₂ avoided: ~0.4 mg per mWh saved.

Compressing a prompt from 1000 to 400 tokens saves 600 tokens, which works out to ~3 mWh (600 × 0.005) and ~1.2 mg CO₂ avoided (3 × 0.4). At scale, that adds up.
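The arithmetic can be reproduced with a few lines of C#. The constants come from the figures quoted above; the EstimateSavings helper and its tuple shape are hypothetical and are not the library's estimator API.

```csharp
using System;
using System.Globalization;

// Constants taken from the figures quoted above.
const double MilliwattHoursPerToken = 0.005; // ~5 µWh per saved token
const double Co2MgPerMilliwattHour = 0.4;    // ~0.4 mg CO2 per mWh saved

(double EnergyMilliwattHours, double Co2Mg) EstimateSavings(int originalTokens, int compressedTokens)
{
    int savedTokens = Math.Max(0, originalTokens - compressedTokens);
    double energySaved = savedTokens * MilliwattHoursPerToken;
    return (energySaved, energySaved * Co2MgPerMilliwattHour);
}

var (savedEnergy, savedCo2) = EstimateSavings(1000, 400);
Console.WriteLine(savedEnergy.ToString("F1", CultureInfo.InvariantCulture) + " mWh"); // 3.0 mWh
Console.WriteLine(savedCo2.ToString("F1", CultureInfo.InvariantCulture) + " mg CO2"); // 1.2 mg CO2
```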

🔌 Semantic Kernel integration

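using Microsoft.SemanticKernel;

var builder = Kernel.CreateBuilder();
// Register under an explicit name: without one, the plugin name defaults to the type name.
builder.Plugins.AddFromType<TokenOptimizerPlugin>("TokenOptimizer");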
var kernel = builder.Build();

var result = await kernel.InvokeAsync<CompressionResult>("TokenOptimizer", "OptimizePrompt", new()
{
    ["input"] = "I would like to know if it's possible to get pizza near Rome.",
    ["level"] = 2  // Semantic
});

TokenOptimizerPlugin — prompt compression as a kernel function.
CavemanWikiPlugin — on-demand, token-optimized project documentation (generate_project_wiki, get_project_summary, detect_project_type).
CavemanServicesPlugin — exposes the developer services below.

🦴 Caveman services (developer toolkit)

Service	What it does
CavemanContextCompressor	Compresses context files (CLAUDE.md, notes) into caveman-speak.
CavemanCommitGenerator	Conventional commit messages from a git diff, under 50 chars.
CavemanReviewService	Single-line PR review comments from a diff.
CavemanStatsTracker	Tracks token & cost savings across sessions (persists to %LOCALAPPDATA%/Caveman).
CavemanSafetyGuard	Auto-disables compression for security-critical/destructive content.
CavecrewService	Micro-agents: investigator / builder / reviewer.
CavemanWiki	AI-friendly, semantically compressed project documentation.

var wiki = new CavemanWiki();
string context = await wiki.GenerateAsync(@"C:\Dev\MyProject");
await File.WriteAllTextAsync("AI_CONTEXT.md", context);

📄 License & attribution

Caveman is released under the Caveman License — the MIT License plus one mandatory condition:

Any use of this library must clearly and visibly disclose that it uses "Caveman" by Passaro Francesco Paolo (Digitalsolutions.it).

A disclosure such as the following, in your docs, an About/credits screen, or your repository, satisfies the requirement:

Powered by Caveman — © Passaro Francesco Paolo, Digitalsolutions.it (https://digitalsolutions.it)

See LICENSE for the full terms.

Bundled language data under worddata/ is derived from the Universal Dependencies treebanks and is provided under their respective licenses (predominantly CC BY-SA / CC BY), not under the Caveman software license. See NOTICE for attribution and source treebanks.

🤝 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

To regenerate the language data from Universal Dependencies and recompile the embedded artifacts:

# 1. import lemmas / verbs / proper nouns into worddata/*.yaml (the source)
dotnet run --project scripts/import-ud-lemmas -- --all     # all languages
dotnet run --project scripts/import-ud-lemmas -- ita fra   # specific languages

# 2. compile worddata/*.yaml -> worddata/*.yaml.br + worddata/_index.br (embedded)
dotnet run --project scripts/compile-worddata

# 3. rebuild the package so it embeds the fresh artifacts
dotnet pack caveman.core.csproj -c Release

Product	Compatible and additional computed target framework versions.
.NET	net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

Dependencies

net8.0
- Microsoft.SemanticKernel (>= 1.74.0)

Version	Downloads	Last Updated
1.0.3	46	6/3/2026
1.0.2	66	5/3/2026
1.0.1	66	4/27/2026
1.0.0	75	4/27/2026

1.0.3: Self-contained engine (no Catalyst/YamlDotNet at runtime). Lemmas, verb forms and a proper-noun gazetteer for 50+ languages imported from Universal Dependencies; verbs drive compression and names are kept verbatim. Language data compiled to brotli artifacts (~13 MB assembly vs ~68 MB) with a compact detection index. Licensed under the Caveman License (MIT + mandatory attribution); bundled data under Universal Dependencies terms (see NOTICE).