LangDetect.Net 1.0.0

There is a newer version of this package available.
See the version list below for details.

LangDetect

Lightweight, self-contained language detection for .NET — no cloud, no Python, no runtime dependencies.

LangDetect is a .NET 8 class library for detecting the language of a given text string. It uses a multi-stage detection pipeline — Unicode script analysis, common word frequency matching, and character trigram profiling — to identify languages accurately across both native scripts and romanized Latin representations.

Features

Zero external dependencies — no cloud APIs, no Python runtime, no native libraries
Multi-stage pipeline — Unicode detection → word frequency → trigram fallback
Romanized script support — detects Singlish, Tanglish, Pinyin, Romaji, and other Latin-script representations
Confidence scoring — every result includes a confidence score and IsReliable flag
ISO 639-1 codes — all results expose standard language codes (en, ar, si, etc.)
Configurable word list sizes — choose between Small (200), Medium (500), or Large (1000) word lists
DI-friendly — first-class support for Microsoft.Extensions.DependencyInjection
NuGet-ready — single package install, embedded word lists and trigram data included

Supported Languages

Language	Native Script	Unicode Range	Romanized Detection
English	Latin	U+0000–U+007F	Native (Latin)
Arabic	Arabic	U+0600–U+06FF	Word list fallback
Hindi	Devanagari	U+0900–U+097F	Word list fallback
Mandarin	CJK Ideographs	U+4E00–U+9FFF	Word list fallback (Pinyin)
Japanese	Hiragana + Katakana	U+3040–U+30FF	Word list fallback (Romaji)
Korean	Hangul Syllables	U+AC00–U+D7AF	Word list fallback
Sinhala	Sinhala	U+0D80–U+0DFF	Word list fallback (Singlish)
Tamil	Tamil	U+0B80–U+0BFF	Word list fallback (Tanglish)
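The Unicode Range column can be read as a per-script lookup table. The following is a minimal sketch of a coverage-ratio check built from those ranges. The names `scriptRanges` and `Coverage` are hypothetical and are not part of LangDetect's API, and counting only letters is an assumption; the library's own scoring may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Inclusive code point bounds copied from the table above.
// Hypothetical helper data, not LangDetect's internal representation.
var scriptRanges = new Dictionary<string, (int Start, int End)>
{
    ["Arabic"]   = (0x0600, 0x06FF),
    ["Hindi"]    = (0x0900, 0x097F),
    ["Mandarin"] = (0x4E00, 0x9FFF),
    ["Japanese"] = (0x3040, 0x30FF),
    ["Korean"]   = (0xAC00, 0xD7AF),
    ["Sinhala"]  = (0x0D80, 0x0DFF),
    ["Tamil"]    = (0x0B80, 0x0BFF),
};

// Fraction of the letters in `text` that fall inside the given range.
static double Coverage(string text, (int Start, int End) range)
{
    int letters = 0, inRange = 0;
    foreach (Rune rune in text.EnumerateRunes())
    {
        if (!Rune.IsLetter(rune)) continue;   // skip spaces, digits, punctuation, marks
        letters++;
        if (rune.Value >= range.Start && rune.Value <= range.End) inRange++;
    }
    return letters == 0 ? 0.0 : (double)inRange / letters;
}

// Pick the script with the highest coverage for a sample string.
string sample = "مرحبا كيف حالك اليوم";
var best = scriptRanges
    .Select(kv => (Script: kv.Key, Ratio: Coverage(sample, kv.Value)))
    .MaxBy(x => x.Ratio);
Console.WriteLine($"{best.Script}: {best.Ratio}");
```

Text that mixes scripts, such as Japanese written with both kana and CJK ideographs, scores below 1.0 against any single range in this sketch, which is one reason a confidence threshold is useful.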

Installation

dotnet add package LangDetect.Net

Or via the NuGet Package Manager in Visual Studio: search for LangDetect.Net.

Quick Start

using LangDetect;
using LangDetect.Models;

// create a detector with default options
var factory  = new LanguageDetectorFactory();
var detector = factory.Create();

var result = detector.Detect("The quick brown fox jumps over the lazy dog");

Console.WriteLine(result.Language);    // English
Console.WriteLine(result.IsoCode);     // en
Console.WriteLine($"{result.Confidence:F2}");  // 1.00
Console.WriteLine(result.IsReliable);  // True
Console.WriteLine(result.DetectedBy);  // CommonWordDetectionStage

Detection Result

Every call to Detect() returns a DetectionResult record:

public record DetectionResult
{
    public Language Language     { get; init; }  // detected language or Unknown
    public float    Confidence   { get; init; }  // 0.0 – 1.0
    public bool     IsReliable   { get; init; }  // confidence >= configured threshold
    public string   DetectedBy   { get; init; }  // which pipeline stage fired
    public string   IsoCode      { get; init; }  // ISO 639-1 code e.g. "en", "si"
}

When detection fails or the input is shorter than MinInputLength, Detect() returns DetectionResult.Unknown. It does not throw for string input, including null and empty strings.

Configuration

var detector = new LanguageDetectorFactory().Create(new DetectorOptions
{
    ConfidenceThreshold = 0.80f,          // minimum score to be considered reliable
    EnableEarlyExit     = true,           // stop pipeline once confident result found
    WordListSize        = WordListSize.Large, // Small (200) | Medium (500) | Large (1000)
    MinInputLength      = 3,              // inputs shorter than this return Unknown
    MaxTokens           = 500,            // truncate long inputs before analysis
    MinNonLatinRatio    = 0.25f,          // minimum non-Latin ratio to trigger Unicode path
    Logger              = Console.WriteLine, // optional diagnostic logger
});

Word list sizes

Size	Words	Use case
`WordListSize.Small`	200	Memory-constrained environments, fast startup
`WordListSize.Medium`	500	Balanced — recommended default
`WordListSize.Large`	1000	Best accuracy, especially for short inputs
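To illustrate why a larger list helps on short inputs, here is a minimal sketch of token matching against the first N entries of a frequency-ordered list. The eight-word list and the `HitRatio` helper are made up for illustration; they are not the embedded word lists or LangDetect's matching logic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Tiny stand-in list ordered by frequency; the real lists hold 200 to 1000 entries.
string[] englishByFrequency = { "the", "of", "and", "to", "in", "is", "over", "quick" };

// Fraction of whitespace-separated tokens found among the first `size` words.
static double HitRatio(string text, IEnumerable<string> wordsByFrequency, int size)
{
    var words = new HashSet<string>(wordsByFrequency.Take(size), StringComparer.OrdinalIgnoreCase);
    var tokens = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    if (tokens.Length == 0) return 0.0;
    return (double)tokens.Count(t => words.Contains(t)) / tokens.Length;
}

string sample = "The quick brown fox jumps over the lazy dog";
Console.WriteLine(HitRatio(sample, englishByFrequency, 4));  // first 4 words only
Console.WriteLine(HitRatio(sample, englishByFrequency, 8));  // all 8 words
```

With the shorter list only "The" and "the" match; the longer list also catches "quick" and "over", so the same sentence produces a stronger signal.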

Dependency Injection

// Program.cs
builder.Services.AddLanguageDetector(options =>
{
    options.ConfidenceThreshold = 0.80f;
    options.WordListSize        = WordListSize.Large;
});

// inject and use anywhere
public class ContentService(IAudioLanguageDetector detector)
{
    public string GetLanguage(string text)
    {
        var result = detector.Detect(text);
        return result.IsReliable
            ? $"{result.Language} ({result.IsoCode})"
            : "Unknown";
    }
}

Detection Pipeline

LangDetect uses a three-stage pipeline. Each stage runs in priority order and the result is returned as soon as a confident detection is made (early exit).

Input text
    │
    ▼
TextPreprocessor
    │  normalize → tokenize → compute HasNonLatinUnicode + NonLatinRatio
    ▼
Does text contain non-Latin Unicode above MinNonLatinRatio threshold?
    │
    ├── YES → Stage 1: UnicodeDetectionStage
    │              Checks script coverage against 7 Unicode range profiles
    │              Confident result → return early
    │              Not confident   → fall through to Stage 2 + 3
    │
    └── NO  → Stage 2: CommonWordDetectionStage
                   Matches tokens against romanized word lists
                   Confident result → return early
                   Not confident   → Stage 3: NGramDetectionStage
                                         Scores character trigram profiles
                                         Returns best match or Unknown
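The early-exit control flow in the diagram can be sketched as below. This is an illustration only: stages are modeled as plain delegates, the `Run` helper and the stage names in the demo are hypothetical, and returning the best below-threshold result when no stage is confident is an assumption rather than documented behavior.

```csharp
using System;
using System.Collections.Generic;

// Runs stages in order and returns the first result at or above the threshold.
// If no stage is confident, returns the highest-confidence result seen (assumption).
static (string Language, double Confidence, string DetectedBy) Run(
    string text,
    IReadOnlyList<(string Name, Func<string, (string Language, double Confidence)> Detect)> stages,
    double threshold)
{
    var best = (Language: "Unknown", Confidence: 0.0, DetectedBy: "None");
    foreach (var stage in stages)
    {
        var (language, confidence) = stage.Detect(text);
        if (confidence > best.Confidence)
            best = (language, confidence, stage.Name);
        if (confidence >= threshold)
            return best;   // early exit: later stages never run
    }
    return best;
}

// Two fake stages; the second counts its calls so the early exit is observable.
int trigramCalls = 0;
var stages = new List<(string Name, Func<string, (string Language, double Confidence)> Detect)>
{
    ("WordStage",    _ => ("English", 0.90)),
    ("TrigramStage", _ => { trigramCalls++; return ("English", 0.60); }),
};

var result = Run("some text", stages, threshold: 0.80);
Console.WriteLine($"{result.Language} via {result.DetectedBy}, trigram calls: {trigramCalls}");
```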

Stage details

Stage	Priority	Technique	Best for
`UnicodeDetectionStage`	1	Script range coverage ratio	Arabic, Hindi, Mandarin, Japanese, Korean, Sinhala, Tamil in native script
`CommonWordDetectionStage`	2	Token frequency matching	English, romanized scripts
`NGramDetectionStage`	3	Character trigram scoring	Short inputs, ambiguous text
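For readers unfamiliar with character trigram scoring, the sketch below builds a trigram profile from a reference string and scores an input by the share of its trigrams that appear in that profile. The `Trigrams` and `Score` helpers and the overlap formula are assumptions chosen for brevity; LangDetect's actual profiles and scoring function are not described in this README.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Counts overlapping 3-character windows in the lower-cased text.
static Dictionary<string, int> Trigrams(string text)
{
    var counts = new Dictionary<string, int>();
    string s = text.ToLowerInvariant();
    for (int i = 0; i + 3 <= s.Length; i++)
    {
        string gram = s.Substring(i, 3);
        counts[gram] = counts.GetValueOrDefault(gram) + 1;
    }
    return counts;
}

// Share of the input's trigram occurrences that also exist in the profile (0.0 to 1.0).
static double Score(Dictionary<string, int> input, Dictionary<string, int> profile)
{
    int total = input.Values.Sum();
    if (total == 0) return 0.0;   // input shorter than three characters
    int shared = input.Where(kv => profile.ContainsKey(kv.Key)).Sum(kv => kv.Value);
    return (double)shared / total;
}

// A single sentence stands in for a profile derived from a full word list.
var profile = Trigrams("the quick brown fox jumps over the lazy dog");
Console.WriteLine(Score(Trigrams("the dog"), profile));
Console.WriteLine(Score(Trigrams("zzzzzz"), profile));
```

Because the score depends only on short character windows, it still produces a signal when no whole token matches a word list, which fits the fallback role this stage has in the pipeline.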

Examples

Native script detection

detector.Detect("مرحبا كيف حالك اليوم");
// → { Language: Arabic, Confidence: 1.00, IsReliable: true, IsoCode: "ar" }

detector.Detect("नमस्ते आप कैसे हैं");
// → { Language: Hindi, Confidence: 1.00, IsReliable: true, IsoCode: "hi" }

detector.Detect("こんにちは世界");
// → { Language: Japanese, Confidence: 0.81, IsReliable: true, IsoCode: "ja" }

Romanized script detection

detector.Detect("mama giye koheda kiyala amma");
// → { Language: Sinhala, Confidence: 0.85, IsReliable: true, IsoCode: "si" }

detector.Detect("naan pogiren enna romba thanks");
// → { Language: Tamil, Confidence: 0.80, IsReliable: true, IsoCode: "ta" }

Graceful unknown handling

detector.Detect("");          // → DetectionResult.Unknown
detector.Detect("123456");   // → DetectionResult.Unknown
detector.Detect(null);       // → DetectionResult.Unknown (never throws)

Diagnostic logging

using System.Diagnostics;

var detector = new LanguageDetectorFactory().Create(new DetectorOptions
{
    Logger = msg => Debug.WriteLine(msg),   // any Action<string>-style sink works here
});

Example output (here with the 1000-word English list being loaded):

[LangDetect] Attempting to load resource: 'LangDetect.Resources.Wordlists.English-1000-Wordlist.txt'
[LangDetect] SUCCESS: Loaded 'LangDetect.Resources.Wordlists.English-1000-Wordlist.txt'
[LangDetect] Loaded 1000 words for 'English'

ISO Language Codes

using LangDetect.Utility;

LanguageCode.ToIso(Language.English);   // "en"
LanguageCode.ToIso(Language.Sinhala);   // "si"
LanguageCode.ToIso(Language.Unknown);   // "und" (ISO 639-2 "undetermined"; ISO 639-1 has no code for it)

LanguageCode.FromIso("ar");             // Language.Arabic
LanguageCode.FromIso("zh");             // Language.Mandarin
LanguageCode.FromIso("xyz");            // Language.Unknown
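The behavior shown above (a two-way lookup with a fallback for unknown values) can be approximated with a pair of dictionaries. This is a stand-alone sketch using plain strings instead of the library's Language enum; the `toIso` and `fromIso` tables and both helper functions are hypothetical, and the case-insensitive lookup is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Language name to ISO 639-1 code for the eight supported languages.
var toIso = new Dictionary<string, string>
{
    ["English"] = "en", ["Arabic"] = "ar", ["Hindi"] = "hi", ["Mandarin"] = "zh",
    ["Japanese"] = "ja", ["Korean"] = "ko", ["Sinhala"] = "si", ["Tamil"] = "ta",
};

// Reverse table, built once; codes are matched case-insensitively (assumption).
var fromIso = toIso.ToDictionary(kv => kv.Value, kv => kv.Key, StringComparer.OrdinalIgnoreCase);

// Unknown inputs fall back to "und" / "Unknown", mirroring the examples above.
string ToIso(string language) => toIso.TryGetValue(language, out var code) ? code : "und";
string FromIso(string code) => fromIso.TryGetValue(code, out var language) ? language : "Unknown";

Console.WriteLine(ToIso("Sinhala"));
Console.WriteLine(FromIso("zh"));
Console.WriteLine(FromIso("xyz"));
```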

Known Limitations

Mixed-script text — a sentence containing both Japanese and English characters may not detect reliably. Detection is based on the dominant script ratio (configurable via MinNonLatinRatio).
Very short inputs — single words or very short phrases reduce confidence across all stages. Use WordListSize.Large for best results on short text.
Romanized Mandarin (Pinyin) — Pinyin uses common Latin characters that overlap with English. Detection accuracy is moderate; confidence scores are intentionally conservative.
No multi-language detection — a single Detect() call returns one language. Mixed documents are planned for v2.
N-gram profiles are derived from word lists — trigram quality is directly proportional to word list quality and size.
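The mixed-script limitation is easier to reason about with a concrete ratio. The sketch below computes the share of letters outside U+0000 to U+007F and compares it with the MinNonLatinRatio value from the Configuration example. The `NonLatinRatio` helper and its definition of "non-Latin" are assumptions; the library may compute its ratio differently.

```csharp
using System;
using System.Text;

// Share of letters that lie outside Basic Latin (U+0000 to U+007F).
static double NonLatinRatio(string text)
{
    int letters = 0, nonLatin = 0;
    foreach (Rune rune in text.EnumerateRunes())
    {
        if (!Rune.IsLetter(rune)) continue;   // ignore spaces, digits, punctuation
        letters++;
        if (rune.Value > 0x007F) nonLatin++;
    }
    return letters == 0 ? 0.0 : (double)nonLatin / letters;
}

const double minNonLatinRatio = 0.25;   // value taken from the Configuration example
string mixed = "こんにちは hello";       // five kana letters, five Latin letters
double ratio = NonLatinRatio(mixed);
Console.WriteLine(ratio >= minNonLatinRatio ? "Unicode path" : "word-list path");  // Unicode path
```

Raising the threshold sends more mixed inputs to the word-list stages, while lowering it lets a few native-script characters route the whole input to the Unicode stage.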

Roadmap

V1 (current)

Unicode range detection for 7 non-Latin scripts
Common word frequency matching
Character trigram profiling
Romanized script detection (Singlish, Tanglish, Pinyin, Romaji)
Configurable word list sizes (Small / Medium / Large)
ISO 639-1 language codes
DI extension (AddLanguageDetector)
Diagnostic logger support
Confidence scoring and IsReliable flag

V2 (planned)

Multi-language detection — ranked candidate list with per-language confidence scores
Code-switching support — detect language changes within a single document
Expanded language support — French, Spanish, Portuguese, German, Russian
Compact ONNX model for Latin-script disambiguation
Dialect identification — Mandarin vs Cantonese, Indian English vs British English
Streaming / span detection over long documents
Calibrated confidence scores via isotonic regression
Proper benchmark suite with labeled test dataset

Contributing

Contributions are welcome. Please open an issue before submitting a pull request for non-trivial changes.

Fork the repository
Create a feature branch (git checkout -b feature/my-feature)
Commit your changes
Push to the branch and open a Pull Request

Running tests

dotnet test

Generating trigram data

If you update the word lists, regenerate the trigram JSON files using the included tool:

cd LangDetect.TrigramGenerator
dotnet run

Then copy the output from Resources/Trigrams/ into LangDetect/Resources/Trigrams/ and rebuild.

License

This project is licensed under the GNU General Public License v3.0.

Author

Vishal Rashmika GitHub · NuGet

Product	Compatible and additional computed target framework versions.
.NET	net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

net8.0
- Microsoft.Extensions.DependencyInjection.Abstractions (>= 10.0.5)

Version	Downloads	Last Updated
1.0.1	61	4/7/2026
1.0.0	72	4/5/2026