LangDetect.Net 1.0.4

dotnet add package LangDetect.Net --version 1.0.4
                    
NuGet\Install-Package LangDetect.Net -Version 1.0.4
                    
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="LangDetect.Net" Version="1.0.4" />
                    
For projects that support PackageReference, copy this XML node into the project file to reference the package.
<PackageVersion Include="LangDetect.Net" Version="1.0.4" />
                    
Directory.Packages.props
<PackageReference Include="LangDetect.Net" />
                    
Project file
For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.
paket add LangDetect.Net --version 1.0.4
                    
#r "nuget: LangDetect.Net, 1.0.4"
                    
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
#:package LangDetect.Net@1.0.4
                    
#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.
#addin nuget:?package=LangDetect.Net&version=1.0.4
                    
Install as a Cake Addin
#tool nuget:?package=LangDetect.Net&version=1.0.4
                    
Install as a Cake Tool

LangDetect.Net

<div align="center"> <img src="https://raw.githubusercontent.com/VishalRashmika/LangDetect/refs/heads/main/Assets/LangDetect-Logo-Transparent-BG.png" alt="LangDetect Logo" width="200" height="200" /> </div>

Lightweight, self-contained language detection for .NET — no cloud, no Python, no runtime dependencies.

NuGet NuGet Downloads License: GPL-3.0 .NET

LangDetect is a .NET 8 class library for detecting the language of a given text string. It uses a multi-stage detection pipeline — Unicode script analysis, common word frequency matching, and character trigram profiling — to identify languages accurately across both native scripts and romanized Latin representations.


Supported Languages

Language Native Script Unicode Range Romanized Detection
English Latin U+0000–U+007F Native (Latin)
Arabic Arabic U+0600–U+06FF ✓ Word list (Arabizi)
Hindi Devanagari U+0900–U+097F ✓ Word list (Hinglish)
Mandarin CJK Ideographs U+4E00–U+9FFF ✓ Word list (Pinyin)
Japanese Hiragana + Katakana U+3040–U+30FF ✓ Word list (Romaji)
Korean Hangul Syllables U+AC00–U+D7AF ✓ Word list (Revised Romanization)
Sinhala Sinhala U+0D80–U+0DFF ✓ Word list (Singlish)
Tamil Tamil U+0B80–U+0BFF ✓ Word list (Tanglish)

Installation

dotnet add package LangDetect.Net

Or via the NuGet Package Manager in Visual Studio — search for LangDetect.Net.


Quick Start

using LangDetect;
using LangDetect.Models;

// create a detector with default options
var factory  = new LanguageDetectorFactory();
var detector = factory.Create();

var result = detector.Detect("The quick brown fox jumps over the lazy dog");

Console.WriteLine(result.Language);    // English
Console.WriteLine(result.IsoCode);     // en
Console.WriteLine(result.Confidence);  // 1.00
Console.WriteLine(result.IsReliable);  // True
Console.WriteLine(result.DetectedBy);  // CommonWordDetectionStage

Detection Result

Every call to Detect() returns a DetectionResult record:

public record DetectionResult
{
    public Language Language     { get; init; }  // detected language or Unknown
    public float    Confidence   { get; init; }  // 0.0 – 1.0
    public bool     IsReliable   { get; init; }  // confidence >= configured threshold
    public string   DetectedBy   { get; init; }  // which pipeline stage fired
    public string   IsoCode      { get; init; }  // ISO 639-1 code e.g. "en", "si"
}

When detection fails or input is too short, DetectionResult.Unknown is returned — Detect() never throws for valid string input.


Configuration

var detector = new LanguageDetectorFactory().Create(new DetectorOptions
{
    ConfidenceThreshold = 0.80f,          // minimum score to be considered reliable
    EnableEarlyExit     = true,           // stop pipeline once confident result found
    WordListSize        = WordListSize.Large, // Small (200) | Medium (500) | Large (1000)
    MinInputLength      = 3,              // inputs shorter than this return Unknown
    MaxTokens           = 500,            // truncate long inputs before analysis
    MinNonLatinRatio    = 0.25f,          // minimum non-Latin ratio to trigger Unicode path
    Logger              = Console.WriteLine, // optional diagnostic logger
});

Word list sizes

Size Words Use case
WordListSize.Small 200 Memory-constrained environments, fast startup
WordListSize.Medium 500 Balanced — recommended default
WordListSize.Large 1000 Best accuracy, especially for short inputs

Detection Pipeline

LangDetect uses a three-stage pipeline. Each stage runs in priority order and the result is returned as soon as a confident detection is made (early exit).

<center> <img src="https://raw.githubusercontent.com/VishalRashmika/LangDetect/refs/heads/main/Assets/process-flow/process-flow-v1.png" style="width:500px"> </center>

Stage details

Stage Priority Technique Best for
UnicodeDetectionStage 1 Script range coverage ratio Arabic, Hindi, Mandarin, Japanese, Korean, Sinhala, Tamil in native script
CommonWordDetectionStage 2 Token frequency matching English, romanized scripts
NGramDetectionStage 3 Character trigram scoring Short inputs, ambiguous text

Examples

Native script detection

detector.Detect("مرحبا كيف حالك اليوم");
// → { Language: Arabic, Confidence: 1.00, IsReliable: true, IsoCode: "ar" }

detector.Detect("नमस्ते आप कैसे हैं");
// → { Language: Hindi, Confidence: 1.00, IsReliable: true, IsoCode: "hi" }

detector.Detect("こんにちは世界");
// → { Language: Japanese, Confidence: 0.81, IsReliable: true, IsoCode: "ja" }

Romanized script detection

detector.Detect("mama giye koheda kiyala amma");
// → { Language: Sinhala, Confidence: 0.85, IsReliable: true, IsoCode: "si" }

detector.Detect("naan pogiren enna romba thanks");
// → { Language: Tamil, Confidence: 0.80, IsReliable: true, IsoCode: "ta" }

Graceful unknown handling

detector.Detect("");          // → DetectionResult.Unknown
detector.Detect("123456");   // → DetectionResult.Unknown
detector.Detect(null);       // → DetectionResult.Unknown (never throws)

Diagnostic logging

var detector = new LanguageDetectorFactory().Create(new DetectorOptions
{
    Logger = msg => Debug.WriteLine(msg),
});

Output:

[LangDetect] Attempting to load resource: 'LangDetect.Resources.Wordlists.English-1000-Wordlist.txt'
[LangDetect] SUCCESS: Loaded 'LangDetect.Resources.Wordlists.English-1000-Wordlist.txt'
[LangDetect] Loaded 1000 words for 'English'

Benchmark Results

On 2026-05-10, I ran a benchmark suite across all 8 supported languages and compared LangDetect.Net against Lingua and Panlingo using their default configurations, without any fine-tuning.

The benchmark dataset used for these results is available on Kaggle: LangDetect.Net v1 Benchmark Dataset.

Key takeaways

  • LangDetect.Net delivered the highest overall accuracy.
  • Lingua reported strong confidence scores, but its accuracy and memory usage were both significantly worse.
  • Panlingo was the fastest library in raw per-sample time, but it trailed LangDetect.Net on accuracy.
  • The confusion matrix shows a strong diagonal with limited cross-language misclassification.
  • Hindi and Japanese accounted for the most confusion, mainly because of overlapping romanization patterns.
  • The three-stage pipeline - Unicode script analysis, word frequency matching, and trigram profiling - gives LangDetect.Net its best balance of accuracy, speed, and memory use.
  • All libraries were tested with their default settings, with no manual tuning.

Comparison Table

Library Correct / Total Failures Accuracy Avg. confidence Avg. time / sample Avg. memory / sample
LangDetect.Net 1320 / 1572 252 83.97% 0.757 0.174 ms 2,404 B
Lingua 796 / 1572 776 50.64% 0.940 0.233 ms 44,948 B
Panlingo 806 / 1572 766 51.27% 0.439 0.014 ms 74 B

Note: confidence scores are library-reported values and are not directly normalized across engines.

Visual Summary

Accuracy heatmap Confusion matrix
Accuracy by language Confusion matrix
Raw performance Memory percentiles
Grouped performance comparison Memory percentile comparison

Additional charts are available in Assets/testBenchmarks/, including per-language radar plots, language winner breakdowns, stacked misclassifications, and top failure summaries.


ISO Language Codes

using LangDetect.Utility;

LanguageCode.ToIso(Language.English);   // "en"
LanguageCode.ToIso(Language.Sinhala);   // "si"
LanguageCode.ToIso(Language.Unknown);   // "und"

LanguageCode.FromIso("ar");             // Language.Arabic
LanguageCode.FromIso("zh");             // Language.Mandarin
LanguageCode.FromIso("xyz");            // Language.Unknown

Known Limitations

  • Mixed-script text
  • Very short inputs
  • Romanized Mandarin (Pinyin)
  • No multi-language detection — a single Detect() call returns one language. Mixed documents are planned for v2.
  • N-gram profiles are derived from word lists — trigram quality is directly proportional to word list quality and size.

Roadmap

Planned features in V2

  • Multi-language detection — ranked candidate list with per-language confidence scores
  • Code-switching support — detect language changes within a single document
  • Expanded language support — French, Spanish, Portuguese, German, Russian
  • Compact ONNX model for Latin-script disambiguation
  • Dialect identification — Mandarin vs Cantonese, Indian English vs British English
  • Streaming / span detection over long documents
  • Calibrated confidence scores via isotonic regression

Contributing

Contributions are welcome. Please open an issue before submitting a pull request for non-trivial changes.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Commit your changes
  4. Push to the branch and open a Pull Request

Generating trigram data

If you update the word lists, regenerate the trigram JSON files using the included tool:

cd LangDetect.TrigramGenerator
dotnet run

Then copy the output from Resources/Trigrams/ into LangDetect/Resources/Trigrams/ and rebuild.


License

This project is licensed under the GNU General Public License v3.0.

Product Compatible and additional computed target framework versions.
.NET net8.0 is compatible.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed.  net9.0 was computed.  net9.0-android was computed.  net9.0-browser was computed.  net9.0-ios was computed.  net9.0-maccatalyst was computed.  net9.0-macos was computed.  net9.0-tvos was computed.  net9.0-windows was computed.  net10.0 was computed.  net10.0-android was computed.  net10.0-browser was computed.  net10.0-ios was computed.  net10.0-maccatalyst was computed.  net10.0-macos was computed.  net10.0-tvos was computed.  net10.0-windows was computed. 
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last Updated
1.0.4 90 6/5/2026
1.0.2 103 4/12/2026
1.0.1 103 4/7/2026
1.0.0 103 4/5/2026