Linguistics 1.0.1


Linguistics: High-Performance Arabic NLP Library


Linguistics is a specialized .NET library for Arabic text processing, morphology analysis, and root extraction. It is engineered for high-throughput scenarios (search engines, indexing pipelines) where allocations and CPU cycles are critical.

📖 Full API reference and architecture deep-dive: MANUAL.md


Requirements

Target         Minimum Version
-------------------------------------------------------
.NET           6.0, 7.0, 8.0, 9.0, or 10.0
OS             Windows, Linux, macOS (any)
Dependencies   None (zero external NuGet dependencies)

Installation

dotnet add package Linguistics

Key Features

  • Zero-Allocation Architecture: Built on Span<T>, ref struct, and stackalloc, for minimal GC pressure even at millions of words/second.
  • Data/Logic Isolation: Linguistic data (roots, patterns, stop words, weak letters, strange words) is decoupled from logic and compiled into the DLL, so there is no runtime IO.
  • Advanced Diacritic Engine:
    • Bitmask Scanning: O(1) detection for standard Arabic diacritics.
    • Greedy Normalization: Handles complex Quranic rules (e.g., Hamza + Fatha + Alef → Alef Medda) before stripping simple marks.
    • Three Diacritic Categories: Common (Fatha/Damma/Kasra/Shadda/Sukun/Tanwin), Quranic (Maddah/Hamza Above/Below/Subscript Alef), and Rare/Extended marks.
  • Morphological Analysis (Root Extraction):
    • Trilateral (3-letter) and Quadrilateral (4-letter) roots.
    • Weak letter handling (I'lal) for First (وصل → صل), Middle (قول → قل), and Last (دعو → دع).
    • Geminated root resolution (Mudha'af: مدد → مد).
    • Hamzated root handling (المهموز) for Hamza in start, middle, and end positions.
    • Pattern-based root extraction via 40+ morphological templates (Awzan).
    • Sun and Moon letter definite article assimilation.
    • Foreign/strange word filtering (e.g., مانديلا, فرنسا).
    • Fuzzy Normalization: Optional ى↔ي and ة↔ه normalization for orthographic variation tolerance.
  • Regex-Free Sanitization: Optimized filters for non-alphanumeric removal, 10x faster than Regex.Replace.
  • Integer-Packed Lookups: Roots and weak letters packed into ulong/uint for O(1) hash-based lookups with zero string allocation.
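
To make the integer-packed lookup idea concrete, here is a minimal sketch of how short roots could be packed into a ulong and checked against a hash set. It is an illustration only: PackedRootSketch, Pack, BuildSet, and IsKnownRoot are hypothetical names, and the 16-bits-per-letter layout is an assumption, not the library's actual packing scheme.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: names and bit layout are assumptions, not the library's code.
public static class PackedRootSketch
{
    // Packs 1 to 4 UTF-16 code units into one ulong, 16 bits per letter.
    public static ulong Pack(ReadOnlySpan<char> letters)
    {
        if (letters.Length == 0 || letters.Length > 4)
            throw new ArgumentException("Expected 1 to 4 letters.", nameof(letters));

        ulong key = 0;
        foreach (char c in letters)
            key = (key << 16) | c;
        return key;
    }

    // Builds the set once; later lookups hash a ulong instead of a string.
    public static HashSet<ulong> BuildSet(IEnumerable<string> roots)
    {
        var set = new HashSet<ulong>();
        foreach (string root in roots)
            set.Add(Pack(root.AsSpan()));
        return set;
    }

    // Works directly on a span, so no string is created for the candidate.
    public static bool IsKnownRoot(HashSet<ulong> set, ReadOnlySpan<char> candidate)
        => candidate.Length is >= 1 and <= 4 && set.Contains(Pack(candidate));
}
```

With a set built from { "كتب", "درس" }, IsKnownRoot(set, "كتب".AsSpan()) returns true, while a reordered candidate such as "كبت" returns false because letter order is part of the packed key.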

Quick Start

1. Removing Diacritics (Hot Path)

using Linguistics;

string text = "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ";

// Returns original string if no diacritics found (zero allocation)
string clean = ArabicDiacritics.RemoveDiacritics(text);
// Output: "بسم الله الرحمن الرحيم"

2. Quranic Normalization (Replacement Engine)

using Linguistics;
using Linguistics.Data;

string quranText = "ٱلْحَمْدُ"; // Contains Alef Wasla (\u0671)

// Greedy matching replaces compound symbols before stripping remaining marks
string normalized = ArabicDiacritics.Normalize(quranText, ArabicDiacriticsPatterns.Patterns);
// Output: "الحمد"

3. Root Extraction (Stemming)

using Linguistics;

string word = "يكتبون"; // "They are writing"

// Runs full pipeline: Clean → Filter → Stem → Resolve
string root = ArabicMorphologyHelper.FormatWord(word, applyFuzzyNormalization: true);
// Output: "كتب" (K-T-B root)
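
For readers wondering what the applyFuzzyNormalization flag implies, the following sketch shows one way the ى↔ي and ة↔ه folding could be done in place on a span. The README does not state which direction each mapping goes, so the choice here (ى becomes ي, ة becomes ه) is an assumption, and FuzzyNormalizationSketch is a hypothetical name rather than a type from the package.

```csharp
using System;

// Hypothetical sketch: the mapping direction is an assumption, not confirmed by the library.
public static class FuzzyNormalizationSketch
{
    private const char AlefMaksura = '\u0649'; // ى
    private const char Yeh = '\u064A';         // ي
    private const char TehMarbuta = '\u0629';  // ة
    private const char Heh = '\u0647';         // ه

    // Rewrites the buffer in place, so no new string is allocated.
    public static void Normalize(Span<char> word)
    {
        for (int i = 0; i < word.Length; i++)
        {
            if (word[i] == AlefMaksura) word[i] = Yeh;
            else if (word[i] == TehMarbuta) word[i] = Heh;
        }
    }
}
```

Under this assumption, the spellings مدرسة and مدرسه both end up as مدرسه, so they compare equal in later lookups.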

4. Text Sanitization (via Facade)

using COMN.Utils;

string dirty = "Hello! @تَجرُبَة#";

// Removes diacritics AND non-alphanumeric symbols, with no regex allocations
string clean = TextUtils.RemoveTashkil(dirty, removeNoneAlphaNum: true);
// Output: "Helloتجربة"

5. Zero-Allocation Root Extraction

using Linguistics;

ReadOnlySpan<char> input = "الكتاب";
Span<char> outputBuffer = stackalloc char[64];

int length = ArabicMorphologyHelper.FormatWord(input, outputBuffer, applyFuzzyNormalization: false);
string root = new string(outputBuffer.Slice(0, length));
// Output: "كتب" (zero heap allocations for processing)
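
The fixed 64-char stackalloc above suits single words. For inputs of unknown length, a common .NET pattern is to use the stack for small inputs and rent from ArrayPool<char> for large ones. The sketch below shows that pattern on a simple letters-and-digits filter. BufferSketch, KeepLettersAndDigits, and the 256-char threshold are hypothetical choices for illustration and are not taken from the library.

```csharp
using System;
using System.Buffers;

// Hypothetical sketch of the stackalloc / ArrayPool<char> buffer pattern.
public static class BufferSketch
{
    private const int StackLimit = 256; // assumed threshold, chosen for illustration

    public static string KeepLettersAndDigits(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));

        char[]? rented = null;
        // Small inputs use the stack; larger ones borrow a pooled array.
        Span<char> buffer = text.Length <= StackLimit
            ? stackalloc char[StackLimit]
            : (rented = ArrayPool<char>.Shared.Rent(text.Length));
        try
        {
            int written = 0;
            foreach (char c in text)
                if (char.IsLetterOrDigit(c))
                    buffer[written++] = c;

            // Nothing removed: return the original instance instead of a copy.
            return written == text.Length ? text : new string(buffer.Slice(0, written));
        }
        finally
        {
            if (rented is not null)
                ArrayPool<char>.Shared.Return(rented);
        }
    }
}
```

Because char.IsLetterOrDigit is false for combining marks, this filter also drops Arabic diacritics. The only heap allocation is the final result string, and even that is skipped when the input is already clean.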

⚠️ Exception handling: Methods throw ArgumentException on null/empty input. All public methods are thread-safe.


Performance

This library avoids string.Replace and Regex on hot paths:

Operation                 Technique                                          Complexity
----------------------------------------------------------------------------------------------------------
Diacritic detection       Bitmask scan (3 category masks)                    O(1) per char
Root lookup               ulong-packed hashset                               O(1)
Weak letter lookup        uint-packed hashset                                O(1)
Geminated root lookup     uint-packed hashset                                O(1)
Stop word lookup (Span)   Length-bucketed linear scan                        O(k), k = bucket size
Pattern matching          Length-bucketed with pre-calculated root indices   O(p), p = patterns per bucket
Text buffers              stackalloc / ArrayPool<char>                       Zero heap alloc
Data loading              Compiled into DLL (static fields)                  Zero IO at runtime

For detailed performance characteristics, buffer sizes, and memory profiles, see MANUAL.md → Advanced Usage.
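
As a rough illustration of the bitmask scan listed in the table, the sketch below tests a character against 64-bit masks covering the Unicode range U+0640..U+067F. It models only two of the three categories (common and Quranic marks). The mask layout and the names DiacriticMaskSketch, IsDiacritic, and ContainsDiacritics are assumptions made for this example, not the library's implementation.

```csharp
using System;

// Hypothetical sketch: mask layout and names are assumptions, not the library's code.
public static class DiacriticMaskSketch
{
    private const int BlockStart = 0x0640; // 64 code points: U+0640..U+067F

    // Bits 11..18 cover U+064B..U+0652 (tanwin, fatha, damma, kasra, shadda, sukun).
    private const ulong CommonMask = 0xFFUL << (0x064B - BlockStart);

    // Bits 19..22 cover U+0653..U+0656 (maddah, hamza above, hamza below, subscript alef).
    private const ulong QuranicMask = 0xFUL << (0x0653 - BlockStart);

    public static bool IsDiacritic(char c)
    {
        // Characters below the block wrap to a large unsigned value and fail the range test.
        uint offset = unchecked((uint)(c - BlockStart));
        return offset < 64 && ((CommonMask | QuranicMask) & (1UL << (int)offset)) != 0;
    }

    public static bool ContainsDiacritics(ReadOnlySpan<char> text)
    {
        foreach (char c in text)
            if (IsDiacritic(c)) return true;
        return false;
    }
}
```

Each character costs one subtraction, one comparison, and one AND, with no table lookups or branches over individual code points. A removal routine can call ContainsDiacritics first and return the original string unchanged when the scan finds nothing.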


Architecture

The library follows a SOLID architecture with clear separation of concerns:

Linguistics (namespace)
├── ArabicDiacritics          - Diacritic detection, removal, normalization
├── ArabicDiacriticsPattern   - Compiled pattern struct for normalization rules
├── ArabicMorphologyHelper    - Main orchestrator (FormatWord pipeline)
├── MorphologyResult          - Ref struct for zero-allocation word mutation
├── ArabicRootExtractor       - Root extraction logic (geminated, hamzated, weak)
├── ArabicAffixStripper       - Affix stripping (articles, prefixes, suffixes)
├── ArabicTriRoots            - Trilateral root validation (ulong-packed)
├── ArabicQuadRoots           - Quadrilateral root validation (ulong-packed)
├── ArabicTriPatterns         - Pattern-based root extraction (Awzan)
├── ArabicDuplicates          - Geminated root detection (uint-packed)
├── ArabicFirstWeaks          - First-position weak letter detection
├── ArabicMiddleWeaks         - Middle-position weak letter detection
├── ArabicLastWeaks           - Last-position weak letter detection
├── ArabicStopWords           - Stop word filtering (length-bucketed)
├── ArabicStrange             - Foreign word filtering (length-bucketed)
├── TextPunctuation           - Punctuation detection and removal
│
└── Data (namespace)
    ├── ArabicConstants       - String constants for Arabic characters
    ├── ArabicConstantsChar   - Char constants for high-performance processing
    ├── ArabicRootsData       - Trilateral, Quadrilateral, Geminated root sets
    ├── ArabicAffixesData     - Prefixes, Suffixes, Definite Articles
    ├── ArabicStopWordsData   - Stop word list
    ├── ArabicStrangeData     - Foreign word list
    ├── ArabicWeaksData       - Weak letter data (by position)
    └── ArabicPatternsData    - Morphology templates + Diacritic normalization patterns

Acknowledgments

Built upon Khoja's Arabic Stemmer (Khoja & Garside, 1999), extended and optimized for:

  • Zero-allocation design with modern .NET (Span<T>, ref struct)
  • SOLID architecture (ArabicRootExtractor, ArabicAffixStripper)
  • Enhanced accuracy: Pattern X geminated root handling, priority-based weak root resolution
  • Production validation: 97.9% test coverage

Reference: Khoja, S., & Garside, R. (1999). Stemming Arabic Text. Lancaster, UK: Computing Department, Lancaster University.


Contributing

Contributions are welcome! Please open an issue or pull request on GitHub. For local development:

dotnet restore src/Linguistics.slnx
dotnet build src/Linguistics.slnx
dotnet test src/Linguistics.slnx
dotnet test src/Linguistics.slnx --settings src/codecoverage.runsettings --collect "Code Coverage;Format=cobertura"

Test summary: total: 191, failed: 0, succeeded: 191, skipped: 0, duration: 5s

Bonus:

Install DotCov, a toolkit that streams Cobertura XML coverage reports. It is a zero-dependency parser shipped as a dotnet global tool, and it handles 50 MB+ reports without loading the DOM.

dotnet tool install -g DotCov.Tool

dotcov report src/Linguistics.Tests/TestResults/f1c4408e-8eec-4c51-9965-73219968d341/coverage-2026-05-24.20-30-36.cobertura.xml

File                                                                                 Lines    Line %    Branches  Branch %
--------------------------------------------------------------------------------------------------------------------------
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicRootExtractor.cs          100/147     68.0%      69/102     67.6%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicStrange.cs                  25/34     73.5%       12/16     75.0%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicStopWords.cs                32/37     86.5%       14/16     87.5%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicDiacritics.cs             161/172     93.6%       60/68     88.2%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicTriPatterns.cs              75/80     93.8%       47/50     94.0%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicFirstWeaks.cs               17/18     94.4%         5/6     83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicMorphologyHelper.cs       129/136     94.9%       75/82     91.5%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicAffixStripper.cs            57/59     96.6%       25/30     83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\TextPunctuation.cs              100/102     98.0%       43/44     97.7%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicDefiniteArticle.cs            6/6    100.0%           -         -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicDiacriticsPattern.cs        14/14    100.0%         2/2    100.0%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicDuplicates.cs               13/13    100.0%         5/6     83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicLastWeaks.cs                22/22    100.0%         5/6     83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicMiddleWeaks.cs              18/18    100.0%         5/6     83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicPrefixes.cs                   6/6    100.0%           -         -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicQuadRoots.cs                25/25    100.0%         5/6     83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicSuffixes.cs                   6/6    100.0%           -         -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicTriRoots.cs                 23/23    100.0%         5/6     83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\MorphologyResult.cs               64/64    100.0%       33/34     97.1%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\Data\ArabicAffixesData.cs         12/12    100.0%           -         -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\Data\ArabicPatternsData.cs      105/105    100.0%           -         -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\Data\ArabicRootsData.cs         392/392    100.0%           -         -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\Data\ArabicStopWordsData.cs       30/30    100.0%           -         -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\Data\ArabicStrangeData.cs           4/4    100.0%           -         -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\Data\ArabicWeaksData.cs         243/243    100.0%           -         -
--------------------------------------------------------------------------------------------------------------------------
TOTAL                                                                            1679/1768     95.0%     410/480     85.4%

The test suite uses NUnit with FluentAssertions and runs with parallel fixture execution for fast feedback. Tests cover:

  • All public API methods across every class
  • Morphological phenomena (geminated, hamzated, weak, pattern-based)
  • Data integrity (sorting, non-empty, consistency between String and Span overloads)
  • Edge cases (buffer overflow, sun/moon letters, complex affix combinations)

License

MIT License.


Version   Downloads   Last Updated
----------------------------------
1.0.1     95          5/24/2026
1.0.0     93          5/24/2026