Linguistics 1.0.1
Linguistics: High-Performance Arabic NLP Library
Linguistics is a specialized .NET library for Arabic text processing, morphology analysis, and root extraction. It is engineered for high-throughput scenarios (search engines, indexing pipelines) where allocations and CPU cycles are critical.
Full API reference and architecture deep-dive: MANUAL.md
Table of Contents
- Requirements
- Installation
- Key Features
- Quick Start
- Performance
- Architecture
- Acknowledgments
- Contributing
- License
Requirements
| Target | Minimum Version |
|---|---|
| .NET | 6.0, 7.0, 8.0, 9.0, or 10.0 |
| OS | Windows, Linux, macOS (any) |
| Dependencies | None (zero external NuGet dependencies) |
Installation
dotnet add package Linguistics
Key Features
- Zero-Allocation Architecture: Built on Span<T>, ref struct, and stackalloc for minimal GC pressure even at millions of words/second.
- Data/Logic Isolation: Linguistic data (roots, patterns, stop words, weak letters, strange words) is decoupled from logic and compiled into the DLL, so there is no runtime IO.
- Advanced Diacritic Engine:
  - Bitmask Scanning: O(1) detection for standard Arabic diacritics.
  - Greedy Normalization: Handles complex Quranic rules (e.g., Hamza+Fatha+Alef→Alef Medda) before stripping simple marks.
  - Three Diacritic Categories: Common (Fatha/Damma/Kasra/Shadda/Sukun/Tanwin), Quranic (Maddah/Hamza Above/Below/Subscript Alef), and Rare/Extended marks.
- Morphological Analysis (Root Extraction):
  - Trilateral (3-letter) and Quadrilateral (4-letter) roots.
  - Weak letter handling (I'lal) for First, Middle, and Last positions.
  - Geminated root resolution (Mudha'af: مدد→مد).
  - Hamzated root handling (المهموز) for Hamza in start, middle, and end positions.
  - Pattern-based root extraction via 40+ morphological templates (Awzan).
  - Sun and Moon letter definite article assimilation.
  - Foreign/strange word filtering (e.g., مانديلا, فرنسا).
  - Fuzzy Normalization: Optional ى→ي and ة→ه normalization for orthographic variation tolerance.
- Regex-Free Sanitization: Optimized filters for non-alphanumeric removal, 10x faster than Regex.Replace.
- Integer-Packed Lookups: Roots and weak letters packed into ulong/uint for O(1) hash-based lookups with zero string allocation.
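The integer-packed lookup idea can be illustrated with a small sketch. It assumes each root character fits in one UTF-16 code unit, so up to four characters pack into a single ulong. `PackedRootSketch`, `Pack`, and `IsKnownRoot` are hypothetical names, and the two sample roots are placeholders; this is not the library's actual data or implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of integer-packed lookups; not the library's types.
public static class PackedRootSketch
{
    // Packs 1 to 4 UTF-16 chars into one ulong, 16 bits per char.
    public static ulong Pack(ReadOnlySpan<char> root)
    {
        if (root.Length == 0 || root.Length > 4)
            throw new ArgumentException("Expected 1 to 4 characters.", nameof(root));

        ulong key = 0;
        foreach (char c in root)
            key = (key << 16) | c;
        return key;
    }

    // Placeholder data: two trilateral roots (k-t-b and d-r-s).
    private static readonly HashSet<ulong> Roots = new HashSet<ulong>
    {
        Pack("\u0643\u062A\u0628"),
        Pack("\u062F\u0631\u0633"),
    };

    // Membership test on a span; no string is created for the candidate.
    public static bool IsKnownRoot(ReadOnlySpan<char> candidate) =>
        candidate.Length >= 1 && candidate.Length <= 4 && Roots.Contains(Pack(candidate));
}
```

Because the key is a plain integer, hashing and equality are single-word operations, which is the property the feature list attributes to the packed sets.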
Quick Start
1. Removing Diacritics (Hot Path)
using Linguistics;
string text = "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ";
// Returns original string if no diacritics found (zero allocation)
string clean = ArabicDiacritics.RemoveDiacritics(text);
// Output: "بسم الله الرحمن الرحيم"
2. Quranic Normalization (Replacement Engine)
using Linguistics;
using Linguistics.Data;
string quranText = "ٱلْحَمْدُ"; // Contains Alef Wasla (\u0671)
// Greedy matching replaces compound symbols before stripping remaining marks
string normalized = ArabicDiacritics.Normalize(quranText, ArabicDiacriticsPatterns.Patterns);
// Output: "الحمد"
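To show why compound replacements have to run before simple marks are stripped, here is a minimal sketch of greedy normalization. The two rules and their code points are assumptions inferred from this README (Hamza U+0621 + Fatha + Alef becoming Alef with Madda U+0622, and Alef Wasla U+0671 becoming plain Alef); `GreedyNormalizeSketch` is a hypothetical name, and the sketch uses a StringBuilder for clarity rather than the zero-allocation approach the library describes.

```csharp
using System;
using System.Text;

// Hypothetical sketch of "replace compounds first, then strip marks".
public static class GreedyNormalizeSketch
{
    // Longer patterns come first so they win over shorter ones.
    private static readonly (string Pattern, string Replacement)[] Rules =
    {
        ("\u0621\u064E\u0627", "\u0622"), // Hamza + Fatha + Alef -> Alef with Madda
        ("\u0671", "\u0627"),             // Alef Wasla -> Alef
    };

    // Common marks U+064B..U+0652 (Tanwin forms, Fatha, Damma, Kasra, Shadda, Sukun).
    private static bool IsSimpleMark(char c) => c >= '\u064B' && c <= '\u0652';

    public static string Normalize(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));

        var result = new StringBuilder(text.Length);
        int i = 0;
        while (i < text.Length)
        {
            bool matched = false;
            foreach (var (pattern, replacement) in Rules)
            {
                if (text.AsSpan(i).StartsWith(pattern.AsSpan(), StringComparison.Ordinal))
                {
                    result.Append(replacement);
                    i += pattern.Length;
                    matched = true;
                    break;
                }
            }
            if (matched) continue;

            // No compound rule applies here: drop simple marks, keep everything else.
            if (!IsSimpleMark(text[i])) result.Append(text[i]);
            i++;
        }
        return result.ToString();
    }
}
```

If the Fatha were stripped first, the Hamza and Alef would remain as two separate letters and the compound rule could no longer match, which is the ordering problem greedy matching avoids.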
3. Root Extraction (Stemming)
using Linguistics;
string word = "يكتبون"; // "They are writing"
// Runs full pipeline: Clean → Filter → Stem → Resolve
string root = ArabicMorphologyHelper.FormatWord(word, applyFuzzyNormalization: true);
// Output: "كتب" (K-T-B root)
4. Text Sanitization (via Facade)
using COMN.Utils;
string dirty = "Hello! @تَجرِبَة#";
// Removes diacritics AND non-alphanumeric symbols, with no regex allocations
string clean = TextUtils.RemoveTashkil(dirty, removeNoneAlphaNum: true);
// Output: "Helloتجربة"
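A regex-free filter of this kind can be sketched as a single pass over the characters. The sketch below is an assumption about the general technique, not the code behind `TextUtils.RemoveTashkil`; `SanitizeSketch` and `KeepLettersAndDigits` are hypothetical names. It relies on `char.IsLetterOrDigit`, which also rejects Arabic combining marks because they are non-spacing marks rather than letters.

```csharp
using System;

// Hypothetical single-pass sanitizer; not the library's implementation.
public static class SanitizeSketch
{
    // Span-to-span form: the caller owns the output buffer.
    // Returns the number of chars written to output.
    public static int KeepLettersAndDigits(ReadOnlySpan<char> input, Span<char> output)
    {
        if (output.Length < input.Length)
            throw new ArgumentException("Output buffer is too small.", nameof(output));

        int written = 0;
        foreach (char c in input)
        {
            // Works per UTF-16 code unit, so surrogate pairs are dropped as well.
            if (char.IsLetterOrDigit(c)) output[written++] = c;
        }
        return written;
    }

    // String convenience form: small inputs use the stack, larger ones a heap array.
    public static string KeepLettersAndDigits(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));

        Span<char> buffer = text.Length <= 256 ? stackalloc char[256] : new char[text.Length];
        int written = KeepLettersAndDigits(text.AsSpan(), buffer);

        // Nothing was removed: hand back the original instance.
        return written == text.Length ? text : new string(buffer.Slice(0, written));
    }
}
```

Compared with `Regex.Replace`, this avoids the regex engine and intermediate match objects; whether it reaches the speedup quoted in the feature list depends on the input and is not measured here.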
5. Zero-Allocation Root Extraction
using Linguistics;
ReadOnlySpan<char> input = "الكتاب";
Span<char> outputBuffer = stackalloc char[64];
int length = ArabicMorphologyHelper.FormatWord(input, outputBuffer, applyFuzzyNormalization: false);
string root = new string(outputBuffer.Slice(0, length));
// Output: "كتب" (zero heap allocations for processing)
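The fixed 64-char stack buffer above suits single words. For inputs of unknown size, a common pattern is to use the stack for small inputs and rent from `ArrayPool<char>` otherwise. The sketch below shows that pattern with a stand-in transform; `BufferChoiceSketch`, `CopyWithoutWhitespace`, and the 128-char limit are hypothetical and only illustrate the buffer handling, not the library's pipeline.

```csharp
using System;
using System.Buffers;

// Hypothetical sketch of choosing between stackalloc and ArrayPool<char>.
public static class BufferChoiceSketch
{
    private const int StackLimit = 128; // assumed threshold for stack use

    // Stand-in for any span-to-span transform that never grows its input.
    public static int CopyWithoutWhitespace(ReadOnlySpan<char> input, Span<char> output)
    {
        int written = 0;
        foreach (char c in input)
        {
            if (!char.IsWhiteSpace(c)) output[written++] = c;
        }
        return written;
    }

    public static string Process(ReadOnlySpan<char> input)
    {
        if (input.Length <= StackLimit)
        {
            // Small input: scratch space lives on the stack.
            Span<char> stackBuffer = stackalloc char[StackLimit];
            int length = CopyWithoutWhitespace(input, stackBuffer);
            return new string(stackBuffer.Slice(0, length));
        }

        // Large input: rent a pooled array and always return it.
        char[] rented = ArrayPool<char>.Shared.Rent(input.Length);
        try
        {
            int length = CopyWithoutWhitespace(input, rented);
            return new string(rented, 0, length);
        }
        finally
        {
            ArrayPool<char>.Shared.Return(rented);
        }
    }
}
```

Only the final string is allocated; the scratch space is either stack memory or a reused pooled array.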
⚠️ Exception handling: Methods throw ArgumentException on null/empty input. All public methods are thread-safe.
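The fast path noted in the first Quick Start example (returning the original string when no diacritics are present) combines naturally with bitmask detection. The following is a minimal sketch under stated assumptions: it covers only the common marks U+064B to U+0652 with a single 64-bit mask, and `DiacriticScanSketch` and its members are hypothetical names rather than the library's implementation.

```csharp
using System;

// Hypothetical sketch of bitmask-based diacritic detection with a no-op fast path.
public static class DiacriticScanSketch
{
    private const char Base = '\u064B';    // ARABIC FATHATAN, first common mark
    private const ulong CommonMask = 0xFF; // bits 0..7 map to U+064B..U+0652

    public static bool IsCommonDiacritic(char c)
    {
        // Chars below Base wrap to a large value and fail the range check.
        uint offset = unchecked((uint)(c - Base));
        return offset < 64 && ((CommonMask >> (int)offset) & 1UL) != 0;
    }

    public static string RemoveCommonDiacritics(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));

        // Fast path: scan for the first mark; if none, return the same instance.
        int first = 0;
        while (first < text.Length && !IsCommonDiacritic(text[first])) first++;
        if (first == text.Length) return text;

        // Slow path: copy the clean prefix, then filter the remainder.
        Span<char> buffer = text.Length <= 256 ? stackalloc char[256] : new char[text.Length];
        text.AsSpan(0, first).CopyTo(buffer);
        int written = first;
        for (int i = first + 1; i < text.Length; i++)
        {
            char c = text[i];
            if (!IsCommonDiacritic(c)) buffer[written++] = c;
        }
        return new string(buffer.Slice(0, written));
    }
}
```

A multi-category design like the one described here would presumably use one mask per category; the single mask keeps the sketch short.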
Performance
This library avoids string.Replace and Regex on hot paths:
| Operation | Technique | Complexity |
|---|---|---|
| Diacritic detection | Bitmask scan (3 category masks) | O(1) per char |
| Root lookup | ulong-packed hashset | O(1) |
| Weak letter lookup | uint-packed hashset | O(1) |
| Geminated root lookup | uint-packed hashset | O(1) |
| Stop word lookup (Span) | Length-bucketed linear scan | O(k), k = bucket size |
| Pattern matching | Length-bucketed with pre-calculated root indices | O(p), p = patterns per bucket |
| Text buffers | stackalloc / ArrayPool<char> | Zero heap alloc |
| Data loading | Compiled into DLL (static fields) | Zero IO at runtime |
For detailed performance characteristics, buffer sizes, and memory profiles, see the Advanced Usage section of MANUAL.md.
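The length-bucketed linear scan listed in the table can be sketched as follows. Words are grouped by length so a lookup compares the candidate only against words of the same length, without creating a string. `StopWordBucketsSketch` is a hypothetical name and the four sample words are placeholders, not the library's stop word list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a length-bucketed stop word lookup over spans.
public static class StopWordBucketsSketch
{
    // Buckets[n] holds only the words whose length is n.
    private static readonly string[][] Buckets = BuildBuckets(
        "\u0641\u064A",        // a 2-letter preposition
        "\u0645\u0646",        // a 2-letter preposition
        "\u0639\u0644\u0649",  // a 3-letter preposition
        "\u0627\u0644\u0649"); // a 3-letter preposition

    private static string[][] BuildBuckets(params string[] words)
    {
        int maxLength = 0;
        foreach (string word in words) maxLength = Math.Max(maxLength, word.Length);

        var lists = new List<string>[maxLength + 1];
        for (int i = 0; i < lists.Length; i++) lists[i] = new List<string>();
        foreach (string word in words) lists[word.Length].Add(word);

        return Array.ConvertAll(lists, list => list.ToArray());
    }

    // O(k) where k is the number of words sharing the candidate's length.
    public static bool IsStopWord(ReadOnlySpan<char> word)
    {
        if (word.Length >= Buckets.Length) return false;

        foreach (string candidate in Buckets[word.Length])
        {
            if (word.SequenceEqual(candidate.AsSpan())) return true;
        }
        return false;
    }
}
```

A hash set keyed by string would need the candidate materialized as a string first; the bucketed scan trades a few extra comparisons for working directly on the span.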
Architecture
The library follows a SOLID architecture with clear separation of concerns:
Linguistics (namespace)
├── ArabicDiacritics         - Diacritic detection, removal, normalization
├── ArabicDiacriticsPattern  - Compiled pattern struct for normalization rules
├── ArabicMorphologyHelper   - Main orchestrator (FormatWord pipeline)
├── MorphologyResult         - Ref struct for zero-allocation word mutation
├── ArabicRootExtractor      - Root extraction logic (geminated, hamzated, weak)
├── ArabicAffixStripper      - Affix stripping (articles, prefixes, suffixes)
├── ArabicTriRoots           - Trilateral root validation (ulong-packed)
├── ArabicQuadRoots          - Quadrilateral root validation (ulong-packed)
├── ArabicTriPatterns        - Pattern-based root extraction (Awzan)
├── ArabicDuplicates         - Geminated root detection (uint-packed)
├── ArabicFirstWeaks         - First-position weak letter detection
├── ArabicMiddleWeaks        - Middle-position weak letter detection
├── ArabicLastWeaks          - Last-position weak letter detection
├── ArabicStopWords          - Stop word filtering (length-bucketed)
├── ArabicStrange            - Foreign word filtering (length-bucketed)
├── TextPunctuation          - Punctuation detection and removal
│
└── Data (namespace)
    ├── ArabicConstants      - String constants for Arabic characters
    ├── ArabicConstantsChar  - Char constants for high-performance processing
    ├── ArabicRootsData      - Trilateral, Quadrilateral, Geminated root sets
    ├── ArabicAffixesData    - Prefixes, Suffixes, Definite Articles
    ├── ArabicStopWordsData  - Stop word list
    ├── ArabicStrangeData    - Foreign word list
    ├── ArabicWeaksData      - Weak letter data (by position)
    └── ArabicPatternsData   - Morphology templates + Diacritic normalization patterns
Acknowledgments
Built upon Khoja's Arabic Stemmer (Khoja & Garside, 1999), extended and optimized for:
- Zero-allocation design with modern .NET (Span<T>, ref struct)
- SOLID architecture (ArabicRootExtractor, ArabicAffixStripper)
- Enhanced accuracy: Pattern X geminated root handling, priority-based weak root resolution
- Production validation: 97.9% test coverage
Reference: Khoja, S., & Garside, R. (1999). Stemming Arabic Text. Lancaster, UK: Computing Department, Lancaster University.
Contributing
Contributions are welcome! Please open an issue or pull request on GitHub. For local development:
dotnet restore src/Linguistics.slnx
dotnet build src/Linguistics.slnx
dotnet test src/Linguistics.slnx
dotnet test src/Linguistics.slnx --settings src/codecoverage.runsettings --collect "Code Coverage;Format=cobertura"
Test summary: total: 191, failed: 0, succeeded: 191, skipped: 0, duration: 5s
Bonus:
Install DotCov, a toolkit that streams Cobertura XML coverage reports. It ships as a zero-dependency parser and a dotnet global tool, and it handles 50 MB+ reports without loading the DOM.
dotnet tool install -g DotCov.Tool
dotcov report src/Linguistics.Tests/TestResults/f1c4408e-8eec-4c51-9965-73219968d341/coverage-2026-05-24.20-30-36.cobertura.xml
File Lines Line % Branches Branch %
--------------------------------------------------------------------------------------------------------------------------
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicRootExtractor.cs 100/147 68.0% 69/102 67.6%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicStrange.cs 25/34 73.5% 12/16 75.0%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicStopWords.cs 32/37 86.5% 14/16 87.5%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicDiacritics.cs 161/172 93.6% 60/68 88.2%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicTriPatterns.cs 75/80 93.8% 47/50 94.0%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicFirstWeaks.cs 17/18 94.4% 5/6 83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicMorphologyHelper.cs 129/136 94.9% 75/82 91.5%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicAffixStripper.cs 57/59 96.6% 25/30 83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\TextPunctuation.cs 100/102 98.0% 43/44 97.7%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicDefiniteArticle.cs 6/6 100.0% - -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicDiacriticsPattern.cs 14/14 100.0% 2/2 100.0%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicDuplicates.cs 13/13 100.0% 5/6 83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicLastWeaks.cs 22/22 100.0% 5/6 83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicMiddleWeaks.cs 18/18 100.0% 5/6 83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicPrefixes.cs 6/6 100.0% - -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicQuadRoots.cs 25/25 100.0% 5/6 83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicSuffixes.cs 6/6 100.0% - -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\ArabicTriRoots.cs 23/23 100.0% 5/6 83.3%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\MorphologyResult.cs 64/64 100.0% 33/34 97.1%
D:\dev-projects\dotnet\Linguistics\src\Linguistics\Data\ArabicAffixesData.cs 12/12 100.0% - -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\Data\ArabicPatternsData.cs 105/105 100.0% - -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\Data\ArabicRootsData.cs 392/392 100.0% - -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\Data\ArabicStopWordsData.cs 30/30 100.0% - -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\Data\ArabicStrangeData.cs 4/4 100.0% - -
D:\dev-projects\dotnet\Linguistics\src\Linguistics\Data\ArabicWeaksData.cs 243/243 100.0% - -
--------------------------------------------------------------------------------------------------------------------------
TOTAL 1679/1768 95.0% 410/480 85.4%
The test suite uses NUnit with FluentAssertions and runs with parallel fixture execution for fast feedback. Tests cover:
- All public API methods across every class
- Morphological phenomena (geminated, hamzated, weak, pattern-based)
- Data integrity (sorting, non-empty, consistency between String and Span overloads)
- Edge cases (buffer overflow, sun/moon letters, complex affix combinations)
License