Install-Package PragmaticSegmenterNet -Version 1.0.5
dotnet add package PragmaticSegmenterNet --version 1.0.5
<PackageReference Include="PragmaticSegmenterNet" Version="1.0.5" />
paket add PragmaticSegmenterNet --version 1.0.5
#r "nuget: PragmaticSegmenterNet, 1.0.5"
// Install PragmaticSegmenterNet as a Cake Addin
#addin nuget:?package=PragmaticSegmenterNet&version=1.0.5
// Install PragmaticSegmenterNet as a Cake Tool
#tool nuget:?package=PragmaticSegmenterNet&version=1.0.5
This project is a direct port of Pragmatic Segmenter, which provides rule-based sentence boundary detection.
The Segmenter class provides the Segment method, which in its simplest usage takes a string:
using System.Collections.Generic;
using PragmaticSegmenterNet;

IReadOnlyList<string> result = Segmenter.Segment("One Sentence. And another sentence.");
// ["One Sentence.", "And another sentence."]

IReadOnlyList<string> result2 = Segmenter.Segment("Anything.", Language.Italian);
// ["Anything"]
The Segment method has a number of optional parameters:
IReadOnlyList<string> Segment(string text, Language language = Language.English, bool cleanText = true, DocumentType documentType = DocumentType.Any)
- Language - An enum representing the language of the input text. Defaults to English; see the list of supported languages below.
- CleanText - A boolean indicating whether the input text should be cleaned prior to segmentation. Cleaning removes extra newlines and whitespace. Defaults to true.
- DocumentType - Used by the text cleaning process to determine which reformatting to apply. For PDFs this handles newlines in the middle of a sentence, whereas for HTML documents it handles HTML tags. Defaults to Any, which does not apply any special formatting.
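The optional parameters can be supplied by name, so a single option can be overridden without restating the rest. The following is a minimal sketch based only on the signature shown above; the class name SegmentOptionsExample is made up for illustration, and only DocumentType.Any is used because that is the only DocumentType value named here.

```csharp
using System;
using System.Collections.Generic;
using PragmaticSegmenterNet;

// Illustrative wrapper class; not part of the library.
public static class SegmentOptionsExample
{
    public static void Main()
    {
        const string text = "One Sentence. And another sentence.";

        // Default call: Language.English, cleanText = true, DocumentType.Any.
        IReadOnlyList<string> defaults = Segmenter.Segment(text);

        // Override a single option by name; the others keep their defaults.
        IReadOnlyList<string> uncleaned = Segmenter.Segment(text, cleanText: false);

        // Every optional parameter spelled out, using the parameter names
        // from the signature above.
        IReadOnlyList<string> explicitCall = Segmenter.Segment(
            text,
            language: Language.English,
            cleanText: true,
            documentType: DocumentType.Any);

        foreach (string sentence in explicitCall)
        {
            Console.WriteLine(sentence);
        }
    }
}
```

Named arguments keep call sites readable when only cleanText or documentType differs from the default.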
Supported languages:
- English = 0 (default)
- Amharic = 1
- Arabic = 2
- Armenian = 3
- Bulgarian = 4
- Burmese = 5
- Chinese = 6
- Danish = 7
- Dutch = 8
- French = 9
- German = 10
- Greek = 11
- Hindi = 12
- Italian = 13
- Japanese = 14
- Kazakh = 15 (partial support, potentially only for the Cyrillic form of the alphabet)
- Persian = 16
- Polish = 17
- Russian = 18
- Spanish = 19
- Urdu = 20
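Because each language has a fixed integer value, a stored numeric code (for example from a settings file) can be converted back to the enum. This is a sketch only: FromCode is a hypothetical helper, not part of the library, and it assumes the Language enum uses the default int underlying type.

```csharp
using System;
using PragmaticSegmenterNet;

// Illustrative wrapper class; not part of the library.
public static class LanguageLookupExample
{
    // Hypothetical helper: converts an integer code into a Language value,
    // falling back to English when the code is not a defined enum member.
    // Assumes Language has the default int underlying type.
    public static Language FromCode(int code)
    {
        return Enum.IsDefined(typeof(Language), code)
            ? (Language)code
            : Language.English;
    }

    public static void Main()
    {
        // 13 is listed above as Italian; 99 is not in the list.
        Console.WriteLine(FromCode(13));
        Console.WriteLine(FromCode(99));
    }
}
```

Checking with Enum.IsDefined before casting avoids passing an undefined enum value to Segment, since a plain cast from int succeeds even for values outside the list.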
- Fixes an issue with non-breaking spaces in numbered lists
- Fixes an issue with text containing regex replacement groups, e.g.
- Fixes an issue with periods following abbreviations.
- Fixes an issue with single character inputs.
This project wouldn't be possible without the work done by the Pragmatic Segmenter team. Any bugs in the code are entirely my fault.
- No dependencies.