PragmaticSegmenterNet 1.0.5

Install-Package PragmaticSegmenterNet -Version 1.0.5
dotnet add package PragmaticSegmenterNet --version 1.0.5
<PackageReference Include="PragmaticSegmenterNet" Version="1.0.5" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.
paket add PragmaticSegmenterNet --version 1.0.5
#r "nuget: PragmaticSegmenterNet, 1.0.5"
#r directive can be used in F# Interactive, C# scripting and .NET Interactive. Copy this into the interactive tool or source code of the script to reference the package.
// Install PragmaticSegmenterNet as a Cake Addin
#addin nuget:?package=PragmaticSegmenterNet&version=1.0.5

// Install PragmaticSegmenterNet as a Cake Tool
#tool nuget:?package=PragmaticSegmenterNet&version=1.0.5

This project is a direct .NET port of Pragmatic Segmenter, which provides rule-based sentence boundary detection.

Usage

The Segmenter class provides the Segment method, which in its simplest form takes a single string:

using System.Collections.Generic;
using PragmaticSegmenterNet;

IReadOnlyList<string> result = Segmenter.Segment("One Sentence. And another sentence.");

// ["One Sentence.", "And another sentence."]

IReadOnlyList<string> result2 = Segmenter.Segment("Anything.", Language.Italian);

// ["Anything."]

The Segment method has a number of optional parameters:

IReadOnlyList<string> Segment(string text, Language language = Language.English, bool cleanText = true, DocumentType documentType = DocumentType.Any)
  • language - An enum selecting one of the supported languages (see the Languages list below). Defaults to Language.English.
  • cleanText - A boolean indicating whether the input text should be cleaned before segmentation. Cleaning removes extra newlines and whitespace. Defaults to true.
  • documentType - Used by the text cleaning process to determine which reformatting to apply. For PDFs, this handles newlines in the middle of a sentence; for HTML documents, it handles HTML tags. Defaults to DocumentType.Any, which applies no special formatting.
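
Because every parameter after text is optional, C# named arguments can set one of them without spelling out the others. The following is a minimal sketch that relies only on the signature shown above; the sample strings and the Example class are illustrative, and no output is shown because it depends on the library's rules.

```csharp
using System;
using System.Collections.Generic;
using PragmaticSegmenterNet;

class Example
{
    static void Main()
    {
        // Skip the cleaning step but keep the default language and document type.
        IReadOnlyList<string> uncleaned = Segmenter.Segment(
            "First sentence. Second sentence.",
            cleanText: false);

        // Choose a language explicitly; DocumentType.Any is the documented default,
        // passed here only to show the parameter name.
        IReadOnlyList<string> italian = Segmenter.Segment(
            "Prima frase. Seconda frase.",
            language: Language.Italian,
            documentType: DocumentType.Any);

        // Print each detected sentence on its own line.
        foreach (string sentence in uncleaned)
        {
            Console.WriteLine(sentence);
        }

        foreach (string sentence in italian)
        {
            Console.WriteLine(sentence);
        }
    }
}
```

Named arguments keep the call readable if more optional parameters are added later, since the call does not depend on their position.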

Languages

  • English = 0 (default)
  • Amharic = 1
  • Arabic = 2
  • Armenian = 3
  • Bulgarian = 4
  • Burmese = 5
  • Chinese = 6
  • Danish = 7
  • Dutch = 8
  • French = 9
  • German = 10
  • Greek = 11
  • Hindi = 12
  • Italian = 13
  • Japanese = 14
  • Kazakh = 15 (partial support, potentially only for the Cyrillic form of the alphabet)
  • Persian = 16
  • Polish = 17
  • Russian = 18
  • Spanish = 19
  • Urdu = 20

Releases

1.0.5

  • Fixes an issue with non-breaking spaces in numbered lists.

1.0.3

  • Fixes an issue with text containing regex replacement groups, e.g. $0, $1, etc.

1.0.2

  • Fixes an issue with periods following abbreviations.

1.0.1

  • Fixes an issue with single character inputs.

Credit

This project wouldn't be possible without the work done by the Pragmatic Segmenter team. Any bugs in the code are entirely my fault.

Dependencies

  • .NETStandard 2.0: no dependencies.

Version   Downloads   Last updated
1.0.5     4,644       7/4/2020
1.0.3     468         2/10/2020
1.0.2     621         10/7/2019
1.0.1     903         11/20/2018
1.0.0     440         9/15/2018