SemanticKernel.Reranker.BM25
1.0.0
BM25 Reranker
A robust C# library for reranking search results using the classic BM25 algorithm with advanced natural language processing, leveraging the Catalyst NLP library.
Table of Contents
- Introduction
- Why BM25 with NLP?
- Features
- Getting Started
- Usage Example
- How It Works
- Customization
- License
Introduction
This project provides a flexible C# implementation of BM25, a well-established ranking function widely used by search engines, combined with natural language preprocessing.
With this library, you can rerank search results or candidate passages using sophisticated tokenization, lemmatization, stop word removal, and multi-language support through the Catalyst NLP library.
Why BM25 with NLP?
Traditional BM25 relies on exact token overlap between query and document. However, raw text processing can be noisy:
- Text contains punctuation, stop words, and mixed case.
- Word forms vary: "running" vs. "run", "cars" vs. "car".
- Different languages require different processing approaches.
By incorporating advanced NLP preprocessing:
- The reranker uses lemmatization to normalize word forms (running → run).
- Automatic language detection ensures proper processing for multilingual content.
- Stop words are filtered out to focus on meaningful terms.
- Part-of-speech tagging helps identify important content words.
Because queries and documents are normalized in the same way, BM25's exact-token comparison can match terms that differ only in surface form.
Features
- BM25 core algorithm: Highly tunable (k1, b parameters).
- Advanced NLP processing: Powered by the Catalyst library for tokenization and linguistic analysis.
- Multi-language support: Automatic language detection with support for English, French, German, and more.
- Intelligent preprocessing: Lemmatization, stop word removal, and part-of-speech filtering.
- Asynchronous processing: Tokenization and scoring are exposed through async methods such as RankAsync, so callers can await them without blocking.
- Easy to extend: Customizable parameters and configurable language models.
Getting Started
Prerequisites
- .NET 8.0+
Installation
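- Install the package from the NuGet Package Manager Console or with the .NET CLI:
NuGet\Install-Package SemanticKernel.Reranker.BM25 -Version 1.0.0
dotnet add package SemanticKernel.Reranker.BM25 --version 1.0.0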
Usage Example
using System;
using System.Collections.Generic;
using SemanticKernel.Reranker.BM25;

// Sample documents to index
var documents = new List<string>
{
    "The quick brown fox jumps over the lazy dog.",
    "A brown dog jumps over another dog.",
    "The quick brown fox.",
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing helps computers understand human language."
};

// Create BM25 reranker with default parameters (k1=1.5, b=0.75)
var bm25 = new BM25Reranker(documents);

// Rank documents for a query and keep the three best matches
var results = await bm25.RankAsync("quick brown fox", topN: 3);

// Display each result's index into the original list, its score, and its text
foreach (var (documentIndex, score) in results)
{
    Console.WriteLine($"Document #{documentIndex}: Score = {score:F4}");
    Console.WriteLine($"Content: {documents[documentIndex]}");
    Console.WriteLine();
}
How It Works
Document Preprocessing: Each document is processed through the Catalyst NLP pipeline:
- Automatic language detection
- Tokenization into individual words
- Lemmatization to normalize word forms
- Stop word removal
- Part-of-speech filtering (removes punctuation and symbols)
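To make the preprocessing steps above concrete, here is a minimal sketch that does not use Catalyst at all. `SimplePreprocessor`, its `Tokenize` method, and the tiny stop word list are hypothetical names invented for illustration. The sketch only lowercases, strips punctuation, and drops stop words. Language detection, lemmatization, and part-of-speech tagging are what the library delegates to Catalyst and are not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: reduce a sentence to its content tokens.
var tokens = SimplePreprocessor.Tokenize("The quick brown fox jumps over the lazy dog.");
Console.WriteLine(string.Join(", ", tokens)); // quick, brown, fox, jumps, lazy, dog

// Hypothetical stand-in for the NLP pipeline (illustration only, not the library's code).
public static class SimplePreprocessor
{
    // Tiny illustrative stop word list; real lists are language-specific and much longer.
    private static readonly HashSet<string> StopWords =
        new(StringComparer.OrdinalIgnoreCase) { "the", "a", "an", "is", "of", "over", "and" };

    public static List<string> Tokenize(string text)
    {
        // Lowercase letters and digits; turn everything else (punctuation, symbols) into spaces.
        var cleaned = new string(
            text.Select(c => char.IsLetterOrDigit(c) ? char.ToLowerInvariant(c) : ' ').ToArray());

        // Split on whitespace and drop stop words. No lemmatization happens here,
        // so "jumps" stays "jumps" instead of becoming "jump".
        return cleaned
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Where(t => !StopWords.Contains(t))
            .ToList();
    }
}
```

With a lemmatizer in place, "jumps" would normalize to "jump", so a query containing "jumping" could still match this document.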
Index Building: The system builds an inverted index with:
- Document frequency (DF) for each term
- Document lengths and average document length
- Preprocessed token lists for efficient scoring
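The index statistics listed above can be sketched as follows. This is an illustrative outline, not the library's internal data structure: the `Bm25Index` record and its `Build` method are assumed names, and the input is taken to be documents that have already been tokenized.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: two already-tokenized documents.
var tokenized = new List<List<string>>
{
    new() { "quick", "brown", "fox" },
    new() { "brown", "dog", "dog" },
};
var index = Bm25Index.Build(tokenized);
Console.WriteLine(index.DocumentFrequency["brown"]); // 2 (appears in both documents)
Console.WriteLine(index.DocumentFrequency["dog"]);   // 1 (counted once per document)

// Hypothetical container for the statistics BM25 needs (illustration only).
public sealed record Bm25Index(
    IReadOnlyDictionary<string, int> DocumentFrequency,
    IReadOnlyList<int> DocumentLengths,
    double AverageLength)
{
    public static Bm25Index Build(IReadOnlyList<List<string>> tokenizedDocs)
    {
        var df = new Dictionary<string, int>();
        foreach (var docTokens in tokenizedDocs)
        {
            // Document frequency counts a term once per document,
            // no matter how often it repeats inside that document.
            foreach (var term in docTokens.Distinct())
            {
                df.TryGetValue(term, out var count);
                df[term] = count + 1;
            }
        }

        // Document length is measured in tokens after preprocessing.
        var lengths = tokenizedDocs.Select(t => t.Count).ToList();
        var average = lengths.Count == 0 ? 0.0 : lengths.Average();

        return new Bm25Index(df, lengths, average);
    }
}
```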
Query Processing: The query text undergoes the same NLP preprocessing as the documents.
BM25 Scoring: For each document, the reranker calculates the BM25 score using:
- Term frequency (TF) in the document
- Inverse document frequency (IDF)
- Document length normalization
- Tunable parameters k1 and b
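The scoring step can be sketched in a few lines. The README does not say which IDF variant the library uses, so this sketch assumes the common non-negative form ln(1 + (N - df + 0.5) / (df + 0.5)). `Bm25Sketch` and `ScoreAll` are hypothetical names, and the inputs are assumed to be preprocessed tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: score three tokenized documents against a tokenized query.
var corpus = new List<string[]>
{
    new[] { "quick", "brown", "fox" },
    new[] { "brown", "dog", "jump", "dog" },
    new[] { "machine", "learning" },
};
var query = new[] { "quick", "fox" };

var results = Bm25Sketch.ScoreAll(corpus, query);
for (var i = 0; i < results.Length; i++)
    Console.WriteLine($"Document #{i}: {results[i]:F4}");

// Illustrative BM25 scorer (assumed IDF variant, not the library's code).
public static class Bm25Sketch
{
    public static double[] ScoreAll(
        IReadOnlyList<string[]> docs,
        IReadOnlyCollection<string> queryTerms,
        double k1 = 1.5,
        double b = 0.75)
    {
        var scores = new double[docs.Count];
        if (docs.Count == 0) return scores;

        var avgLen = docs.Average(d => (double)d.Length);
        if (avgLen == 0) return scores; // every document is empty

        // Document frequency: number of documents containing each term.
        var df = new Dictionary<string, int>();
        foreach (var doc in docs)
        {
            foreach (var term in doc.Distinct())
            {
                df.TryGetValue(term, out var count);
                df[term] = count + 1;
            }
        }

        for (var i = 0; i < docs.Count; i++)
        {
            var tokens = docs[i];
            // Length normalization: longer-than-average documents are penalized when b > 0.
            var norm = k1 * (1 - b + b * tokens.Length / avgLen);

            foreach (var term in queryTerms)
            {
                var tf = tokens.Count(t => t == term);
                if (tf == 0) continue; // term absent: contributes nothing

                var idf = Math.Log(1 + (docs.Count - df[term] + 0.5) / (df[term] + 0.5));
                scores[i] += idf * tf * (k1 + 1) / (tf + norm);
            }
        }

        return scores;
    }
}
```

Only the first document contains a query term here, so it is the only one with a non-zero score.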
Customization
BM25 Parameters
You can customize the BM25 algorithm behavior:
// Custom k1 and b parameters
var bm25 = new BM25Reranker(documents, k1: 2.0, b: 0.5);
- k1 (default: 1.5): Controls term frequency saturation. Higher values give more weight to repeated terms.
- b (default: 0.75): Controls document length normalization. 0 = no normalization, 1 = full normalization.
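For reference, the roles of k1 and b are easiest to see in the standard Okapi BM25 formula. This is the textbook formulation and the library's exact variant, in particular its IDF term, is not stated in this README. Here f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length in tokens, and avgdl is the average document length.

```latex
\operatorname{score}(D, Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \,\frac{|D|}{\mathrm{avgdl}}\right)}
```

Setting b = 0 reduces the denominator to f(q_i, D) + k_1, so document length has no effect. Setting b = 1 scales k_1 by the full ratio |D| / avgdl. As k_1 grows, the fraction saturates more slowly, so repeated terms keep adding to the score for longer.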
Language Support
The library automatically detects document language and applies appropriate NLP models. Supported languages include:
- English
- French
- German
- Additional languages supported by Catalyst
License
This project is licensed under the MIT License - see the LICENSE.txt file for details.
Product | Compatible and additional computed target framework versions |
---|---|
.NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
- net8.0
  - Catalyst (>= 1.0.54164)
  - Catalyst.Models.English (>= 1.0.30952)
  - Catalyst.Models.French (>= 1.0.30952)
  - Catalyst.Models.German (>= 1.0.30952)
  - Microsoft.SemanticKernel.Core (>= 1.61.0)
  - System.Numerics.Tensors (>= 9.0.7)
Version | Downloads | Last Updated |
---|---|---|
1.0.0 | 12 | 8/19/2025 |
0.0.2 | 8 | 8/20/2025 |
0.0.1 | 12 | 8/20/2025 |
0.0.1-alpha01 | 10 | 8/19/2025 |