STAF.LLMEval 1.1.0

```shell
dotnet add package STAF.LLMEval --version 1.1.0
```
AI Response Evaluation Library Help
This document provides a guide on how to use the AI Response Evaluation Library. This library allows you to evaluate the responses of AI applications against expected golden outputs and, optionally, reference documents, using various AI models as judges (currently supporting Ollama, OpenAI, and Gemini).
Getting Started
Installation
Add the NuGet package: In your C# project, add a reference to the STAF.LLMEval NuGet package. You can do this with the NuGet Package Manager in Visual Studio or with the .NET CLI:

```shell
dotnet add package STAF.LLMEval
```
Configuration
The library relies on configuration for API endpoints and, in some cases, API keys.
- API endpoints: You'll need to provide the specific API endpoints for the AI models you intend to use (Ollama, OpenAI, Gemini). These can typically be set in your application's configuration (e.g. appsettings.json) or directly in code when creating the EvaluationRequest.
- API keys: For OpenAI and Gemini, you will need to provide API keys. It is strongly recommended to handle these securely using environment variables or user secrets (for development), rather than hardcoding them or placing them directly in the EvaluationRequest in production.
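As a minimal sketch of the recommended approach, an API key can be read from an environment variable and placed in the Configuration dictionary shown later in the Usage section (the "GEMINI_API_KEY" variable name and "gemini-1.5-flash" model name here are illustrative assumptions, not values prescribed by the library):

```csharp
using System;
using System.Collections.Generic;

// Read the key from an environment variable (e.g. set GEMINI_API_KEY in your shell)
// instead of hardcoding it in source or configuration files.
var apiKey = Environment.GetEnvironmentVariable("GEMINI_API_KEY")
             ?? throw new InvalidOperationException("GEMINI_API_KEY is not set.");

var config = new Dictionary<string, string>
{
    ["ApiKey"] = apiKey,
    ["Model"] = "gemini-1.5-flash" // hypothetical model name; use whatever your provider expects
};
```

For production deployments, a dedicated secret store (e.g. a cloud secret manager) is preferable even to environment variables.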
Core Concepts
EvaluationRequest
This class represents the input for evaluating an AI response. It contains the following properties:
- Question (string): The original question asked to the AI application.
- AiResponse (string): The response received from the AI application.
- GoldenOutput (string): The expected, correct response or the reference document.
- ProviderType (enum): Specifies the AI provider (Ollama, OpenAI, Gemini).
- Endpoint (string): The API endpoint for the selected ProviderType.
- Configuration (Dictionary<string, string>): A dictionary for provider-specific configuration, such as API keys (use with caution in production).
- PassThreshold (double): A numerical threshold (between 0 and 1) that the evaluation score must meet or exceed for the evaluation to be considered a pass.
- EvaluationType (enum): Specifies the type of evaluation:
  - DirectComparison: Evaluates the AiResponse against the GoldenOutput using internal logic (e.g. exact match, keyword, semantic).
  - LLMAsJudge: Uses the specified AI model to evaluate the AiResponse based on the Question and GoldenOutput (and optionally IsReferenceDocument).
- IsReferenceDocument (bool): A flag indicating whether the GoldenOutput should be treated as a reference document for the AiResponse.
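For example, a request that scores a response directly against the golden output, without an LLM judge, might look like the sketch below (the local Ollama endpoint is an assumption; whether Endpoint is actually consulted for DirectComparison is not specified by this document):

```csharp
// A DirectComparison request: the library's internal logic (exact match,
// keyword, or semantic comparison) scores AiResponse against GoldenOutput.
var request = new EvaluationRequest
{
    Question = "What is 2 + 2?",
    AiResponse = "4",
    GoldenOutput = "4",
    ProviderType = ProviderType.Ollama,
    Endpoint = "http://localhost:11434", // typical local Ollama endpoint (assumed)
    PassThreshold = 0.9,
    EvaluationType = EvaluationType.DirectComparison,
    IsReferenceDocument = false
};
```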
EvaluationResult
This class represents the output of the evaluation. It contains the following properties:
- Score (double): A numerical score (typically between 0 and 1) indicating the quality or correctness of the AiResponse.
- IsPassed (bool): A boolean indicating whether the Score meets or exceeds the PassThreshold.
- Details (string): Additional information or reasoning for the evaluation, often provided by the LLM judge.
IEvaluationService and AdvancedEvaluationService
The IEvaluationService interface defines the contract for performing evaluations. AdvancedEvaluationService is the concrete implementation that handles both direct comparisons and using LLMs as judges.
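In a larger application, these types can be wired up through dependency injection. A minimal sketch using Microsoft.Extensions.DependencyInjection, assuming the constructor shapes shown in the Usage section below:

```csharp
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Register the factory and the service against their interfaces so that
// consumers depend only on IEvaluationService.
services.AddSingleton<IAiProviderFactory, AiProviderFactory>();
services.AddSingleton<IEvaluationService, AdvancedEvaluationService>();

using var serviceProvider = services.BuildServiceProvider();
var evalService = serviceProvider.GetRequiredService<IEvaluationService>();
```

In an ASP.NET Core application the same registrations would go in builder.Services instead of a hand-built ServiceCollection.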
IAiProvider and Implementations (OllamaProvider, OpenAIProvider, GeminiProvider)
The IAiProvider interface defines how to interact with different AI models. The concrete implementations handle the specific API calls for each provider.
LLMResponseParser
This class contains static methods to parse the responses from the LLM judges (Ollama and Gemini) to extract the evaluation score and reasoning.
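The parser's actual method names are internal to the library, but the general technique can be illustrated with a standalone sketch (this is not the library's API): given a judge reply such as "Score: 0.9" followed by free-text reasoning, extract both pieces with regular expressions.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Illustrative only -- not LLMResponseParser's actual API.
// Parses a judge reply of the form "Score: 0.9\nReasoning: ..." into a pair.
static (double Score, string Reasoning) ParseJudgeReply(string reply)
{
    var scoreMatch = Regex.Match(reply, @"Score:\s*(?<score>\d+(\.\d+)?)",
                                 RegexOptions.IgnoreCase);
    double score = scoreMatch.Success
        ? double.Parse(scoreMatch.Groups["score"].Value, CultureInfo.InvariantCulture)
        : 0.0;

    var reasonMatch = Regex.Match(reply, @"Reasoning:\s*(?<why>.+)",
                                  RegexOptions.Singleline | RegexOptions.IgnoreCase);
    string reasoning = reasonMatch.Success
        ? reasonMatch.Groups["why"].Value.Trim()
        : reply; // fall back to the whole reply if no marker is found

    return (score, reasoning);
}
```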
Usage
1. Create an EvaluationRequest object: Populate the properties with the necessary information, including the question, AI response, golden output (or reference document), provider type, endpoint, and your desired evaluation settings.

```csharp
var config = new Dictionary<string, string>
{
    ["ApiKey"] = "ActualKeyVal", // Secure this! Prefer environment variables or user secrets.
    ["Model"] = "ModelName"
};

var request = new EvaluationRequest
{
    Question = "What is the capital of France?",
    AiResponse = "Paris, France",
    GoldenOutput = "Paris",
    ProviderType = ProviderType.Gemini,
    Endpoint = "your_gemini_endpoint",
    Configuration = config,
    PassThreshold = 0.8,
    EvaluationType = EvaluationType.LLMAsJudge,
    IsReferenceDocument = false
};
```

2. Instantiate AdvancedEvaluationService: Create an instance of the evaluation service, providing a provider factory. Dependency injection is recommended for managing these dependencies in larger applications.

```csharp
IAiProviderFactory providerFactory = new AiProviderFactory();
IEvaluationService _evalService = new AdvancedEvaluationService(providerFactory);
```

3. Call EvaluateAsync: Call the EvaluateAsync method of the AdvancedEvaluationService with your EvaluationRequest object. It returns an EvaluationResult.

```csharp
EvaluationResult result = await _evalService.EvaluateAsync(request);
Console.WriteLine($"Score: {result.Score}");
Console.WriteLine($"Passed: {result.IsPassed}");
Console.WriteLine($"Details: {result.Details}");
```
Security Considerations
- API Keys: Handle API keys with extreme care. Avoid hardcoding them in your application. Use environment variables, user secrets (for development), or dedicated secret management services (for production).
- Endpoint Security: Ensure the API endpoints you are using are secure (HTTPS).
- Input Validation: Sanitize and validate all input data to prevent potential injection vulnerabilities.
Error Handling
The library includes basic error handling within the provider implementations and the evaluation service. Be prepared to catch exceptions and handle potential issues such as network errors, invalid API responses, or missing configuration. The EvaluationResult.Details property often provides more specific error information.
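The exact exception types the library throws are not documented here, so as a hedged sketch, transport failures can be caught via HttpRequestException (the likely surface for network and HTTP errors when calling provider endpoints) with a general fallback for everything else:

```csharp
using System;
using System.Net.Http;

try
{
    EvaluationResult result = await _evalService.EvaluateAsync(request);
    if (!result.IsPassed)
    {
        // Details often explains why the judge scored the response low.
        Console.WriteLine($"Evaluation failed (score {result.Score}): {result.Details}");
    }
}
catch (HttpRequestException ex)
{
    // Network problems or a non-success status code from the provider endpoint.
    Console.WriteLine($"Provider call failed: {ex.Message}");
}
catch (Exception ex)
{
    // Missing configuration, unparsable judge output, etc.
    Console.WriteLine($"Evaluation error: {ex.Message}");
}
```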
Contributing
If you'd like to contribute to this library, send me a note in the community. Suggestions, improvements, and bug fixes are all welcome.
License
MIT License
Copyright (c) 2025 Sooraj Ramachandran
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
| Product | Compatible and computed target frameworks |
|---|---|
| .NET | net8.0 is compatible. The net8.0 platform-specific TFMs (android, browser, ios, maccatalyst, macos, tvos, windows), net9.0, net10.0, and their platform-specific TFMs were computed. |
Dependencies (net8.0): none.