# Scorers

Detailed reference for all 13 built-in evaluation scorers.
## Import Path

All scorers are imported from the `risicare_evaluation` package:

```python
from risicare_evaluation import FaithfulnessScorer, ScorerInput
```

## Usage Pattern
All scorers follow the same async pattern:
```python
from risicare_evaluation import FaithfulnessScorer, ScorerInput

scorer = FaithfulnessScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    answer="Paris is the capital of France.",
    contexts=["Paris, the capital of France, is known for the Eiffel Tower."],
))

print(result.score)      # 0.0 - 1.0
print(result.passed)     # True/False
print(result.reasoning)  # Explanation
```
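The examples on this page use bare `await` for brevity. In a standalone script, you need an event loop to drive the call; a minimal sketch using only standard-library `asyncio` (input values are the same illustrative ones used above):

```python
import asyncio

from risicare_evaluation import FaithfulnessScorer, ScorerInput

async def main() -> None:
    scorer = FaithfulnessScorer()
    result = await scorer.score(ScorerInput(
        trace_id="abc123def456789012345678abcdef01",
        answer="Paris is the capital of France.",
        contexts=["Paris, the capital of France, is known for the Eiffel Tower."],
    ))
    print(result.score, result.passed)

asyncio.run(main())
```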
## ScorerResult Fields

Every scorer returns a `ScorerResult` with these fields:

| Field | Type | Description |
|---|---|---|
| `scorer_name` | `str` | Name of the scorer |
| `scorer_version` | `str` | Version of the scorer |
| `score` | `float` | Evaluation score (0.0 to 1.0, higher is better) |
| `passed` | `bool` | Whether the score meets the threshold |
| `threshold` | `float` | Threshold used for pass/fail |
| `reasoning` | `str` | Human-readable explanation |
| `evidence` | `list[dict]` | Evidence items supporting the evaluation |
| `sub_scores` | `dict[str, float]` | Component score breakdown |
| `confidence` | `float` | Confidence in the evaluation (0.0 to 1.0) |
| `model_used` | `str \| None` | LLM model used for evaluation |
| `prompt_tokens` | `int` | Prompt tokens consumed |
| `completion_tokens` | `int` | Completion tokens generated |
| `cost_usd` | `float` | Estimated cost in USD |
| `duration_ms` | `float` | Time taken in milliseconds |
| `error` | `str \| None` | Error message if evaluation failed |
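Continuing the usage example above, a sketch of inspecting the richer fields after a run (field names come from the table; the printed values are illustrative, not guaranteed outputs):

```python
# `result` is the ScorerResult returned by any scorer.score(...) call.
for name, value in result.sub_scores.items():
    print(f"{name}: {value:.2f}")  # component score breakdown

for item in result.evidence:
    print(item)  # evidence supporting the verdict

# LLM usage and cost accounting for the evaluation call
print(result.model_used, result.prompt_tokens, result.completion_tokens)
print(f"${result.cost_usd:.4f} in {result.duration_ms:.0f} ms")

if result.error is not None:
    print(f"Evaluation failed: {result.error}")
```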
## RAG Scorers

### faithfulness

Measures whether the response is faithful to the retrieved context.

**Required fields:** `answer`, `contexts`
```python
from risicare_evaluation import FaithfulnessScorer, ScorerInput

scorer = FaithfulnessScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    answer="Paris is the capital of France.",
    contexts=["Paris, the capital of France, is known for the Eiffel Tower."],
))
# result.score = 1.0 (fully faithful)
```

**Score Range:** 0.0 - 1.0 (higher is better)
### answer_relevancy

Measures response relevance to the original query.

**Required fields:** `question`, `answer`
```python
from risicare_evaluation import AnswerRelevancyScorer, ScorerInput

scorer = AnswerRelevancyScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    question="What is the capital of France?",
    answer="Paris is the capital of France.",
))
# result.score = 1.0 (fully relevant)
```

**Score Range:** 0.0 - 1.0 (higher is better)
### context_precision

Measures how much of the retrieved context is relevant to the question.

**Required fields:** `question`, `contexts`
```python
from risicare_evaluation import ContextPrecisionScorer, ScorerInput

scorer = ContextPrecisionScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    question="What is the capital of France?",
    contexts=["Paris is the capital.", "France has wine.", "Unrelated text."],
))
# result.score = 0.33 (1 of 3 chunks relevant)
```

**Score Range:** 0.0 - 1.0 (higher is better)
### context_recall

Measures whether the retrieved context contains all the information needed to support the ground truth.

**Required fields:** `contexts`, `ground_truth`
```python
from risicare_evaluation import ContextRecallScorer, ScorerInput

scorer = ContextRecallScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    contexts=["Paris is the capital of France."],
    ground_truth="Paris, population 2.1 million",
))
# result.score = 0.5 (missing population)
```

**Score Range:** 0.0 - 1.0 (higher is better)
### hallucination

Detects hallucinated information not supported by the context.

**Required fields:** `answer`, `contexts`
```python
from risicare_evaluation import HallucinationScorer, ScorerInput

scorer = HallucinationScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    answer="Paris has a population of 10 million.",
    contexts=["Paris is the capital of France."],
))
# result.score = 0.8 (likely hallucination)
```

**Score Range:** 0.0 - 1.0 (lower is better; a high score indicates hallucination)
## Safety Scorers

### toxicity

Detects offensive, harmful, or inappropriate content.

**Required fields:** `output_text`
```python
from risicare_evaluation import ToxicityScorer, ScorerInput

scorer = ToxicityScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="Here's how to solve that problem...",
))
# result.score = 0.02 (low toxicity)
```

**Score Range:** 0.0 - 1.0 (lower is better)
### bias

Detects unfair, prejudiced, or stereotyping content.

**Required fields:** `output_text`
```python
from risicare_evaluation import BiasScorer, ScorerInput

scorer = BiasScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="All engineers should consider this approach.",
))
# result.score = 0.05 (low bias)
```

**Score Range:** 0.0 - 1.0 (lower is better)
### pii_leakage

Detects leakage of personally identifiable information (PII).

**Required fields:** `output_text`
```python
from risicare_evaluation import PIILeakageScorer, ScorerInput

scorer = PIILeakageScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="Contact John at john@email.com or 555-1234.",
))
# result.score = 0.8 (contains PII)
```

**Score Range:** 0.0 - 1.0 (lower is better)
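All three safety scorers are lower-is-better, so a high score is the failure case. A minimal sketch of gating output on them, assuming an application-chosen cutoff rather than any library default:

```python
import asyncio

from risicare_evaluation import (
    BiasScorer,
    PIILeakageScorer,
    ScorerInput,
    ToxicityScorer,
)

SAFETY_CUTOFF = 0.5  # illustrative application choice, not a library default

async def check_safety(text: str) -> None:
    input_data = ScorerInput(
        trace_id="abc123def456789012345678abcdef01",
        output_text=text,
    )
    for scorer in (ToxicityScorer(), BiasScorer(), PIILeakageScorer()):
        result = await scorer.score(input_data)
        # Lower is better for these scorers, so flag on high scores.
        if result.score > SAFETY_CUTOFF:
            print(f"Flagged by {result.scorer_name}: {result.score:.2f}")

asyncio.run(check_safety("Contact John at john@email.com or 555-1234."))
```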
## Agent Scorers

### tool_correctness

Evaluates whether the agent selected appropriate tools.

**Required fields:** none (uses `expected_tools` and `used_tools`)
```python
from risicare_evaluation import ToolCorrectnessScorer, ScorerInput

scorer = ToolCorrectnessScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    expected_tools=["weather"],
    used_tools=["weather"],
))
# result.score = 1.0 (correct tool)
```

**Score Range:** 0.0 - 1.0 (higher is better)
### task_completion

Evaluates whether the agent completed the task.

**Required fields:** `task_description`, `output_text`
```python
from risicare_evaluation import TaskCompletionScorer, ScorerInput

scorer = TaskCompletionScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    task_description="Find the weather and summarize it",
    output_text="The weather in Paris is sunny, 72 F.",
))
# result.score = 1.0 (task complete)
```

**Score Range:** 0.0 - 1.0 (higher is better)
### goal_accuracy

Evaluates how accurately the agent achieved the stated goal.

**Required fields:** `goal`, `output_text`
```python
from risicare_evaluation import GoalAccuracyScorer, ScorerInput

scorer = GoalAccuracyScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    goal="Provide a detailed weather forecast",
    output_text="It's sunny in Paris today.",
))
# result.score = 0.6 (partial goal achievement)
```

**Score Range:** 0.0 - 1.0 (higher is better)
## General Scorers

### g_eval

General-purpose evaluation scorer for coherence, structure, and quality.

**Required fields:** `output_text`
```python
from risicare_evaluation import GEvalScorer, ScorerInput

scorer = GEvalScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="First, we analyze the data. Then, we draw conclusions...",
))
# result.score = 0.95 (highly coherent)
```

You can provide custom criteria and evaluation steps for G-Eval:

```python
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="The quarterly report shows...",
    custom_criteria="Professional tone and technical accuracy",
    evaluation_steps=[
        "Check for professional language",
        "Verify technical terms are used correctly",
        "Assess overall structure",
    ],
))
```

**Score Range:** 0.0 - 1.0 (higher is better)
### factuality

Measures the factual correctness of the response.

**Required fields:** `output_text` (the example below also passes the optional `ground_truth` as a reference)
```python
from risicare_evaluation import FactualityScorer, ScorerInput

scorer = FactualityScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="The Earth orbits the Sun.",
    ground_truth="The Earth orbits the Sun in approximately 365.25 days.",
))
# result.score = 1.0 (factually correct)
```

**Score Range:** 0.0 - 1.0 (higher is better)
## Custom Scorers

Create custom scorers by extending `BaseScorer`:
```python
from risicare_evaluation.base import (
    BaseScorer,
    ScorerCategory,
    ScorerConfig,
    ScorerInput,
    ScorerResult,
)
from risicare_evaluation.registry import register_scorer


@register_scorer()
class CustomScorer(BaseScorer[ScorerInput, ScorerConfig]):
    name = "custom"
    version = "1.0.0"
    category = ScorerCategory.GENERAL
    required_fields = ["output_text"]

    def _default_config(self) -> ScorerConfig:
        return ScorerConfig(threshold=0.5)

    async def _score_impl(self, input_data: ScorerInput) -> ScorerResult:
        # Your evaluation logic here
        score = self._calculate_score(input_data.output_text)
        return ScorerResult(
            scorer_name=self.name,
            scorer_version=self.version,
            score=score,
            passed=score >= self.config.threshold,
            threshold=self.config.threshold,
            reasoning="Custom evaluation completed.",
        )
```
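Once registered, the custom scorer is used like any built-in (a minimal sketch; note that `_calculate_score` in the class above is a placeholder you would implement yourself):

```python
scorer = CustomScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="Some model output to evaluate.",
))
print(result.scorer_name, result.score, result.passed)
```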
## Scorer Configuration

All scorers accept a config object:
```python
from risicare_evaluation import FaithfulnessScorer
from risicare_evaluation.base import ScorerConfig

config = ScorerConfig(
    threshold=0.8,           # Pass/fail threshold (0.0 to 1.0)
    model="gpt-4o-mini",     # LLM model for evaluation
    include_reasoning=True,  # Include reasoning in results
    strict_mode=False,       # Raise exceptions on errors
    max_retries=3,           # Max retries for LLM calls
    timeout_seconds=60.0,    # Timeout for LLM calls
)

scorer = FaithfulnessScorer(config=config)
```
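Because every scorer exposes the same async `score` method, independent scorers can also be run concurrently over one input with standard `asyncio` (a sketch; only imports and fields documented on this page are used):

```python
import asyncio

from risicare_evaluation import (
    AnswerRelevancyScorer,
    FaithfulnessScorer,
    ScorerInput,
)

async def evaluate() -> None:
    input_data = ScorerInput(
        trace_id="abc123def456789012345678abcdef01",
        question="What is the capital of France?",
        answer="Paris is the capital of France.",
        contexts=["Paris, the capital of France, is known for the Eiffel Tower."],
    )
    # Assumes each scorer reads only the fields it requires from the shared input.
    results = await asyncio.gather(
        FaithfulnessScorer().score(input_data),
        AnswerRelevancyScorer().score(input_data),
    )
    for result in results:
        print(f"{result.scorer_name}: {result.score:.2f} (passed={result.passed})")

asyncio.run(evaluate())
```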