Scorers

Detailed reference for all 13 built-in evaluation scorers.

Import Path

All scorers are imported from the risicare_evaluation package:

from risicare_evaluation import FaithfulnessScorer, ScorerInput

Usage Pattern

All scorers follow the same async pattern:

from risicare_evaluation import FaithfulnessScorer, ScorerInput
 
scorer = FaithfulnessScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    answer="Paris is the capital of France.",
    contexts=["Paris, the capital of France, is known for the Eiffel Tower."],
))
print(result.score)      # 0.0 - 1.0
print(result.passed)     # True/False
print(result.reasoning)  # Explanation
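Because `score` is a coroutine, the calls above must run inside an event loop. A minimal driver sketch using a hypothetical stand-in scorer (so the async call pattern is visible end to end without the package installed):

```python
import asyncio

class StubScorer:
    """Stand-in with the same async interface as a real scorer."""
    async def score(self, payload: dict) -> float:
        # A real scorer would call an LLM here; this just checks for an answer.
        return 1.0 if payload.get("answer") else 0.0

async def main() -> float:
    scorer = StubScorer()
    return await scorer.score({"answer": "Paris is the capital of France."})

print(asyncio.run(main()))  # 1.0
```

In real code, replace `StubScorer` with any of the scorer classes below and pass a `ScorerInput` instead of a dict.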

ScorerResult Fields

Every scorer returns a ScorerResult with these fields:

| Field | Type | Description |
|-------|------|-------------|
| scorer_name | str | Name of the scorer |
| scorer_version | str | Version of the scorer |
| score | float | Evaluation score (0.0 to 1.0; see each scorer for direction) |
| passed | bool | Whether the score meets the threshold |
| threshold | float | Threshold used for pass/fail |
| reasoning | str | Human-readable explanation |
| evidence | list[dict] | Evidence items supporting the evaluation |
| sub_scores | dict[str, float] | Component score breakdown |
| confidence | float | Confidence in the evaluation (0.0 to 1.0) |
| model_used | str \| None | LLM model used for evaluation |
| prompt_tokens | int | Prompt tokens consumed |
| completion_tokens | int | Completion tokens generated |
| cost_usd | float | Estimated cost in USD |
| duration_ms | float | Time taken in milliseconds |
| error | str \| None | Error message if evaluation failed |
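These fields make batch results easy to aggregate. A small sketch, using a minimal stand-in for `ScorerResult` (only the fields used here) to compute a pass rate, mean score, and total cost:

```python
from dataclasses import dataclass

@dataclass
class Result:
    """Minimal stand-in for ScorerResult with just the fields used below."""
    score: float
    passed: bool
    cost_usd: float

def summarize(results: list[Result]) -> dict:
    # Aggregate pass rate, mean score, and total evaluation cost.
    n = len(results)
    return {
        "pass_rate": sum(r.passed for r in results) / n,
        "mean_score": sum(r.score for r in results) / n,
        "total_cost_usd": sum(r.cost_usd for r in results),
    }

batch = [Result(0.9, True, 0.001), Result(0.4, False, 0.002)]
print(summarize(batch))
```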

RAG Scorers

faithfulness

Measures whether the response is faithful to the retrieved context.

Required fields: answer, contexts

from risicare_evaluation import FaithfulnessScorer, ScorerInput
 
scorer = FaithfulnessScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    answer="Paris is the capital of France.",
    contexts=["Paris, the capital of France, is known for the Eiffel Tower."],
))
# result.score = 1.0 (fully faithful)

Score Range: 0.0 - 1.0 (higher is better)

answer_relevancy

Measures response relevance to the original query.

Required fields: question, answer

from risicare_evaluation import AnswerRelevancyScorer, ScorerInput
 
scorer = AnswerRelevancyScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    question="What is the capital of France?",
    answer="Paris is the capital of France.",
))
# result.score = 1.0 (fully relevant)

Score Range: 0.0 - 1.0 (higher is better)

context_precision

Measures what fraction of the retrieved context chunks are relevant to the question.

Required fields: question, contexts

from risicare_evaluation import ContextPrecisionScorer, ScorerInput
 
scorer = ContextPrecisionScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    question="What is the capital of France?",
    contexts=["Paris is the capital.", "France has wine.", "Unrelated text."],
))
# result.score = 0.33 (1 of 3 chunks relevant)
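The example score reflects simple chunk-level precision. Illustratively (a sketch of the idea, not the package's exact formula):

```python
def chunk_precision(relevance_flags: list[bool]) -> float:
    # Fraction of retrieved chunks judged relevant to the question.
    if not relevance_flags:
        return 0.0
    return sum(relevance_flags) / len(relevance_flags)

# One relevant chunk out of three retrieved.
print(round(chunk_precision([True, False, False]), 2))  # 0.33
```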

Score Range: 0.0 - 1.0 (higher is better)

context_recall

Measures whether the retrieved context contains all the information needed to support the ground-truth answer.

Required fields: contexts, ground_truth

from risicare_evaluation import ContextRecallScorer, ScorerInput
 
scorer = ContextRecallScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    contexts=["Paris is the capital of France."],
    ground_truth="Paris, population 2.1 million",
))
# result.score = 0.5 (missing population)

Score Range: 0.0 - 1.0 (higher is better)

hallucination

Detects hallucinated information not supported by the context.

Required fields: answer, contexts

from risicare_evaluation import HallucinationScorer, ScorerInput
 
scorer = HallucinationScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    answer="Paris has a population of 10 million.",
    contexts=["Paris is the capital of France."],
))
# result.score = 0.8 (likely hallucination)

Score Range: 0.0 - 1.0 (lower is better; a high score indicates likely hallucination)
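Because hallucination (and the safety scorers below) are lower-is-better, pass/fail logic is inverted relative to the quality scorers. The scorers handle this internally via `result.passed`; a sketch of the two directions, with hypothetical threshold values:

```python
def passes(score: float, threshold: float, higher_is_better: bool = True) -> bool:
    # Quality scorers pass at or above the threshold; risk scorers
    # (hallucination, toxicity, bias, pii_leakage) pass at or below it.
    return score >= threshold if higher_is_better else score <= threshold

print(passes(0.9, 0.7))                          # True  (faithfulness-style)
print(passes(0.8, 0.3, higher_is_better=False))  # False (hallucination-style)
```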

Safety Scorers

toxicity

Detects offensive, harmful, or inappropriate content.

Required fields: output_text

from risicare_evaluation import ToxicityScorer, ScorerInput
 
scorer = ToxicityScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="Here's how to solve that problem...",
))
# result.score = 0.02 (low toxicity)

Score Range: 0.0 - 1.0 (lower is better)

bias

Detects unfair, prejudiced, or stereotyping content.

Required fields: output_text

from risicare_evaluation import BiasScorer, ScorerInput
 
scorer = BiasScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="All engineers should consider this approach.",
))
# result.score = 0.05 (low bias)

Score Range: 0.0 - 1.0 (lower is better)

pii_leakage

Detects leakage of personally identifiable information (PII).

Required fields: output_text

from risicare_evaluation import PIILeakageScorer, ScorerInput
 
scorer = PIILeakageScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="Contact John at john@email.com or 555-1234.",
))
# result.score = 0.8 (contains PII)

Score Range: 0.0 - 1.0 (lower is better)

Agent Scorers

tool_correctness

Evaluates whether the agent selected appropriate tools.

Required fields: none (uses expected_tools and used_tools)

from risicare_evaluation import ToolCorrectnessScorer, ScorerInput
 
scorer = ToolCorrectnessScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    expected_tools=["weather"],
    used_tools=["weather"],
))
# result.score = 1.0 (correct tool)

Score Range: 0.0 - 1.0 (higher is better)
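One plausible way such a tool-match score could be computed (an illustrative sketch, not the package's actual algorithm) is an F1-style overlap between the expected and used tool sets:

```python
def tool_overlap(expected: list[str], used: list[str]) -> float:
    # F1-style overlap between expected tools and tools the agent called.
    exp, got = set(expected), set(used)
    if not exp and not got:
        return 1.0  # nothing expected, nothing used
    matched = len(exp & got)
    precision = matched / len(got) if got else 0.0
    recall = matched / len(exp) if exp else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(tool_overlap(["weather"], ["weather"]))            # 1.0
print(tool_overlap(["weather", "search"], ["weather"]))  # ~0.67
```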

task_completion

Evaluates whether the agent completed the task.

Required fields: task_description, output_text

from risicare_evaluation import TaskCompletionScorer, ScorerInput
 
scorer = TaskCompletionScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    task_description="Find the weather and summarize it",
    output_text="The weather in Paris is sunny, 72 F.",
))
# result.score = 1.0 (task complete)

Score Range: 0.0 - 1.0 (higher is better)

goal_accuracy

Evaluates how accurately the agent achieved the goal.

Required fields: goal, output_text

from risicare_evaluation import GoalAccuracyScorer, ScorerInput
 
scorer = GoalAccuracyScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    goal="Provide a detailed weather forecast",
    output_text="It's sunny in Paris today.",
))
# result.score = 0.6 (partial goal achievement)

Score Range: 0.0 - 1.0 (higher is better)

General Scorers

g_eval

General evaluation scorer for coherence, structure, and quality.

Required fields: output_text

from risicare_evaluation import GEvalScorer, ScorerInput
 
scorer = GEvalScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="First, we analyze the data. Then, we draw conclusions...",
))
# result.score = 0.95 (highly coherent)

You can provide custom criteria and evaluation steps for G-Eval:

result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="The quarterly report shows...",
    custom_criteria="Professional tone and technical accuracy",
    evaluation_steps=[
        "Check for professional language",
        "Verify technical terms are used correctly",
        "Assess overall structure",
    ],
))

Score Range: 0.0 - 1.0 (higher is better)

factuality

Measures factual correctness of the response.

Required fields: output_text

from risicare_evaluation import FactualityScorer, ScorerInput
 
scorer = FactualityScorer()
result = await scorer.score(ScorerInput(
    trace_id="abc123def456789012345678abcdef01",
    output_text="The Earth orbits the Sun.",
    ground_truth="The Earth orbits the Sun in approximately 365.25 days.",
))
# result.score = 1.0 (factually correct)

Score Range: 0.0 - 1.0 (higher is better)

Custom Scorers

Create custom scorers by extending BaseScorer:

from risicare_evaluation.base import (
    BaseScorer,
    ScorerCategory,
    ScorerConfig,
    ScorerInput,
    ScorerResult,
)
from risicare_evaluation.registry import register_scorer
 
@register_scorer()
class CustomScorer(BaseScorer[ScorerInput, ScorerConfig]):
    name = "custom"
    version = "1.0.0"
    category = ScorerCategory.GENERAL
    required_fields = ["output_text"]
 
    def _default_config(self) -> ScorerConfig:
        return ScorerConfig(threshold=0.5)
 
    async def _score_impl(self, input_data: ScorerInput) -> ScorerResult:
        # Placeholder heuristic (non-empty output scores 1.0);
        # replace with your own evaluation logic.
        text = input_data.output_text or ""
        score = 1.0 if text.strip() else 0.0
        return ScorerResult(
            scorer_name=self.name,
            scorer_version=self.version,
            score=score,
            passed=score >= self.config.threshold,
            threshold=self.config.threshold,
            reasoning="Custom evaluation completed.",
        )

Scorer Configuration

All scorers accept a config object:

from risicare_evaluation.base import ScorerConfig
 
config = ScorerConfig(
    threshold=0.8,           # Pass/fail threshold (0.0 to 1.0)
    model="gpt-4o-mini",     # LLM model for evaluation
    include_reasoning=True,  # Include reasoning in results
    strict_mode=False,       # Raise exceptions on errors
    max_retries=3,           # Max retries for LLM calls
    timeout_seconds=60.0,    # Timeout for LLM calls
)
 
scorer = FaithfulnessScorer(config=config)
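When evaluating one response with several scorers, the async API lets them run concurrently. A sketch with stand-in scorers (real code would use the scorer classes above and `ScorerInput`):

```python
import asyncio

class StubScorer:
    """Stand-in sharing the async interface of the real scorers."""
    def __init__(self, name: str, fixed_score: float) -> None:
        self.name, self._score = name, fixed_score

    async def score(self, payload: dict) -> tuple[str, float]:
        # A real scorer would evaluate the payload with an LLM.
        return self.name, self._score

async def run_all(payload: dict) -> dict:
    scorers = [StubScorer("faithfulness", 0.9), StubScorer("toxicity", 0.02)]
    # gather awaits all scorer coroutines concurrently on one event loop.
    pairs = await asyncio.gather(*(s.score(payload) for s in scorers))
    return dict(pairs)

print(asyncio.run(run_all({"answer": "Paris"})))
```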

Next Steps