
Evaluations

Evaluate LLM outputs with built-in scorers.

Risicare provides 13 built-in scorers for evaluating LLM outputs across RAG, safety, agent behavior, and general quality.

Overview

Evaluations can be triggered:

  • Automatically on every trace
  • Via API for batch evaluation
  • In the dashboard for ad-hoc analysis

Scorer Categories

RAG Scorers

Evaluate retrieval-augmented generation quality:

| Scorer | Class | Description |
| --- | --- | --- |
| `faithfulness` | `FaithfulnessScorer` | Does the response stay faithful to the retrieved context? |
| `answer_relevancy` | `AnswerRelevancyScorer` | Is the response relevant to the query? |
| `context_precision` | `ContextPrecisionScorer` | How precise is the retrieved context? |
| `context_recall` | `ContextRecallScorer` | Does the context contain all needed information? |
| `hallucination` | `HallucinationScorer` | Does the response contain hallucinated information? |

Safety Scorers

Detect harmful or inappropriate content:

| Scorer | Class | Description |
| --- | --- | --- |
| `toxicity` | `ToxicityScorer` | Offensive or harmful language |
| `bias` | `BiasScorer` | Unfair or prejudiced content |
| `pii_leakage` | `PIILeakageScorer` | Personally identifiable information leakage |

Agent Scorers

Evaluate agent behavior:

| Scorer | Class | Description |
| --- | --- | --- |
| `tool_correctness` | `ToolCorrectnessScorer` | Did the agent select appropriate tools? |
| `task_completion` | `TaskCompletionScorer` | Did the agent complete the task? |
| `goal_accuracy` | `GoalAccuracyScorer` | How accurately did the agent achieve the goal? |

General Scorers

General quality metrics:

| Scorer | Class | Description |
| --- | --- | --- |
| `g_eval` | `GEvalScorer` | General evaluation (coherence, structure, quality) |
| `factuality` | `FactualityScorer` | Is the response factually correct? |

ScorerInput Fields

All scorers accept a `ScorerInput` dataclass. Each scorer uses a subset of these fields based on its `required_fields`.

| Field | Type | Description |
| --- | --- | --- |
| `trace_id` | `str` | Unique identifier for the trace being evaluated (required) |
| `span_id` | `str \| None` | Optional span identifier within the trace |
| `question` | `str \| None` | The user's question/query (RAG scorers) |
| `answer` | `str \| None` | The AI's response/answer to evaluate (RAG scorers) |
| `contexts` | `list[str]` | List of context passages retrieved for RAG |
| `ground_truth` | `str \| None` | The expected/correct answer for comparison |
| `expected_tools` | `list[str]` | List of tool names the agent should have used |
| `used_tools` | `list[str]` | List of tool names the agent actually used |
| `tool_calls` | `list[dict]` | Detailed tool call information with parameters |
| `task_description` | `str \| None` | Description of the task assigned to the agent |
| `goal` | `str \| None` | The goal the agent was trying to achieve |
| `output_text` | `str \| None` | Generic output text to evaluate |
| `input_text` | `str \| None` | Generic input text for context |
| `custom_criteria` | `str \| None` | User-defined evaluation criteria (G-Eval) |
| `evaluation_steps` | `list[str]` | Steps to follow during evaluation (G-Eval) |
| `metadata` | `dict` | Additional key-value metadata |
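
The field list above can be pictured as a dataclass. The sketch below is an illustrative local mirror of the documented fields, not the SDK's actual class definition; import it from the Risicare package in real code:

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Illustrative mirror of the documented ScorerInput fields.
@dataclass
class ScorerInput:
    trace_id: str                       # the only always-required field
    span_id: str | None = None
    question: str | None = None
    answer: str | None = None
    contexts: list[str] = field(default_factory=list)
    ground_truth: str | None = None
    expected_tools: list[str] = field(default_factory=list)
    used_tools: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    task_description: str | None = None
    goal: str | None = None
    output_text: str | None = None
    input_text: str | None = None
    custom_criteria: str | None = None
    evaluation_steps: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

# A faithfulness-style input only needs answer and contexts on top of trace_id:
si = ScorerInput(
    trace_id="abc123",
    answer="Paris is the capital of France.",
    contexts=["France's capital city is Paris."],
)
```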

Required Fields by Scorer

| Scorer | Required Fields |
| --- | --- |
| `faithfulness` | `answer`, `contexts` |
| `answer_relevancy` | `question`, `answer` |
| `context_precision` | `question`, `contexts` |
| `context_recall` | `contexts`, `ground_truth` |
| `hallucination` | `answer`, `contexts` |
| `toxicity` | `output_text` |
| `bias` | `output_text` |
| `pii_leakage` | `output_text` |
| `tool_correctness` | (none required; uses `expected_tools` and `used_tools`) |
| `task_completion` | `task_description`, `output_text` |
| `goal_accuracy` | `goal`, `output_text` |
| `g_eval` | `output_text` |
| `factuality` | `output_text` |
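
One use for this mapping is a pre-flight check before submitting a trace for evaluation. The sketch below transcribes the table into a dict and validates a payload against it; the helper name `missing_fields` is illustrative, not part of the SDK:

```python
# Required fields per scorer, transcribed from the table above.
REQUIRED_FIELDS = {
    "faithfulness": ["answer", "contexts"],
    "answer_relevancy": ["question", "answer"],
    "context_precision": ["question", "contexts"],
    "context_recall": ["contexts", "ground_truth"],
    "hallucination": ["answer", "contexts"],
    "toxicity": ["output_text"],
    "bias": ["output_text"],
    "pii_leakage": ["output_text"],
    "tool_correctness": [],  # uses expected_tools/used_tools when present
    "task_completion": ["task_description", "output_text"],
    "goal_accuracy": ["goal", "output_text"],
    "g_eval": ["output_text"],
    "factuality": ["output_text"],
}

def missing_fields(scorer: str, payload: dict) -> list[str]:
    """Return the required fields that are absent or empty in payload."""
    return [f for f in REQUIRED_FIELDS[scorer] if not payload.get(f)]

# An empty contexts list counts as missing for faithfulness:
missing_fields("faithfulness", {"answer": "Paris.", "contexts": []})
```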

Enabling Evaluations

Automatic Evaluation

Enable for all traces:

```python
risicare.init(
    api_key="rsk-...",
    project_id="proj-...",
    evaluations={
        "scorers": ["faithfulness", "toxicity", "g_eval"],
        "sample_rate": 0.1  # Evaluate 10% of traces
    }
)
```
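
The `sample_rate` option implies a per-trace sampling decision. A minimal sketch of how a 10% sample could be decided, purely to illustrate the semantics (not the SDK's actual logic):

```python
import random

def should_evaluate(sample_rate: float) -> bool:
    """Return True for roughly sample_rate of calls (0.0 to 1.0)."""
    # random.random() is uniform on [0.0, 1.0), so rate 1.0 always
    # evaluates and rate 0.0 never does.
    return random.random() < sample_rate
```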

API Trigger

Evaluate specific traces via API:

```bash
curl -X POST https://app.risicare.ai/v1/evaluations \
  -H "Authorization: Bearer rsk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "trace_id": "abc123",
    "scorers": ["faithfulness", "answer_relevancy", "toxicity"]
  }'
```
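
The same request can be built from Python with only the standard library. The endpoint, headers, and body shape are taken from the curl example above; this is a sketch, not an official client:

```python
import json
import urllib.request

def build_evaluation_request(api_key: str, trace_id: str,
                             scorers: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST shown in the curl example."""
    body = json.dumps({"trace_id": trace_id, "scorers": scorers}).encode()
    return urllib.request.Request(
        "https://app.risicare.ai/v1/evaluations",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_evaluation_request("rsk-...", "abc123", ["faithfulness", "toxicity"])
# urllib.request.urlopen(req) would send it
```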

Dashboard

Trigger evaluations from the trace detail view in the dashboard.

Evaluation Results

Results include:

```json
{
  "trace_id": "abc123",
  "evaluations": [
    {
      "scorer": "faithfulness",
      "score": 0.92,
      "passed": true,
      "reasoning": "Response accurately reflects the retrieved context..."
    },
    {
      "scorer": "toxicity",
      "score": 0.01,
      "passed": true,
      "reasoning": "No toxic content detected."
    }
  ]
}
```
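
A payload in this shape reduces easily to a pass/fail summary. A small sketch, using only the field names shown in the example above:

```python
def failed_scorers(result: dict) -> list[str]:
    """Names of scorers whose evaluation did not pass."""
    return [e["scorer"] for e in result["evaluations"] if not e["passed"]]

result = {
    "trace_id": "abc123",
    "evaluations": [
        {"scorer": "faithfulness", "score": 0.92, "passed": True, "reasoning": "..."},
        {"scorer": "toxicity", "score": 0.45, "passed": False, "reasoning": "..."},
    ],
}
failed_scorers(result)  # ["toxicity"]
```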

Thresholds

Configure pass/fail thresholds:

```python
risicare.init(
    evaluations={
        "scorers": ["faithfulness", "toxicity"],
        "thresholds": {
            "faithfulness": 0.8,  # Must score >= 0.8
            "toxicity": 0.1       # Must score <= 0.1
        }
    }
)
```
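
Note the two comparison directions: quality scorers like faithfulness must score at or above their threshold, while risk scorers like toxicity must score at or below it. A sketch of that split; which scorers count as "risk" scorers here is an assumption for illustration:

```python
# Assumed split: risk scorers pass when the score is at or below the
# threshold; all other scorers pass at or above it.
RISK_SCORERS = {"toxicity", "bias", "pii_leakage", "hallucination"}

def passes(scorer: str, score: float, threshold: float) -> bool:
    if scorer in RISK_SCORERS:
        return score <= threshold
    return score >= threshold
```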

Alerts

Trigger alerts on evaluation failures:

```python
risicare.init(
    evaluations={
        "scorers": ["toxicity"],
        "alerts": {
            "toxicity": {
                "threshold": 0.3,
                "channel": "slack",
                "webhook": "https://hooks.slack.com/..."
            }
        }
    }
)
```

Next Steps