
Evaluations

Evaluate LLM outputs with built-in scorers.

Risicare provides 13 built-in scorers for evaluating LLM outputs across RAG, safety, agent behavior, and general quality.

Overview

Evaluations can be triggered:

  • Automatically on every trace
  • Via API for batch evaluation
  • In the dashboard for ad-hoc analysis

Scorer Categories

RAG Scorers

Evaluate retrieval-augmented generation quality:

| Scorer | Class | Description |
| --- | --- | --- |
| `faithfulness` | `FaithfulnessScorer` | Does the response stay faithful to the retrieved context? |
| `answer_relevancy` | `AnswerRelevancyScorer` | Is the response relevant to the query? |
| `context_precision` | `ContextPrecisionScorer` | How precise is the retrieved context? |
| `context_recall` | `ContextRecallScorer` | Does the context contain all needed information? |
| `hallucination` | `HallucinationScorer` | Does the response contain hallucinated information? |

Safety Scorers

Detect harmful or inappropriate content:

| Scorer | Class | Description |
| --- | --- | --- |
| `toxicity` | `ToxicityScorer` | Offensive or harmful language |
| `bias` | `BiasScorer` | Unfair or prejudiced content |
| `pii_leakage` | `PIILeakageScorer` | Personally identifiable information leakage |

Agent Scorers

Evaluate agent behavior:

| Scorer | Class | Description |
| --- | --- | --- |
| `tool_correctness` | `ToolCorrectnessScorer` | Did the agent select appropriate tools? |
| `task_completion` | `TaskCompletionScorer` | Did the agent complete the task? |
| `goal_accuracy` | `GoalAccuracyScorer` | How accurately did the agent achieve the goal? |

General Scorers

General quality metrics:

| Scorer | Class | Description |
| --- | --- | --- |
| `g_eval` | `GEvalScorer` | General evaluation (coherence, structure, quality) |
| `factuality` | `FactualityScorer` | Is the response factually correct? |

ScorerInput Fields

All scorers accept a `ScorerInput` dataclass. Each scorer uses a subset of these fields based on its `required_fields`.

| Field | Type | Description |
| --- | --- | --- |
| `trace_id` | `str` | Unique identifier for the trace being evaluated (required) |
| `span_id` | `str \| None` | Optional span identifier within the trace |
| `question` | `str \| None` | The user's question/query (RAG scorers) |
| `answer` | `str \| None` | The AI's response/answer to evaluate (RAG scorers) |
| `contexts` | `list[str]` | List of context passages retrieved for RAG |
| `ground_truth` | `str \| None` | The expected/correct answer for comparison |
| `expected_tools` | `list[str]` | List of tool names the agent should have used |
| `used_tools` | `list[str]` | List of tool names the agent actually used |
| `tool_calls` | `list[dict]` | Detailed tool call information with parameters |
| `task_description` | `str \| None` | Description of the task assigned to the agent |
| `goal` | `str \| None` | The goal the agent was trying to achieve |
| `output_text` | `str \| None` | Generic output text to evaluate |
| `input_text` | `str \| None` | Generic input text for context |
| `custom_criteria` | `str \| None` | User-defined evaluation criteria (G-Eval) |
| `evaluation_steps` | `list[str]` | Steps to follow during evaluation (G-Eval) |
| `metadata` | `dict` | Additional key-value metadata |
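
The field list above can be pictured as a dataclass. The sketch below is an illustrative local mirror of the documented fields, not the SDK's actual class definition; import it from the Risicare package in real code:

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Illustrative mirror of the documented ScorerInput fields.
@dataclass
class ScorerInput:
    trace_id: str                       # the only always-required field
    span_id: str | None = None
    question: str | None = None
    answer: str | None = None
    contexts: list[str] = field(default_factory=list)
    ground_truth: str | None = None
    expected_tools: list[str] = field(default_factory=list)
    used_tools: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    task_description: str | None = None
    goal: str | None = None
    output_text: str | None = None
    input_text: str | None = None
    custom_criteria: str | None = None
    evaluation_steps: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

# A faithfulness-style input only needs answer and contexts on top of trace_id:
si = ScorerInput(
    trace_id="abc123",
    answer="Paris is the capital of France.",
    contexts=["France's capital city is Paris."],
)
```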

Required Fields by Scorer

| Scorer | Required Fields |
| --- | --- |
| `faithfulness` | `answer`, `contexts` |
| `answer_relevancy` | `question`, `answer` |
| `context_precision` | `question`, `contexts` |
| `context_recall` | `contexts`, `ground_truth` |
| `hallucination` | `answer`, `contexts` |
| `toxicity` | `output_text` |
| `bias` | `output_text` |
| `pii_leakage` | `output_text` |
| `tool_correctness` | (none required; uses `expected_tools` and `used_tools`) |
| `task_completion` | `task_description`, `output_text` |
| `goal_accuracy` | `goal`, `output_text` |
| `g_eval` | `output_text` |
| `factuality` | `output_text` |
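
One use for this mapping is a pre-flight check before submitting a trace for evaluation. The sketch below transcribes the table into a dict and validates a payload against it; the helper name `missing_fields` is illustrative, not part of the SDK:

```python
# Required fields per scorer, transcribed from the table above.
REQUIRED_FIELDS = {
    "faithfulness": ["answer", "contexts"],
    "answer_relevancy": ["question", "answer"],
    "context_precision": ["question", "contexts"],
    "context_recall": ["contexts", "ground_truth"],
    "hallucination": ["answer", "contexts"],
    "toxicity": ["output_text"],
    "bias": ["output_text"],
    "pii_leakage": ["output_text"],
    "tool_correctness": [],  # uses expected_tools/used_tools when present
    "task_completion": ["task_description", "output_text"],
    "goal_accuracy": ["goal", "output_text"],
    "g_eval": ["output_text"],
    "factuality": ["output_text"],
}

def missing_fields(scorer: str, payload: dict) -> list[str]:
    """Return the required fields that are absent or empty in payload."""
    return [f for f in REQUIRED_FIELDS[scorer] if not payload.get(f)]

# An empty contexts list counts as missing for faithfulness:
missing_fields("faithfulness", {"answer": "Paris.", "contexts": []})
```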

Enabling Evaluations

Automatic Evaluation

Enable for all traces:

```python
risicare.init(
    api_key="rsk-...",
    project_id="proj-...",
    evaluations={
        "scorers": ["faithfulness", "toxicity", "g_eval"],
        "sample_rate": 0.1  # Evaluate 10% of traces
    }
)
```
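
The `sample_rate` option implies a per-trace sampling decision. A minimal sketch of how a 10% sample could be decided, purely to illustrate the semantics (not the SDK's actual logic):

```python
import random

def should_evaluate(sample_rate: float) -> bool:
    """Return True for roughly sample_rate of calls (0.0 to 1.0)."""
    # random.random() is uniform on [0.0, 1.0), so rate 1.0 always
    # evaluates and rate 0.0 never does.
    return random.random() < sample_rate
```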

API Trigger

Evaluate specific traces via API:

```bash
curl -X POST https://app.risicare.ai/v1/evaluations \
  -H "Authorization: Bearer rsk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "trace_id": "abc123",
    "scorers": ["faithfulness", "answer_relevancy", "toxicity"]
  }'
```
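
The same request can be built from Python with only the standard library. The endpoint, headers, and body shape are taken from the curl example above; this is a sketch, not an official client:

```python
import json
import urllib.request

def build_evaluation_request(api_key: str, trace_id: str,
                             scorers: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST shown in the curl example."""
    body = json.dumps({"trace_id": trace_id, "scorers": scorers}).encode()
    return urllib.request.Request(
        "https://app.risicare.ai/v1/evaluations",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_evaluation_request("rsk-...", "abc123", ["faithfulness", "toxicity"])
# urllib.request.urlopen(req) would send it
```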

Dashboard

Trigger evaluations from the trace detail view in the dashboard.

Evaluation Results

Results include:

```json
{
  "trace_id": "abc123",
  "evaluations": [
    {
      "scorer": "faithfulness",
      "score": 0.92,
      "passed": true,
      "reasoning": "Response accurately reflects the retrieved context..."
    },
    {
      "scorer": "toxicity",
      "score": 0.01,
      "passed": true,
      "reasoning": "No toxic content detected."
    }
  ]
}
```
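
A payload in this shape reduces easily to a pass/fail summary. A small sketch, using only the field names shown in the example above:

```python
def failed_scorers(result: dict) -> list[str]:
    """Names of scorers whose evaluation did not pass."""
    return [e["scorer"] for e in result["evaluations"] if not e["passed"]]

result = {
    "trace_id": "abc123",
    "evaluations": [
        {"scorer": "faithfulness", "score": 0.92, "passed": True, "reasoning": "..."},
        {"scorer": "toxicity", "score": 0.45, "passed": False, "reasoning": "..."},
    ],
}
failed_scorers(result)  # ["toxicity"]
```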

Thresholds

Configure pass/fail thresholds:

```python
risicare.init(
    evaluations={
        "scorers": ["faithfulness", "toxicity"],
        "thresholds": {
            "faithfulness": 0.8,  # Must score >= 0.8
            "toxicity": 0.1       # Must score <= 0.1
        }
    }
)
```
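
Note the two comparison directions: quality scorers like faithfulness must score at or above their threshold, while risk scorers like toxicity must score at or below it. A sketch of that split; which scorers count as "risk" scorers here is an assumption for illustration:

```python
# Assumed split: risk scorers pass when the score is at or below the
# threshold; all other scorers pass at or above it.
RISK_SCORERS = {"toxicity", "bias", "pii_leakage", "hallucination"}

def passes(scorer: str, score: float, threshold: float) -> bool:
    if scorer in RISK_SCORERS:
        return score <= threshold
    return score >= threshold
```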

Alerts

Trigger alerts on evaluation failures:

```python
risicare.init(
    evaluations={
        "scorers": ["toxicity"],
        "alerts": {
            "toxicity": {
                "threshold": 0.3,
                "channel": "slack",
                "webhook": "https://hooks.slack.com/..."
            }
        }
    }
)
```

Next Steps