Evaluations
Evaluate LLM outputs with built-in scorers.
Risicare provides 13 built-in scorers for evaluating LLM outputs across RAG, safety, agent behavior, and general quality.
Overview
Evaluations can be triggered:
- Automatically on every trace
- Via API for batch evaluation
- In the dashboard for ad-hoc analysis
Scorer Categories
RAG Scorers
Evaluate retrieval-augmented generation quality:
| Scorer | Class | Description |
|---|---|---|
| faithfulness | FaithfulnessScorer | Does the response stay faithful to retrieved context? |
| answer_relevancy | AnswerRelevancyScorer | Is the response relevant to the query? |
| context_precision | ContextPrecisionScorer | How precise is the retrieved context? |
| context_recall | ContextRecallScorer | Does the context contain all needed information? |
| hallucination | HallucinationScorer | Does the response contain hallucinated information? |
Safety Scorers
Detect harmful or inappropriate content:
| Scorer | Class | Description |
|---|---|---|
| toxicity | ToxicityScorer | Offensive or harmful language |
| bias | BiasScorer | Unfair or prejudiced content |
| pii_leakage | PIILeakageScorer | Personally identifiable information (PII) leakage |
Agent Scorers
Evaluate agent behavior:
| Scorer | Class | Description |
|---|---|---|
| tool_correctness | ToolCorrectnessScorer | Did the agent select appropriate tools? |
| task_completion | TaskCompletionScorer | Did the agent complete the task? |
| goal_accuracy | GoalAccuracyScorer | How accurately did the agent achieve the goal? |
General Scorers
General quality metrics:
| Scorer | Class | Description |
|---|---|---|
| g_eval | GEvalScorer | General evaluation (coherence, structure, quality) |
| factuality | FactualityScorer | Is the response factually correct? |
ScorerInput Fields
All scorers accept a ScorerInput dataclass. Each scorer uses a subset of these fields based on its required_fields.
| Field | Type | Description |
|---|---|---|
| trace_id | str | Unique identifier for the trace being evaluated (required) |
| span_id | str \| None | Optional span identifier within the trace |
| question | str \| None | The user's question/query (RAG scorers) |
| answer | str \| None | The AI's response/answer to evaluate (RAG scorers) |
| contexts | list[str] | List of context passages retrieved for RAG |
| ground_truth | str \| None | The expected/correct answer for comparison |
| expected_tools | list[str] | List of tool names the agent should have used |
| used_tools | list[str] | List of tool names the agent actually used |
| tool_calls | list[dict] | Detailed tool call information with parameters |
| task_description | str \| None | Description of the task assigned to the agent |
| goal | str \| None | The goal the agent was trying to achieve |
| output_text | str \| None | Generic output text to evaluate |
| input_text | str \| None | Generic input text for context |
| custom_criteria | str \| None | User-defined evaluation criteria (G-Eval) |
| evaluation_steps | list[str] | Steps to follow during evaluation (G-Eval) |
| metadata | dict | Additional key-value metadata |
Required Fields by Scorer
| Scorer | Required Fields |
|---|---|
| faithfulness | answer, contexts |
| answer_relevancy | question, answer |
| context_precision | question, contexts |
| context_recall | contexts, ground_truth |
| hallucination | answer, contexts |
| toxicity | output_text |
| bias | output_text |
| pii_leakage | output_text |
| tool_correctness | (none required; uses expected_tools and used_tools) |
| task_completion | task_description, output_text |
| goal_accuracy | goal, output_text |
| g_eval | output_text |
| factuality | output_text |
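The mapping above lends itself to a simple pre-flight check before dispatching a scorer. The helper below is a hypothetical sketch (REQUIRED_FIELDS and missing_fields are illustrative names, not SDK API), transcribing the table directly:

```python
# Required fields per scorer, transcribed from the table above.
REQUIRED_FIELDS = {
    "faithfulness": ["answer", "contexts"],
    "answer_relevancy": ["question", "answer"],
    "context_precision": ["question", "contexts"],
    "context_recall": ["contexts", "ground_truth"],
    "hallucination": ["answer", "contexts"],
    "toxicity": ["output_text"],
    "bias": ["output_text"],
    "pii_leakage": ["output_text"],
    "tool_correctness": [],  # uses expected_tools / used_tools when present
    "task_completion": ["task_description", "output_text"],
    "goal_accuracy": ["goal", "output_text"],
    "g_eval": ["output_text"],
    "factuality": ["output_text"],
}

def missing_fields(scorer: str, payload: dict) -> list[str]:
    """Return the required fields that are absent or empty for a scorer."""
    return [f for f in REQUIRED_FIELDS[scorer] if not payload.get(f)]

# A payload with contexts but no answer cannot run faithfulness:
print(missing_fields("faithfulness", {"contexts": ["..."]}))  # → ['answer']
```

A check like this catches incomplete inputs before spending an evaluation call.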
Enabling Evaluations
Automatic Evaluation
Enable automatic evaluation when initializing the SDK; sample_rate controls the fraction of traces evaluated:
```python
risicare.init(
    api_key="rsk-...",
    project_id="proj-...",
    evaluations={
        "scorers": ["faithfulness", "toxicity", "g_eval"],
        "sample_rate": 0.1  # Evaluate 10% of traces
    }
)
```

API Trigger
Evaluate specific traces via API:
```bash
curl -X POST https://app.risicare.ai/v1/evaluations \
  -H "Authorization: Bearer rsk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "trace_id": "abc123",
    "scorers": ["faithfulness", "answer_relevancy", "toxicity"]
  }'
```

Dashboard
Trigger evaluations from the trace detail view in the dashboard.
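The API trigger shown above can also be issued from Python. A minimal standard-library sketch (the endpoint, header shapes, and payload are taken from the curl example; build_evaluation_request is an illustrative helper, not SDK API):

```python
import json
import urllib.request

def build_evaluation_request(
    trace_id: str, scorers: list[str], api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a POST matching the curl example above."""
    body = json.dumps({"trace_id": trace_id, "scorers": scorers}).encode()
    return urllib.request.Request(
        "https://app.risicare.ai/v1/evaluations",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_evaluation_request("abc123", ["faithfulness", "toxicity"], "rsk-...")
# urllib.request.urlopen(req) would send it; omitted here.
```

Separating request construction from sending makes the payload easy to inspect or test without hitting the API.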
Evaluation Results
Results include:
```json
{
  "trace_id": "abc123",
  "evaluations": [
    {
      "scorer": "faithfulness",
      "score": 0.92,
      "passed": true,
      "reasoning": "Response accurately reflects the retrieved context..."
    },
    {
      "scorer": "toxicity",
      "score": 0.01,
      "passed": true,
      "reasoning": "No toxic content detected."
    }
  ]
}
```

Thresholds
Configure pass/fail thresholds:
```python
risicare.init(
    evaluations={
        "scorers": ["faithfulness", "toxicity"],
        "thresholds": {
            "faithfulness": 0.8,  # Must score >= 0.8
            "toxicity": 0.1       # Must score <= 0.1
        }
    }
)
```

Alerts
Trigger alerts on evaluation failures:
```python
risicare.init(
    evaluations={
        "scorers": ["toxicity"],
        "alerts": {
            "toxicity": {
                "threshold": 0.3,
                "channel": "slack",
                "webhook": "https://hooks.slack.com/..."
            }
        }
    }
)
```
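Note that the comparison direction flips per scorer: quality scorers like faithfulness must score at or above their threshold, while risk scorers like toxicity must score at or below. A sketch of applying this logic client-side (the hosted service computes passed server-side; the RISK_SCORERS set is an assumption inferred from the threshold comments above):

```python
# Assumed split: risk scorers pass when the score is at or below the
# threshold; all other scorers pass at or above it.
RISK_SCORERS = {"toxicity", "bias", "pii_leakage", "hallucination"}

def passed(scorer: str, score: float, threshold: float) -> bool:
    if scorer in RISK_SCORERS:
        return score <= threshold
    return score >= threshold

# Apply the thresholds from the Thresholds example to the sample results.
results = [
    {"scorer": "faithfulness", "score": 0.92},
    {"scorer": "toxicity", "score": 0.01},
]
thresholds = {"faithfulness": 0.8, "toxicity": 0.1}

for r in results:
    r["passed"] = passed(r["scorer"], r["score"], thresholds[r["scorer"]])

print([r["passed"] for r in results])  # → [True, True]
```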