Scorers

Built-in and custom scoring for LLM evaluation.

Risicare provides two ways to score your traces:

  1. Built-in scorers — 13 pre-configured LLM-based evaluators that run server-side when you trigger an evaluation
  2. Custom scores — Use risicare.score() to record any metric from your own code

Custom Scores with risicare.score()

The simplest way to add scores to your traces. No extra packages needed — it's built into the SDK you already have.

```python
import risicare

risicare.init(api_key="rsk-your-api-key")

# Score a trace with any custom metric
risicare.score(
    trace_id="trace-abc123",
    name="sql_valid",
    value=1.0,
    comment="Query executed without errors"
)
```

JavaScript / TypeScript:

```typescript
import { init, score } from 'risicare';

init({ apiKey: 'rsk-your-api-key' });

score('trace-abc123', 'sql_valid', 1.0, {
    comment: 'Query executed without errors',
});
```

Scoring Inside a Trace

```python
import risicare

@risicare.trace
def my_pipeline(query):
    result = llm.invoke(query)

    # Score this trace based on custom logic
    trace_id = risicare.get_current_trace_id()
    if trace_id:
        is_valid = validate_output(result)
        risicare.score(
            trace_id=trace_id,
            name="output_valid",
            value=1.0 if is_valid else 0.0
        )

    return result
```

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `trace_id` | str | Yes | | The trace to score |
| `name` | str | Yes | | Score name (e.g., `"accuracy"`, `"user_satisfaction"`) |
| `value` | float | Yes | | Score value |
| `span_id` | str | No | null | Specific span within the trace |
| `comment` | str | No | null | Human-readable explanation |

Scoring via REST API

You can also create scores via HTTP:

```bash
curl -X POST "https://app.risicare.ai/api/v1/scores" \
  -H "Authorization: Bearer rsk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "trace_id": "trace-abc123",
    "name": "accuracy",
    "score": 0.95,
    "comment": "Response matched expected output",
    "source": "api"
  }'
```
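The same request body can be assembled from Python before sending it with your HTTP client of choice. A minimal sketch; note that the REST body uses `score` where the SDK's `risicare.score()` uses `value`, and carrying `span_id` over to the REST body is an assumption based on the SDK parameters, not documented behavior:

```python
import json

API_URL = "https://app.risicare.ai/api/v1/scores"

def build_score_payload(trace_id, name, score, comment=None, span_id=None, source="api"):
    """Build the JSON body for POST /api/v1/scores, mirroring the curl example.

    Optional fields are omitted entirely rather than sent as null.
    """
    payload = {"trace_id": trace_id, "name": name, "score": score, "source": source}
    if comment is not None:
        payload["comment"] = comment
    if span_id is not None:
        payload["span_id"] = span_id
    return payload

body = build_score_payload(
    "trace-abc123", "accuracy", 0.95,
    comment="Response matched expected output",
)
print(json.dumps(body, indent=2))
```

From here, POST `body` to `API_URL` with the `Authorization: Bearer` header shown in the curl example.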

Built-in Scorers

*Screenshot: the Risicare Evaluations dashboard, showing 20 runs, 13 available scorers across the RAG/Safety/Agent/General categories, and completed evaluation results.*

When you create an evaluation via the API or dashboard, you specify which scorers to run using the criteria field. The Risicare server runs these scorers automatically — you don't need to install any extra packages.

Triggering Built-in Scorers

```bash
curl -X POST "https://app.risicare.ai/api/v1/evaluations" \
  -H "Authorization: Bearer rsk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Quality check",
    "evaluation_type": "llm_judge",
    "trace_ids": ["trace-abc123"],
    "criteria": ["faithfulness", "toxicity"]
  }'
```

Or from the dashboard: Evaluations → New Evaluation, select traces, and choose scorers.

Server-side execution

Built-in scorers run on the Risicare server using an LLM-as-judge approach, so you don't need to install any additional packages or supply your own LLM API key. Evaluation requests are accepted with HTTP 202, queued, and processed asynchronously by a worker.
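Because the request returns HTTP 202 before results exist, clients typically poll until the evaluation leaves the queue. A minimal polling sketch; the `fetch_status` callable and the `"completed"`/`"failed"` status strings are assumptions for illustration, since this page doesn't document a status endpoint:

```python
import time

def wait_for_evaluation(fetch_status, timeout=300.0, interval=2.0):
    """Poll fetch_status() until the evaluation reaches a terminal state.

    fetch_status is any zero-argument callable returning a status string
    (e.g., one that GETs the evaluation from the API). The terminal values
    "completed" and "failed" are illustrative, not documented.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("evaluation did not finish in time")

# Example with a stubbed status source:
statuses = iter(["queued", "running", "completed"])
print(wait_for_evaluation(lambda: next(statuses), interval=0.0))
```

Injecting `fetch_status` as a callable keeps the retry logic independent of whatever HTTP client and endpoint you end up using.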

Fully Verified (10 scorers)

These scorers work immediately with standard trace data:

| Scorer | Category | What it evaluates | Score direction |
| --- | --- | --- | --- |
| `faithfulness` | RAG | Is the answer grounded in the provided context? | Higher is better |
| `answer_relevancy` | RAG | Does the answer address the question? | Higher is better |
| `context_precision` | RAG | Is the retrieved context relevant? | Higher is better |
| `hallucination` | RAG | Does the answer contain fabricated claims? | Lower is better |
| `toxicity` | Safety | Is the content toxic, harmful, or offensive? | Lower is better |
| `bias` | Safety | Does the output show demographic or cultural bias? | Lower is better |
| `pii_leakage` | Safety | Does the output leak personally identifiable information? | Lower is better |
| `task_completion` | Agent | Did the agent complete the requested task? | Higher is better |
| `tool_correctness` | Agent | Were the right tools used with correct parameters? | Higher is better |
| `factuality` | General | Are factual claims in the output accurate? | Higher is better |

Require Additional Configuration (3 scorers)

These scorers work correctly but need specific input fields:

| Scorer | Category | Requires | Why |
| --- | --- | --- | --- |
| `context_recall` | RAG | `ground_truth` field in trace data | Compares output against a reference answer |
| `goal_accuracy` | Agent | `goal` field in evaluation config | Measures whether the agent achieved a specific goal |
| `g_eval` | General | Custom criteria in scorer config | Configurable evaluation framework that needs user-defined criteria |
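As a sketch of how those extra inputs might be supplied, the evaluation request from earlier can be extended. The exact field names and placement of `trace_data`, `goal`, and `scorer_config` below are assumptions for illustration; only `ground_truth`, `goal`, and the need for custom `g_eval` criteria come from the table above:

```python
import json

# Hypothetical request body extending POST /api/v1/evaluations with the
# extra inputs the three scorers above need. Field placement is illustrative.
evaluation_request = {
    "name": "Grounded quality check",
    "evaluation_type": "llm_judge",
    "trace_ids": ["trace-abc123"],
    "criteria": ["context_recall", "goal_accuracy", "g_eval"],
    # context_recall: reference answer to compare the output against
    "trace_data": {"trace-abc123": {"ground_truth": "Paris is the capital of France."}},
    # goal_accuracy: the goal the agent was supposed to achieve
    "goal": "Answer the user's geography question correctly",
    # g_eval: user-defined criteria for the configurable framework
    "scorer_config": {"g_eval": {"criteria": "Is the answer concise and polite?"}},
}
print(json.dumps(evaluation_request, indent=2))
```

Check your dashboard's New Evaluation form or the API's error responses to confirm the field names it actually accepts.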

Next Steps