Core Concepts

Understand the key concepts in Risicare observability.

This guide covers traces and spans, agents, sessions, semantic phases, context propagation, the error taxonomy, and the self-healing pipeline.

Traces and Spans

Traces

A trace represents a complete execution flow through your agent. It starts when your agent receives input and ends when it produces output.

Trace: "Answer user question about weather"
├── Span: parse_input (2ms)
├── Span: llm_call to gpt-4o (1.2s)
├── Span: tool_call to weather_api (300ms)
└── Span: format_response (5ms)

Spans

A span represents a single unit of work within a trace. Spans have:

  • Name: What operation this span represents
  • Kind: The type of span (LLM_CALL, TOOL_CALL, AGENT, etc.)
  • Timing: Start time and duration
  • Attributes: Key-value metadata
  • Events: Named events that occurred during the span
  • Status: OK or ERROR
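To make the field list concrete, here is a minimal sketch of a span as a plain Python data structure. It is illustrative only: the names `SpanRecord`, `SpanKind`, `start_time`, and `duration_ms` are assumptions made for this example, not Risicare's actual data model.

```python
from dataclasses import dataclass, field
from enum import Enum


class SpanKind(Enum):
    # Only the kinds named in the list above; the real SDK may define more.
    LLM_CALL = "LLM_CALL"
    TOOL_CALL = "TOOL_CALL"
    AGENT = "AGENT"


@dataclass
class SpanRecord:
    """Hypothetical in-memory shape of a span (not the SDK's class)."""
    name: str
    kind: SpanKind
    start_time: float            # seconds since the epoch
    duration_ms: float
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    status: str = "OK"           # "OK" or "ERROR"


span = SpanRecord(
    name="tool_call to weather_api",
    kind=SpanKind.TOOL_CALL,
    start_time=1_700_000_000.0,
    duration_ms=300.0,
    attributes={"tool.name": "weather_api"},
)
span.events.append("response_received")
```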

Span Hierarchy

Spans form a tree structure with parent-child relationships:

with start_span("process_request") as parent:
    with start_span("call_llm") as child:
        # child.parent_id == parent.span_id
        pass

Agents

An agent is a logical component that makes decisions. In Risicare, agents are identified by:

  • ID: Unique identifier (auto-generated or explicit)
  • Name: Human-readable name (e.g., "planner", "researcher")
  • Role: The agent's role (orchestrator, worker, reviewer)
  • Type: The agent framework/pattern used

@agent(name="planner", role="orchestrator")
def plan_task(objective):
    # All spans inside this function are associated with this agent
    pass

Sessions

A session groups related traces from the same user interaction. Use sessions to:

  • Track multi-turn conversations
  • Group related agent executions
  • Analyze user journeys

with session_context(session_id="user-123-session"):
    # All traces here belong to this session
    result1 = agent.run("First request")
    result2 = agent.run("Follow-up request")
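Because every trace in a session shares the same session ID, multi-turn analysis reduces to a group-by on that field. A small sketch, assuming trace records have been exported as dictionaries (the record shape and field names here are hypothetical):

```python
from collections import defaultdict

# Hypothetical exported trace records; field names are assumptions.
traces = [
    {"trace_id": "t1", "session_id": "user-123-session", "input": "First request"},
    {"trace_id": "t2", "session_id": "user-123-session", "input": "Follow-up request"},
    {"trace_id": "t3", "session_id": "user-456-session", "input": "Unrelated request"},
]

# Group trace IDs by the session they belong to, preserving order.
by_session = defaultdict(list)
for trace in traces:
    by_session[trace["session_id"]].append(trace["trace_id"])
```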

Semantic Phases

Risicare tracks semantic phases to understand agent decision-making:

Phase    Description             Example
THINK    Reasoning and planning  Analyzing the problem
DECIDE   Making a decision       Choosing which tool to use
ACT      Taking an action        Calling an API
OBSERVE  Reading state           Checking memory

@trace_think
def analyze_problem(context):
    """This is a THINK phase - reasoning about the problem"""
    pass
 
@trace_decide
def choose_tool(options):
    """This is a DECIDE phase - selecting an action"""
    pass
 
@trace_act
def execute_tool(tool, args):
    """This is an ACT phase - performing the action"""
    pass
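For intuition about what a phase decorator does, the sketch below builds a toy equivalent that records the phase each time the wrapped function runs. `toy_trace_phase` and `PHASE_LOG` are hypothetical names for this example; the real decorators presumably attach the phase to a span rather than to a list.

```python
import functools

PHASE_LOG = []  # (phase, function name) pairs, in call order


def toy_trace_phase(phase):
    """Build a decorator that records `phase` each time the function runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            PHASE_LOG.append((phase, func.__name__))
            return func(*args, **kwargs)
        return wrapper
    return decorator


toy_trace_think = toy_trace_phase("THINK")
toy_trace_act = toy_trace_phase("ACT")


@toy_trace_think
def analyze_problem(context):
    return f"plan for {context}"


@toy_trace_act
def execute_tool(tool, args):
    return f"{tool}({args})"


plan = analyze_problem("weather question")
result = execute_tool("weather_api", "Paris")
```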

Context Propagation

Risicare automatically propagates context through your code:

  • Thread-safe: Uses Python contextvars for thread isolation
  • Async-safe: Works correctly with asyncio
  • Cross-process: Supports W3C Trace Context for distributed tracing

Automatic Propagation

@agent(name="parent")
async def parent_agent():
    # Context automatically propagates to child calls
    await child_agent()  # Inherits trace context
 
@agent(name="child")
async def child_agent():
    # This agent's spans are children of parent_agent's span
    pass
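The standard-library behaviour that makes this kind of propagation possible can be seen in isolation. In the sketch below (plain `contextvars` and `asyncio`, no Risicare code; `trace_id_var` is a hypothetical variable), an awaited coroutine sees the caller's value, while concurrently running tasks each keep their own.

```python
import asyncio
import contextvars

trace_id_var = contextvars.ContextVar("trace_id", default=None)


async def child_agent():
    # An awaited coroutine runs in the caller's context, so it sees the value.
    return trace_id_var.get()


async def parent_agent(trace_id):
    trace_id_var.set(trace_id)
    await asyncio.sleep(0)  # yield control so the two tasks interleave
    return await child_agent()


async def main():
    # gather() wraps each coroutine in a task, and every task gets its own
    # copy of the context, so the two trace IDs do not leak into each other.
    return await asyncio.gather(parent_agent("trace-a"), parent_agent("trace-b"))


seen = asyncio.run(main())
```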

Manual Context

# Extract context for passing to another system
context = get_trace_context()
 
# Restore context in another thread/process
with restore_trace_context(context):
    # Spans created here continue the trace
    pass
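For cross-process propagation, W3C Trace Context carries the trace in a `traceparent` header of the form `version-traceid-parentid-flags`. The helpers below are a sketch of formatting and parsing that header for version `00`; `format_traceparent` and `parse_traceparent` are hypothetical names, and nothing here is a claim about what `get_trace_context()` returns.

```python
import re
import secrets

# traceparent: version "00", 32-hex trace-id, 16-hex parent-id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id, span_id, sampled=True):
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


header = format_traceparent(secrets.token_hex(16), secrets.token_hex(8))
parsed = parse_traceparent(header)
```

A receiving service would read this header from the incoming request and start its spans under the parsed trace ID.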

Error Taxonomy

When errors occur, Risicare classifies them using a 10-module taxonomy:

Module         What It Covers
PERCEPTION     Input parsing, validation
REASONING      Logic errors, hallucinations
TOOL           Tool execution failures
MEMORY         State management issues
OUTPUT         Response formatting
COORDINATION   Workflow problems
COMMUNICATION  Inter-agent messages
ORCHESTRATION  Agent lifecycle
CONSENSUS      Multi-agent agreement
RESOURCES      Resource contention

Each module contains categories, and each category contains specific error codes:

TOOL.EXECUTION.TIMEOUT
 │      │        └── Specific error code
 │      └── Category (EXECUTION)
 └── Module (TOOL)
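Since the codes follow a fixed `MODULE.CATEGORY.CODE` shape, they are easy to split and validate in your own tooling. A minimal sketch, assuming the dotted format shown above always has exactly three parts; `parse_error_code` and `ErrorCode` are hypothetical helpers, not part of the SDK:

```python
from typing import NamedTuple

# The ten modules listed above.
MODULES = {
    "PERCEPTION", "REASONING", "TOOL", "MEMORY", "OUTPUT",
    "COORDINATION", "COMMUNICATION", "ORCHESTRATION", "CONSENSUS", "RESOURCES",
}


class ErrorCode(NamedTuple):
    module: str
    category: str
    code: str


def parse_error_code(value):
    parts = value.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected MODULE.CATEGORY.CODE, got {value!r}")
    if parts[0] not in MODULES:
        raise ValueError(f"unknown module: {parts[0]!r}")
    return ErrorCode(*parts)


parsed = parse_error_code("TOOL.EXECUTION.TIMEOUT")
```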

The Self-Healing Pipeline

When an error is detected, Risicare runs a 4-stage diagnosis pipeline:

Error Detected
     ↓
1. Context Extraction
   Extract relevant spans, messages, and state
     ↓
2. Taxonomy Classification
   Classify using gpt-4o-mini (fast)
     ↓
3. Root Cause Analysis
   Deep analysis using gpt-4o (thorough)
     ↓
4. Fix Suggestion
   Generate fix configurations

Fixes are then validated through hypothesis testing and deployed via A/B testing.
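The four stages can be pictured as functions applied in order to a shared diagnosis record. The sketch below shows only that control flow: every name in it is hypothetical, the classification and analysis steps are string-matching placeholders standing in for the model calls, and `UNCLASSIFIED` and the `retry` fix are made-up values, not part of the taxonomy or of Risicare's fix format.

```python
from dataclasses import dataclass, field


@dataclass
class Diagnosis:
    error_message: str
    context: dict = field(default_factory=dict)
    error_code: str = ""
    root_cause: str = ""
    fix: dict = field(default_factory=dict)


def extract_context(d):
    # Stage 1: gather the data later stages need (here, just the message).
    d.context = {"message": d.error_message}
    return d


def classify(d):
    # Stage 2: placeholder rule standing in for the model-based classifier.
    timed_out = "timed out" in d.error_message
    d.error_code = "TOOL.EXECUTION.TIMEOUT" if timed_out else "UNCLASSIFIED"
    return d


def analyze_root_cause(d):
    # Stage 3: placeholder for the deeper analysis step.
    d.root_cause = f"placeholder analysis of {d.error_code}"
    return d


def suggest_fix(d):
    # Stage 4: emit a made-up fix configuration for timeouts only.
    d.fix = {"action": "retry"} if d.error_code.endswith("TIMEOUT") else {}
    return d


STAGES = [extract_context, classify, analyze_root_cause, suggest_fix]


def diagnose(error_message):
    d = Diagnosis(error_message)
    for stage in STAGES:
        d = stage(d)
    return d


report = diagnose("weather_api timed out after 30s")
```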
