Ollama
Auto-instrument Ollama for local inference.
Risicare automatically instruments the Ollama SDK for local LLM inference.
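Auto-instrumentation of this kind is usually implemented by wrapping the SDK's entry points at import time. The sketch below is illustrative only, not Risicare's actual code: it shows the general pattern of replacing a callable (here a stand-in for `ollama.chat`) with a wrapper that records latency around the original call.

```python
import functools
import time

def instrument(fn):
    """Wrap a callable so each call records its latency in milliseconds."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            latency_ms = (time.perf_counter() - start) * 1000
            # A real tracer would attach this to a span, e.g.
            # span.set_attribute("gen_ai.latency_ms", latency_ms).
            wrapper.last_latency_ms = latency_ms
    return wrapper

# Stand-in for ollama.chat; risicare.init() would patch the real function.
def fake_chat(model, messages):
    return {"message": {"role": "assistant", "content": "Hello!"}}

chat = instrument(fake_chat)
chat(model="llama3", messages=[{"role": "user", "content": "Hi"}])
print(f"last call took {chat.last_latency_ms:.3f} ms")
```

Because the wrapper preserves the original signature and return value, instrumented code behaves exactly like uninstrumented code apart from the recorded attributes.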
Installation
```bash
pip install risicare ollama
```
No API Key Required
Ollama runs locally, so no API key is needed. Just ensure Ollama is running on your machine.
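If you want to check that the server is up before sending traffic, you can probe Ollama's default local port (11434). The helper below is a hypothetical convenience, not part of either SDK:

```python
import urllib.error
import urllib.request

def ollama_running(base_url: str = "http://localhost:11434") -> bool:
    """Return True if a local Ollama server answers on its default port."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print("Ollama reachable:", ollama_running())
```

If this returns `False`, start the server with `ollama serve` (or launch the desktop app) and retry.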
Auto-Instrumentation
```python
import risicare
import ollama

risicare.init()

# Automatically traced
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
Captured Attributes
| Attribute | Description |
|---|---|
| `gen_ai.system` | `ollama` |
| `gen_ai.request.model` | Model name |
| `gen_ai.request.stream` | Whether streaming was requested |
| `gen_ai.usage.prompt_tokens` | Input tokens (from `prompt_eval_count`) |
| `gen_ai.usage.completion_tokens` | Output tokens (from `eval_count`) |
| `gen_ai.usage.total_tokens` | Total tokens |
| `gen_ai.latency_ms` | Request latency in milliseconds |
Token Count Source
Token counts come from Ollama's `prompt_eval_count` and `eval_count` fields in the response. For the generate API, `gen_ai.prompt.content` is captured instead of message-level inputs.
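The mapping is mechanical: the sketch below shows how the usage fields in a raw Ollama response translate into the attributes above. The field names `prompt_eval_count` and `eval_count` are Ollama's; the helper function itself is hypothetical.

```python
def usage_attributes(response: dict) -> dict:
    """Map Ollama's token-count fields onto gen_ai.usage.* attributes."""
    prompt = response.get("prompt_eval_count", 0)
    completion = response.get("eval_count", 0)
    return {
        "gen_ai.usage.prompt_tokens": prompt,
        "gen_ai.usage.completion_tokens": completion,
        "gen_ai.usage.total_tokens": prompt + completion,
    }

# A truncated ollama.chat response, showing only the usage fields:
resp = {"model": "llama3", "prompt_eval_count": 12, "eval_count": 48}
print(usage_attributes(resp))
# → {'gen_ai.usage.prompt_tokens': 12, 'gen_ai.usage.completion_tokens': 48,
#    'gen_ai.usage.total_tokens': 60}
```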
Streaming
```python
stream = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    print(chunk["message"]["content"], end="")
```
Generate API
The generate function is also instrumented:
```python
response = ollama.generate(
    model="llama3",
    prompt="Hello, how are you?"
)
```
Async Support
```python
import asyncio

from ollama import AsyncClient

async def main():
    client = AsyncClient()
    response = await client.chat(
        model="llama3",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print(response["message"]["content"])

asyncio.run(main())
```
Popular Models
| Model | Description |
|---|---|
| `llama3` | Llama 3 (default size) |
| `llama3:70b` | Llama 3 70B |
| `mistral` | Mistral 7B |
| `mixtral` | Mixtral 8x7B |
| `codellama` | Code Llama |
| `phi3` | Phi-3 |