Cerebras
Auto-instrument Cerebras for hardware-accelerated inference.
Risicare automatically instruments the Cerebras SDK for wafer-scale inference.
Installation
```bash
pip install risicare cerebras-cloud-sdk
```

Auto-Instrumentation
```python
import risicare
from cerebras.cloud.sdk import Cerebras
risicare.init()
client = Cerebras()
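# The client reads CEREBRAS_API_KEY from the environment by default;
# pass api_key=... to configure the key explicitly.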
# Automatically traced
response = client.chat.completions.create(
    model="llama3.1-70b",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

Captured Attributes
| Attribute | Description |
|---|---|
| gen_ai.system | cerebras |
| gen_ai.request.model | Requested model name |
| gen_ai.response.model | Model name returned by the API |
| gen_ai.response.id | Response ID |
| gen_ai.request.temperature | Sampling temperature |
| gen_ai.request.max_tokens | Maximum output tokens |
| gen_ai.request.stream | Whether streaming was requested |
| gen_ai.request.has_tools | Whether tools were provided |
| gen_ai.usage.prompt_tokens | Input tokens |
| gen_ai.usage.completion_tokens | Output tokens |
| gen_ai.usage.total_tokens | Total tokens |
| gen_ai.completion.tool_calls | Number of tool calls made |
| gen_ai.completion.finish_reason | Stop reason |
| gen_ai.latency_ms | Request latency in milliseconds |
| cerebras.queue_time | Queue wait time (seconds) |
| cerebras.prompt_time | Prompt processing time (seconds) |
| cerebras.completion_time | Completion generation time (seconds) |
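For tool-use requests, gen_ai.request.has_tools is set on the request and gen_ai.completion.tool_calls counts any calls the model made. A minimal sketch, assuming the OpenAI-compatible tool schema the Cerebras SDK accepts (the get_weather function here is hypothetical):

```python
# Hypothetical tool definition in the OpenAI-compatible format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# gen_ai.request.has_tools is recorded on this span; if the model calls
# a tool, gen_ai.completion.tool_calls records how many calls were made.
response = client.chat.completions.create(
    model="llama3.1-70b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
```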
Cerebras provides a detailed timing breakdown via the time_info object on each response, reporting queue wait time, prompt processing time, and completion generation time.
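A minimal sketch of reading that breakdown directly from a response (field names assume the Cerebras SDK's time_info object; values are reported in seconds):

```python
response = client.chat.completions.create(
    model="llama3.1-70b",
    messages=[{"role": "user", "content": "Hello!"}]
)

# time_info fields are reported in seconds.
ti = response.time_info
print(f"queue:      {ti.queue_time:.4f}s")
print(f"prompt:     {ti.prompt_time:.4f}s")
print(f"completion: {ti.completion_time:.4f}s")
```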
Streaming
```python
stream = client.chat.completions.create(
    model="llama3.1-70b",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```

Supported Models
| Model | Description |
|---|---|
| llama3.1-70b | Llama 3.1 70B |
| llama3.1-8b | Llama 3.1 8B |