Evaluations

Evaluations are the durable execution control plane for judged task runs on Dreadnode.

An evaluation answers: How well does this execution configuration perform across a set of task definitions?

You provide:

  • a dataset or list of task names
  • execution settings such as model, concurrency, and timeout
  • runtime configuration
  • optional secret_ids to inject your selected user secrets into both the runtime sandbox and the task environment sandbox for each evaluation item
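The inputs above can be sketched as a request body for the create endpoint (`POST /api/v1/org/{org}/ws/{workspace}/evaluation`). This is a minimal illustration, not the authoritative schema: the field names beyond those mentioned here (`tasks`, `config`, `timeout_seconds`) and all values are assumptions.

```python
import json

# Hypothetical evaluation-creation payload; field names and values are
# illustrative, not the authoritative API schema.
payload = {
    "tasks": ["example-task-a", "example-task-b"],  # task names (or supply a dataset)
    "config": {
        "model": "example-model",    # execution settings
        "concurrency": 4,
        "timeout_seconds": 900,
    },
    # Secrets injected into both the runtime sandbox and the
    # task environment sandbox for each evaluation item.
    "secret_ids": ["sec_example"],
}
body = json.dumps(payload)
```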

The platform stores authoritative job and item state in Postgres. ClickHouse remains the analytics plane for traces, scores, timing, and sample-level telemetry. For task benchmarks, that telemetry comes from the runtime sandbox that hosts the agent loop. Per-item chat transcripts are stored as workspace-scoped artifacts in object storage, with a small pointer kept on the evaluation item metadata.

Each dataset row becomes one evaluation item.

Each item:

  • references one task definition
  • provisions a task environment sandbox plus a runtime sandbox
  • runs one judged execution
  • records pass/fail or infrastructure outcome

Tasks used in evaluations must define a verification config. If a task has no verification step, evaluation creation is rejected before any worker claims or provisions the item.

Typical item states are:

  • queued
  • claiming
  • provisioning
  • agent_running
  • agent_finished
  • verifying
  • passed, failed, timed_out, cancelled, or infra_error
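When polling an item (for example via `GET .../items/{item_id}`), the lifecycle above reduces to a terminal-state check. The state strings come from the list above; the helper name is illustrative.

```python
# Terminal states: once an item reaches one of these, it will not change again.
TERMINAL_STATES = {"passed", "failed", "timed_out", "cancelled", "infra_error"}

def is_terminal(state: str) -> bool:
    """True when an evaluation item has finished (verdict or infrastructure outcome)."""
    return state in TERMINAL_STATES
```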

The actual compute units are sandboxes. Evaluation items link to the sandbox rows used for:

  • the runtime that hosts the agent loop
  • the task environment derived from the task build

That split matters:

  • Postgres tracks job and item lifecycle
  • the shared sandbox ledger tracks runtime state and cost
  • ClickHouse stores emitted telemetry and derived analytics

The item-to-sandbox lineage is captured in the EvaluationItemSandboxORM join table so you can trace which runtimes backed each evaluation item.

The current API still exposes the runtime pointer as agent_sandbox_id. That name is kept for compatibility; conceptually it is the runtime sandbox.

The evaluation item run_id is also the runtime-side trace locator. That lets the platform expose evaluation-scoped runtime trace views without making traces the source of truth for job state.

In local development, the API can run a small in-process evaluation runner so queued items begin executing without a separate worker deployment.

  • EVALUATION_IN_PROCESS_WORKER_ENABLED=true enables the runner explicitly.
  • In ENVIRONMENT=local, it is enabled by default.
  • In non-local environments, keep it disabled and run a dedicated evaluation worker process instead.
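For local development, that configuration amounts to setting the environment variables named above, e.g.:

```shell
# Local development: the in-process runner is on by default.
export ENVIRONMENT=local

# Or enable the runner explicitly:
export EVALUATION_IN_PROCESS_WORKER_ENABLED=true
```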

Evaluation jobs and items are workspace-scoped API resources:

  • POST /api/v1/org/{org}/ws/{workspace}/evaluation
  • GET /api/v1/org/{org}/ws/{workspace}/evaluations
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}/items
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}/items/{item_id}
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}/items/{item_id}/transcript
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}/analytics
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}/traces
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}/traces/items

The analytics endpoint deliberately mirrors the SDK evaluation result shape. It promotes total_samples, passed_count, failed_count, error_count, and pass_rate, then adds a richer analytics_snapshot with task breakdowns, runtime duration rollups, a trace summary, and grouped errors.
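An illustrative analytics response, using only the promoted fields named above; the values are made up, the snapshot is truncated, and the assumption that pass_rate equals passed_count over total_samples is mine, not stated by the API.

```python
# Made-up analytics payload; only the promoted field names come from the docs.
analytics = {
    "total_samples": 20,
    "passed_count": 15,
    "failed_count": 4,
    "error_count": 1,
    "pass_rate": 0.75,  # assumed to be passed_count / total_samples
    "analytics_snapshot": {
        "task_breakdown": {},   # per-task rollups (truncated)
        "runtime_durations": {},
        "trace_summary": {},
        "errors": [],           # grouped errors (truncated)
    },
}

# Sanity checks a client might run on the promoted counters.
assert analytics["passed_count"] + analytics["failed_count"] + analytics["error_count"] == analytics["total_samples"]
```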

From the SDK CLI, you can inspect an item transcript with dreadnode evaluation transcript <evaluation_id> <item_id>.