Evaluations

Evaluations are the durable execution control plane for judged task runs on Dreadnode.

An evaluation answers: How well does this execution configuration perform across a set of task definitions?

You provide:

  • a dataset or list of task names
  • execution settings such as model, concurrency, and timeout
  • runtime configuration
  • optional secret_ids to inject your selected user secrets into both the runtime sandbox and the task environment sandbox for each evaluation item
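The inputs above can be sketched as a request body for the create endpoint (`POST /api/v1/org/{org}/ws/{workspace}/evaluation`). This is a minimal illustration, not the authoritative schema: the field names beyond those mentioned here (`tasks`, `config`, `timeout_seconds`) and all values are assumptions.

```python
import json

# Hypothetical evaluation-creation payload; field names and values are
# illustrative, not the authoritative API schema.
payload = {
    "tasks": ["example-task-a", "example-task-b"],  # task names (or supply a dataset)
    "config": {
        "model": "example-model",    # execution settings
        "concurrency": 4,
        "timeout_seconds": 900,
    },
    # Secrets injected into both the runtime sandbox and the
    # task environment sandbox for each evaluation item.
    "secret_ids": ["sec_example"],
}
body = json.dumps(payload)
```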

The platform stores authoritative job and item state in Postgres. ClickHouse remains the analytics plane for traces, scores, timing, and sample-level telemetry. For task benchmarks, that telemetry comes from the runtime sandbox that hosts the agent loop. Per-item chat transcripts are stored as workspace-scoped artifacts in object storage, with a small pointer kept on the evaluation item metadata.

Each dataset row becomes one evaluation item.

Each item:

  • references one task definition
  • provisions a task environment sandbox plus a runtime sandbox
  • runs one judged execution
  • records pass/fail or infrastructure outcome

Tasks used in evaluations must define a verification config. If a task has no verification step, evaluation creation is rejected before any worker claims or provisions the item.

Typical item states are:

  • queued
  • claiming
  • provisioning
  • agent_running
  • agent_finished
  • verifying
  • passed, failed, timed_out, cancelled, or infra_error
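When polling an item (for example via `GET .../items/{item_id}`), the lifecycle above reduces to a terminal-state check. The state strings come from the list above; the helper name is illustrative.

```python
# Terminal states: once an item reaches one of these, it will not change again.
TERMINAL_STATES = {"passed", "failed", "timed_out", "cancelled", "infra_error"}

def is_terminal(state: str) -> bool:
    """True when an evaluation item has finished (verdict or infrastructure outcome)."""
    return state in TERMINAL_STATES
```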

The actual compute units are sandboxes. Evaluation items link to the sandbox rows used for:

  • the runtime that hosts the agent loop
  • the task environment derived from the task build

That split matters:

  • Postgres tracks job and item lifecycle
  • the shared sandbox ledger tracks runtime state and cost
  • ClickHouse stores emitted telemetry and derived analytics

The item-to-sandbox lineage is captured in the EvaluationItemSandboxORM join table so you can trace which runtimes backed each evaluation item.

The current API still exposes the runtime pointer as agent_sandbox_id. That name is kept for compatibility; conceptually it is the runtime sandbox.

The evaluation item run_id is also the runtime-side trace locator. That lets the platform expose evaluation-scoped runtime trace views without making traces the source of truth for job state.

In local development, the API can run a small in-process evaluation runner so queued items begin executing without a separate worker deployment.

  • EVALUATION_IN_PROCESS_WORKER_ENABLED=true enables the runner explicitly.
  • In ENVIRONMENT=local, it is enabled by default.
  • In non-local environments, keep it disabled and run a dedicated evaluation worker process instead.
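For local development, that configuration amounts to setting the environment variables named above, e.g.:

```shell
# Local development: the in-process runner is on by default.
export ENVIRONMENT=local

# Or enable the runner explicitly:
export EVALUATION_IN_PROCESS_WORKER_ENABLED=true
```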

Evaluation jobs and items are workspace-scoped API resources:

  • POST /api/v1/org/{org}/ws/{workspace}/evaluation
  • GET /api/v1/org/{org}/ws/{workspace}/evaluations
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}/items
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}/items/{item_id}
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}/items/{item_id}/transcript
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}/analytics
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}/traces
  • GET /api/v1/org/{org}/ws/{workspace}/evaluation/{evaluation_id}/traces/items

The analytics endpoint deliberately mirrors the SDK evaluation result shape. It promotes total_samples, passed_count, failed_count, error_count, and pass_rate, then adds a richer analytics_snapshot with task breakdowns, runtime duration rollups, a trace summary, and grouped errors.
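An illustrative analytics response, using only the promoted fields named above; the values are made up, the snapshot is truncated, and the assumption that pass_rate equals passed_count over total_samples is mine, not stated by the API.

```python
# Made-up analytics payload; only the promoted field names come from the docs.
analytics = {
    "total_samples": 20,
    "passed_count": 15,
    "failed_count": 4,
    "error_count": 1,
    "pass_rate": 0.75,  # assumed to be passed_count / total_samples
    "analytics_snapshot": {
        "task_breakdown": {},   # per-task rollups (truncated)
        "runtime_durations": {},
        "trace_summary": {},
        "errors": [],           # grouped errors (truncated)
    },
}

# Sanity checks a client might run on the promoted counters.
assert analytics["passed_count"] + analytics["failed_count"] + analytics["error_count"] == analytics["total_samples"]
```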

From the SDK CLI, you can inspect an item transcript with dreadnode evaluation transcript <evaluation_id> <item_id>.