Evaluations

Evaluations execute a task over a dataset and emit structured events, samples, and a final result.

  • Evaluation orchestrates execution of a task against a dataset.
  • @dn.evaluation wraps a task into an Evaluation.
  • EvalEvent (EvalStart, EvalSample, EvalEnd) streams progress.
  • Sample holds per-row input/output/metrics.
  • EvalResult aggregates metrics, pass/fail stats, and stop reasons.

```python
import dreadnode as dn
from dreadnode.scorers import contains


@dn.evaluation(
    dataset=[
        {"question": "What is Dreadnode?", "expected": "agent platform"},
        {"question": "What is an evaluation?", "expected": "dataset-driven"},
    ],
    scorers=[contains("agent platform")],
    assert_scores=["contains"],
    concurrency=4,
    max_errors=2,
)
async def answer(question: str, expected: str) -> str:
    return f"{question} -> {expected}"


result: dn.EvalResult = await answer.run()
print(result.pass_rate, len(result.samples))
```
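Conceptually, the pass rate reported by EvalResult is the fraction of samples whose asserted scores passed. A minimal stand-in sketch of that aggregation, using plain dicts rather than dreadnode's Sample type:

```python
# Stand-in sketch: aggregating a pass rate over per-sample assertion
# results. These dicts mimic samples; they are not dreadnode's Sample type.
samples = [
    {"input": "What is Dreadnode?", "passed": True},
    {"input": "What is an evaluation?", "passed": False},
]

# Fraction of samples whose assertions passed.
pass_rate = sum(s["passed"] for s in samples) / len(samples)
print(pass_rate, len(samples))  # 0.5 2
```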

Load from a dataset file with preprocessing


dataset_file accepts JSONL, CSV, JSON, or YAML files. Use preprocessor to normalize rows before scoring, and dataset_input_mapping to align dataset keys with task parameters.

```python
from pathlib import Path

import dreadnode as dn


def normalize(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    return [{"prompt": row["prompt"].strip()} for row in rows]


evaluation = dn.Evaluation(
    task="my_project.tasks.generate_answer",
    dataset_file=Path("data/eval.jsonl"),
    dataset_input_mapping={"prompt": "question"},
    preprocessor=normalize,
    concurrency=8,
)
result = await evaluation.run()
```
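The dataset file above is JSONL: one JSON object per line, each becoming a dataset row. A self-contained sketch of writing and reading such a file (the temporary path is illustrative):

```python
import json
import tempfile
from pathlib import Path

# JSONL: one JSON object per line; each object becomes a dataset row.
rows = [
    {"question": "What is Dreadnode?", "expected": "agent platform"},
    {"question": "What is an evaluation?", "expected": "dataset-driven"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in rows))
    # Parse it back, one row per non-empty line.
    loaded = [json.loads(line) for line in path.read_text().splitlines()]

print(loaded == rows)  # True
```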
Stream events as the run progresses:

```python
import dreadnode as dn

async with evaluation.stream() as events:
    async for event in events:
        if isinstance(event, dn.EvalEvent):
            print(type(event).__name__)
```
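The EvalStart/EvalSample/EvalEnd lifecycle can be consumed with a simple isinstance dispatch over an async stream. A self-contained sketch of that pattern, using stand-in dataclasses rather than the real dreadnode event types:

```python
import asyncio
from dataclasses import dataclass


# Stand-in event types mirroring the EvalStart/EvalSample/EvalEnd lifecycle.
@dataclass
class EvalStart:
    total: int


@dataclass
class EvalSample:
    index: int
    passed: bool


@dataclass
class EvalEnd:
    pass_rate: float


async def fake_stream():
    # Emit a start event, one event per sample, then an end event.
    yield EvalStart(total=2)
    yield EvalSample(index=0, passed=True)
    yield EvalSample(index=1, passed=False)
    yield EvalEnd(pass_rate=0.5)


async def main() -> list[str]:
    seen = []
    async for event in fake_stream():
        seen.append(type(event).__name__)
    return seen


print(asyncio.run(main()))  # ['EvalStart', 'EvalSample', 'EvalSample', 'EvalEnd']
```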