Evaluations

Evaluations execute a task over a dataset and emit structured events, samples, and a final result.

  • Evaluation orchestrates execution of a task against a dataset.
  • @dn.evaluation wraps a task into an Evaluation.
  • EvalEvent (EvalStart, EvalSample, EvalEnd) streams progress.
  • Sample holds per-row input/output/metrics.
  • EvalResult aggregates metrics, pass/fail stats, and stop reasons.

```python
import dreadnode as dn
from dreadnode.scorers import contains


@dn.evaluation(
    dataset=[
        {"question": "What is Dreadnode?", "expected": "agent platform"},
        {"question": "What is an evaluation?", "expected": "dataset-driven"},
    ],
    scorers=[contains("agent platform")],
    assert_scores=["contains"],
    concurrency=4,
    max_errors=2,
)
async def answer(question: str, expected: str) -> str:
    return f"{question} -> {expected}"


result: dn.EvalResult = await answer.run()
print(result.pass_rate, len(result.samples))
```
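Conceptually, the pass rate reported by EvalResult is the fraction of samples whose asserted scores passed. A minimal stand-in sketch of that aggregation, using plain dicts rather than dreadnode's Sample type:

```python
# Stand-in sketch: aggregating a pass rate over per-sample assertion
# results. These dicts mimic samples; they are not dreadnode's Sample type.
samples = [
    {"input": "What is Dreadnode?", "passed": True},
    {"input": "What is an evaluation?", "passed": False},
]

# Fraction of samples whose assertions passed.
pass_rate = sum(s["passed"] for s in samples) / len(samples)
print(pass_rate, len(samples))  # 0.5 2
```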

Load from a dataset file with preprocessing


dataset_file accepts JSONL, CSV, JSON, or YAML files. Use preprocessor to normalize rows before scoring, and dataset_input_mapping to align dataset keys with task parameters.

```python
from pathlib import Path

import dreadnode as dn


def normalize(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    return [{"prompt": row["prompt"].strip()} for row in rows]


evaluation = dn.Evaluation(
    task="my_project.tasks.generate_answer",
    dataset_file=Path("data/eval.jsonl"),
    dataset_input_mapping={"prompt": "question"},
    preprocessor=normalize,
    concurrency=8,
)
result = await evaluation.run()
```
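The dataset file above is JSONL: one JSON object per line, each becoming a dataset row. A self-contained sketch of writing and reading such a file (the temporary path is illustrative):

```python
import json
import tempfile
from pathlib import Path

# JSONL: one JSON object per line; each object becomes a dataset row.
rows = [
    {"question": "What is Dreadnode?", "expected": "agent platform"},
    {"question": "What is an evaluation?", "expected": "dataset-driven"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in rows))
    # Parse it back, one row per non-empty line.
    loaded = [json.loads(line) for line in path.read_text().splitlines()]

print(loaded == rows)  # True
```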
Stream events as the run progresses:

```python
import dreadnode as dn

async with evaluation.stream() as events:
    async for event in events:
        if isinstance(event, dn.EvalEvent):
            print(type(event).__name__)
```
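The EvalStart/EvalSample/EvalEnd lifecycle can be consumed with a simple isinstance dispatch over an async stream. A self-contained sketch of that pattern, using stand-in dataclasses rather than the real dreadnode event types:

```python
import asyncio
from dataclasses import dataclass


# Stand-in event types mirroring the EvalStart/EvalSample/EvalEnd lifecycle.
@dataclass
class EvalStart:
    total: int


@dataclass
class EvalSample:
    index: int
    passed: bool


@dataclass
class EvalEnd:
    pass_rate: float


async def fake_stream():
    # Emit a start event, one event per sample, then an end event.
    yield EvalStart(total=2)
    yield EvalSample(index=0, passed=True)
    yield EvalSample(index=1, passed=False)
    yield EvalEnd(pass_rate=0.5)


async def main() -> list[str]:
    seen = []
    async for event in fake_stream():
        seen.append(type(event).__name__)
    return seen


print(asyncio.run(main()))  # ['EvalStart', 'EvalSample', 'EvalSample', 'EvalEnd']
```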