Training

The Python SDK now includes typed models and ApiClient methods for the hosted training control plane.

Hosted training methods live on dreadnode.app.api.client.ApiClient:

  • create_training_job()
  • list_training_jobs()
  • get_training_job()
  • cancel_training_job()
  • retry_training_job()
  • list_training_job_logs()
  • get_training_job_artifacts()
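
These follow the usual submit-then-poll pattern. As a hedged sketch (the `wait_for_job` helper, the `status` field, and the terminal status names are illustrative, not part of the SDK), polling a job to completion could look like:

```python
import time


def wait_for_job(get_job, org, project, job_id, poll_seconds=1.0, max_polls=60):
    """Poll any get-job callable until the job reaches a terminal status.

    `get_job` is whatever returns the current job state, e.g. a bound
    client.get_training_job; the terminal status names are illustrative.
    """
    terminal = {"completed", "failed", "cancelled"}
    for _ in range(max_polls):
        job = get_job(org, project, job_id)
        if job["status"] in terminal:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

In practice you would pass client.get_training_job and inspect the returned typed model rather than a plain dict.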

Use explicit request types for each backend and trainer combination:

from dreadnode.app.api.client import ApiClient
from dreadnode.app.api.models import (
    CapabilityRef,
    CreateTinkerSFTJobRequest,
    DatasetRef,
    TinkerSFTJobConfig,
)

client = ApiClient("https://api.example.com", api_key="dn_...")
job = client.create_training_job(
    "acme",
    "default",
    CreateTinkerSFTJobRequest(
        model="meta-llama/Llama-3.1-8B-Instruct",
        capability_ref=CapabilityRef(name="assistant", version="1.2.0"),
        config=TinkerSFTJobConfig(
            dataset_ref=DatasetRef(name="acme/default", version="train"),
            batch_size=8,
            lora_rank=16,
        ),
    ),
)

For Worlds trajectory datasets, use trajectory_dataset_refs instead of a plain SFT dataset:

job = client.create_training_job(
    "acme",
    "default",
    CreateTinkerSFTJobRequest(
        model="meta-llama/Llama-3.1-8B-Instruct",
        capability_ref=CapabilityRef(name="assistant", version="1.2.0"),
        config=TinkerSFTJobConfig(
            trajectory_dataset_refs=[
                DatasetRef(name="acme/worlds-trajectories-a", version="0.1.0"),
                DatasetRef(name="acme/worlds-trajectories-b", version="0.1.0"),
            ],
            batch_size=8,
        ),
    ),
)

For prompt-dataset RL jobs, prompt_dataset_ref is nested under the RL config payload:

from dreadnode.app.api.models import (
    CapabilityRef,
    CreateTinkerRLJobRequest,
    DatasetRef,
    TinkerRLJobConfig,
)

request = CreateTinkerRLJobRequest(
    model="meta-llama/Llama-3.1-8B-Instruct",
    capability_ref=CapabilityRef(name="web-agent", version="2.0.1"),
    config=TinkerRLJobConfig(
        algorithm="importance_sampling",
        task_ref="security-mutillidae-sqli-login-bypass",
        prompt_dataset_ref=DatasetRef(name="seed-prompts", version="sqli-v1"),
        reward_recipe={"name": "task_verifier_v1"},
        execution_mode="fully_async",
        prompt_split="train",
        steps=10,
        max_steps_off_policy=3,
        num_rollouts=32,
        lora_rank=16,
        max_new_tokens=128,
        temperature=0.1,
        stop=["</answer>"],
    ),
)

For Worlds-driven offline RL, use trajectory_dataset_refs instead. In this mode the sandbox runtime converts each published trajectory into assistant-step prompt rows and defaults to trajectory_imitation_v1 when no explicit reward recipe is supplied. The published Worlds dataset now carries trajectory outcome metadata, so matched steps inherit the recorded trajectory reward weight instead of using a flat imitation score:

request = CreateTinkerRLJobRequest(
    model="meta-llama/Llama-3.1-8B-Instruct",
    capability_ref=CapabilityRef(name="worlds-agent", version="2.0.1"),
    config=TinkerRLJobConfig(
        algorithm="importance_sampling",
        trajectory_dataset_refs=[
            DatasetRef(name="acme/worlds-trajectories-a", version="0.1.0"),
            DatasetRef(name="acme/worlds-trajectories-b", version="0.1.0"),
        ],
        steps=10,
        num_rollouts=32,
        lora_rank=16,
    ),
)
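
The reward-weight inheritance described above can be sketched in plain Python (the row shape and field names here are illustrative, not the actual sandbox schema): a matched assistant step inherits the trajectory's recorded reward weight when the published dataset carries outcome metadata, and falls back to a flat imitation score otherwise.

```python
FLAT_IMITATION_SCORE = 1.0  # illustrative fallback, not the real constant


def step_reward_weight(trajectory_meta: dict) -> float:
    """Return the reward weight for a matched assistant-step prompt row.

    Uses the trajectory's recorded outcome weight when present;
    otherwise falls back to a flat imitation score.
    """
    weight = trajectory_meta.get("reward_weight")
    return float(weight) if weight is not None else FLAT_IMITATION_SCORE
```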

For Worlds-first RL, use a live manifest plus a runtime id. The control plane samples native-agent Worlds trajectories first, publishes them as a dataset, and then runs the existing offline/async RL runtime against that published dataset:

request = CreateTinkerRLJobRequest(
    model="meta-llama/Llama-3.1-8B-Instruct",
    capability_ref=CapabilityRef(name="worlds-agent", version="2.0.1"),
    config=TinkerRLJobConfig(
        algorithm="importance_sampling",
        world_manifest_id="c8af2b7b-9b54-4b21-95a9-b8d403cd8c11",
        world_runtime_id="8b8fd3af-9a5e-47c8-9f67-7b87ca9387eb",
        world_agent_name="operator",
        world_goal="Escalate to Domain Admin in corp.local",
        execution_mode="fully_async",
        max_steps_off_policy=3,
        num_rollouts=4,
        max_turns=8,
    ),
)

When world_runtime_id is present, hosted RL treats Worlds-published native-agent datasets as the primary input path. The selected runtime and capability generate trajectories inside Worlds, and the training sandbox then reuses the same offline/async RL runtime used for published trajectory datasets.

If you need live rollout-time reward shaping, keep using world_reward. That preserves the older HTTP-backed live-rollout bridge instead of the new dataset-primary path:

from dreadnode.app.api.models import WorldRewardPolicy

request = CreateTinkerRLJobRequest(
    model="meta-llama/Llama-3.1-8B-Instruct",
    capability_ref=CapabilityRef(name="worlds-agent", version="2.0.1"),
    config=TinkerRLJobConfig(
        algorithm="importance_sampling",
        world_manifest_id="c8af2b7b-9b54-4b21-95a9-b8d403cd8c11",
        world_goal="Escalate to Domain Admin in corp.local",
        world_reward=WorldRewardPolicy(
            name="goal_only_v1",
            params={"success_reward": 2.0},
        ),
    ),
)

The API validates refs on submission:

  • dataset refs are structured objects with explicit name and version
  • task refs can use task-name for the latest visible task or task-name@1.2.0 for an explicit version

Current hosted SFT behavior:

  • datasets can provide full messages conversations or simple prompt/answer rows
  • Worlds trajectory datasets can be supplied through trajectory_dataset_refs and are converted from ATIF into SFT conversations inside the sandbox runtime
  • capability prompts are injected as the system scaffold before tokenization
  • eval runs if eval_dataset_ref is supplied

Current hosted RL behavior and limitations:

  • task_verifier_v1 currently supports flag-based verification only
  • Tinker RL supports:
    • prompt-dataset RL
    • offline Worlds trajectory RL
    • runtime-driven Worlds manifest sampling into published native-agent datasets
    • the older live Worlds manifest bridge when world_reward is supplied
  • hosted Tinker RL now supports:
    • execution_mode="sync"
    • execution_mode="one_step_off_async"
    • execution_mode="fully_async"
  • the async modes are rollout-group schedulers:
    • one_step_off_async keeps one rollout group in flight and bounds staleness to one step
    • fully_async widens the queue to bounded multi-group async training using max_steps_off_policy
    • neither mode is a partial-rollout continuation runtime yet
  • the primary Worlds RL path now pre-samples native-agent trajectories from the selected manifest and runtime, then trains from the published dataset
  • live Worlds RL over HTTP is now the compatibility path used when you explicitly request a world_reward
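
The staleness bound those modes enforce can be sketched as a simple admission check (a hypothetical helper; the real schedulers manage rollout groups, not single values): a rollout group sampled at policy step s is trainable at step t only if t - s stays within the bound.

```python
def is_rollout_usable(sampled_at_step: int, current_step: int, max_steps_off_policy: int) -> bool:
    """Bounded-staleness admission check for async RL training.

    sync behaves like max_steps_off_policy == 0, one_step_off_async like 1,
    and fully_async uses the configured max_steps_off_policy.
    """
    return 0 <= current_step - sampled_at_step <= max_steps_off_policy
```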

For self-hosted workers, set TINKER_BASE_URL to point the executor at a non-default Tinker service endpoint.

For agentic Worlds data collection, the SDK now includes a concrete rollout helper under dreadnode.training.rollouts. It wraps a normal SDK Agent, attaches reward/trace hooks, and returns a RolloutResult that can seed a later RL loop.

from dreadnode import Agent
from dreadnode.training.rollouts import (
    CompositeWorldsRewardShaper,
    HostDiscoveryRewardShaper,
    ReasoningTraceRewardShaper,
    TerminalStateRewardShaper,
    run_worlds_agent_rollout,
)

agent = Agent(
    model="openai/gpt-5",
    instructions="Enumerate the AD environment and escalate toward Domain Admin.",
    tools=[...],
)
result = await run_worlds_agent_rollout(
    agent,
    "Enumerate the domain controller and gather credentials.",
    reward_shaper=CompositeWorldsRewardShaper(
        ReasoningTraceRewardShaper(value=0.05),
        HostDiscoveryRewardShaper(value=0.25),
        TerminalStateRewardShaper(success_reward=2.0),
    ),
)
print(result.final_reward)
print(result.metadata["turns"][0]["reasoning_content"])

This is still a prototype surface:

  • rewards are attached through agent hooks and can be defined as composable shapers in dreadnode.training.rollouts.worlds
  • the result is built from the SDK Agent event stream, not the current algorithmic Worlds walker
  • it is intended as the basis for a later sandbox-backed Worlds RL loop
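
The composable-shaper idea can be sketched independently of the SDK (the class names below are illustrative, not the dreadnode.training.rollouts implementation): each shaper scores a rollout's event stream, and a composite sums their contributions.

```python
class FlagFoundShaper:
    """Illustrative shaper: fixed bonus when any event reports a flag."""

    def __init__(self, value: float) -> None:
        self.value = value

    def score(self, events: list[dict]) -> float:
        return self.value if any(e.get("flag_found") for e in events) else 0.0


class CompositeShaper:
    """Sum the scores of any number of child shapers."""

    def __init__(self, *shapers) -> None:
        self.shapers = shapers

    def score(self, events: list[dict]) -> float:
        return sum(s.score(events) for s in self.shapers)
```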

The SDK now includes sandbox-facing payload and result contracts in dreadnode.training.jobs, plus a module entrypoint that a training sandbox can run directly:

python -m dreadnode.training.jobs \
    --payload /tmp/dreadnode-training/payloads/job-123.json \
    --result /tmp/dreadnode-training/results/job-123.json

That boundary is intentionally narrow:

  • the API resolves refs and writes the payload JSON
  • the sandbox runtime reads DREADNODE_* and TINKER_* env vars
  • the SDK runtime executes the job and writes a structured result JSON back out
  • the current job runtime supports:
    • hosted Tinker SFT
    • prompt-dataset Tinker RL in synchronous mode
    • prompt-dataset Tinker RL in one-step-off async mode
    • online Worlds-manifest Tinker RL in synchronous or one-step-off async mode

The SDK also now includes reusable ETL helpers for converting Worlds ATIF trajectory datasets into chat-template style examples:

from pathlib import Path

from dreadnode.training.etl import (
    convert_atif_trajectories_to_chat_template,
    load_atif_trajectories_jsonl,
)

trajectories = load_atif_trajectories_jsonl(Path("trajectories.atif.jsonl"))
examples = convert_atif_trajectories_to_chat_template(
    trajectories,
    tool_mode="command",
)
print(examples[0]["messages"][0]["role"])
print(examples[0]["tools"][0]["function"]["name"])

This is the reusable library path that hosted SFT jobs now use for published Worlds trajectory datasets. It replaces the old script-shaped conversion logic in dreadnode.training.utils.

For hosted SFT preparation, the SDK also exposes reusable normalization helpers under dreadnode.training.etl.sft for turning dataset records into chat conversations with an optional injected system prompt.

Local training helpers live under dreadnode.training and wrap Ray-based trainers. These are useful for iterating on reward functions or fine-tuning on local hardware.

from dreadnode.training import train_dpo, train_grpo, train_ppo, train_sft

def reward_fn(prompts: list[str], completions: list[str]) -> list[float]:
    return [0.0 for _ in completions]

train_sft({"model_name": "meta-llama/Llama-3.1-8B-Instruct"}, prompts=["hello"])
train_dpo({"model_name": "meta-llama/Llama-3.1-8B-Instruct"}, prompts=["hello"])
train_grpo({"model_name": "meta-llama/Llama-3.1-8B-Instruct"}, prompts=["hello"], reward_fn=reward_fn)
train_ppo({"model_name": "meta-llama/Llama-3.1-8B-Instruct"}, prompts=["hello"], reward_fn=reward_fn)

For more control, use the underlying trainers directly:

from dreadnode.training import RayGRPOConfig, RayGRPOTrainer

config = RayGRPOConfig(model_name="meta-llama/Llama-3.1-8B-Instruct")
trainer = RayGRPOTrainer(config)
trainer.train(prompts=["hello"], reward_fn=lambda prompts, completions: [0.0])

The SDK also exposes trainer classes for managed execution on cloud backends:

  • AnyscaleTrainer
  • AzureMLTrainer
  • SageMakerTrainer
  • PrimeTrainer

Import them from dreadnode.training alongside their corresponding config objects.