Workflow Engine

How multi-agent workflows are orchestrated — step execution modes, retry logic, variable passing, PII boundaries, and memory isolation.

Workflow Execution Overview

A workflow is a named sequence of steps where each step routes a task to a specific agent. The engine supports sequential pipelines, parallel fan-out, conditional branching, and iterative loops.

Step Execution Modes

Sequential (default)

Steps execute one after another. Each step receives the previous step's output as {{input}}.

FanOut

Consecutive FanOut steps execute in parallel. All start with the same input (the last sequential step's output).

Implementation: all consecutive FanOut steps are gathered into a batch and awaited together via futures::future::join_all. If any step in the batch fails, the entire workflow fails; fan-out has no partial completion.
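The batch semantics can be sketched without an async runtime. The example below is only an illustration, not the engine's code: it swaps std::thread::scope in for futures::future::join_all so it runs with the standard library alone, and StepFn, run_fan_out, tag_a, tag_b, and always_fails are hypothetical names.

```rust
use std::thread;

/// A step body: takes the shared input, returns output or an error message.
type StepFn = fn(&str) -> Result<String, String>;

/// Runs every step concurrently on the same input and waits for all of them.
/// Outputs come back in step order; if any step failed, the whole batch fails.
fn run_fan_out(input: &str, steps: &[StepFn]) -> Result<Vec<String>, String> {
    let results: Vec<Result<String, String>> = thread::scope(|scope| {
        // Spawn every step before joining any, so the steps overlap.
        let handles: Vec<_> = steps
            .iter()
            .copied()
            .map(|step| scope.spawn(move || step(input)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap_or_else(|_| Err("step panicked".to_string())))
            .collect()
    });
    // Collecting into a Result returns the first error and drops partial outputs.
    results.into_iter().collect()
}

// Hypothetical step bodies used only for illustration.
fn tag_a(input: &str) -> Result<String, String> {
    Ok(format!("A:{}", input))
}

fn tag_b(input: &str) -> Result<String, String> {
    Ok(format!("B:{}", input))
}

fn always_fails(_input: &str) -> Result<String, String> {
    Err("boom".to_string())
}
```

As with the join_all approach described above, every step runs to completion before the batch result is decided; the sketch does not cancel sibling steps when one fails.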

Collect

Aggregates all preceding FanOut step outputs into a single input for the next step. Outputs are joined with \n\n---\n\n separators.
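A minimal sketch of the aggregation, assuming outputs are joined in step order (the ordering and the names COLLECT_SEPARATOR and collect_outputs are assumptions, not taken from the engine):

```rust
/// Separator placed between fan-out outputs, as described above.
const COLLECT_SEPARATOR: &str = "\n\n---\n\n";

/// Joins fan-out outputs, in the order given, into one input string.
fn collect_outputs(outputs: &[String]) -> String {
    outputs.join(COLLECT_SEPARATOR)
}
```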

Conditional

A step that only executes if the previous step's output contains a specific string (case-insensitive match).

{
  "mode": {"Conditional": {"condition": "needs_review"}},
  "prompt_template": "Review this: {{input}}"
}

If the condition is not met, the step is skipped and execution continues with the next step.
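The check itself is a case-insensitive substring test. A minimal sketch, assuming simple lowercase folding of both sides (condition_met is a hypothetical name, and the engine may normalize case differently):

```rust
/// Returns true when `output` contains `condition`, ignoring case.
fn condition_met(output: &str, condition: &str) -> bool {
    output.to_lowercase().contains(&condition.to_lowercase())
}
```

With the step definition above, an upstream output such as "Flagged: NEEDS_REVIEW" would run the review step, while an output that never mentions needs_review would skip it.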

Loop

Repeats a step until:

The output contains the until string (case-insensitive), OR
max_iterations is reached

{
  "mode": {"Loop": {"max_iterations": 5, "until": "DONE"}},
  "prompt_template": "Refine this draft: {{input}}"
}

Each iteration's output becomes the next iteration's input. Step results are logged as "step_name (iter N)".
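The loop's control flow can be sketched as follows. This is a simplified illustration in which the agent call is replaced by a closure; run_loop is a hypothetical name, and numbering iterations from 1 is an assumption.

```rust
/// Runs `step` repeatedly, feeding each output back in as the next input.
/// Stops when an output contains `until` (case-insensitive) or after
/// `max_iterations` runs. Returns (label, output) for every iteration.
fn run_loop<F>(
    step_name: &str,
    input: &str,
    max_iterations: u32,
    until: &str,
    mut step: F,
) -> Vec<(String, String)>
where
    F: FnMut(&str) -> String,
{
    let needle = until.to_lowercase();
    let mut current = input.to_string();
    let mut results = Vec::new();
    for iter in 1..=max_iterations {
        current = step(&current);
        results.push((format!("{} (iter {})", step_name, iter), current.clone()));
        if current.to_lowercase().contains(&needle) {
            break;
        }
    }
    results
}
```

The last entry in the returned list holds the output that would be passed on to the next step.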

Error Handling

Each step declares an error_mode that controls behavior on failure:

ErrorMode::Fail (default)

The step must succeed. On failure or timeout, the entire workflow is marked Failed and execution stops.

ErrorMode::Skip

On failure or timeout, the step is skipped and execution continues with the next step. A warning is logged.

ErrorMode::Retry

The step is retried up to max_retries times before the workflow fails.

On each retry:

The same prompt is sent to the same agent
A warning is logged with the attempt number and error
No backoff delay between retries (immediate retry)
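A minimal sketch of the retry behavior listed above, assuming max_retries counts attempts after the first one (so at most max_retries + 1 calls in total). run_with_retry is a hypothetical name, and eprintln! stands in for the engine's warning log.

```rust
/// Calls `attempt` once, then up to `max_retries` more times on failure,
/// with no delay between attempts. Returns the first success or the last error.
fn run_with_retry<T, E, F>(max_retries: u32, mut attempt: F) -> Result<T, E>
where
    E: std::fmt::Display,
    F: FnMut() -> Result<T, E>,
{
    let mut retries_used = 0;
    loop {
        match attempt() {
            Ok(value) => return Ok(value),
            Err(err) if retries_used < max_retries => {
                retries_used += 1;
                // Stand-in for the logged warning: attempt number and error.
                eprintln!("warning: retry {}/{} after error: {}", retries_used, max_retries, err);
            }
            // Retries exhausted: surface the error so the workflow can fail.
            Err(err) => return Err(err),
        }
    }
}
```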

Template Variable System

Steps can store their output in named variables and reference them later:

{
  "steps": [
    {
      "name": "extract",
      "prompt_template": "Extract key data from: {{input}}",
      "output_var": "extracted_data"
    },
    {
      "name": "validate",
      "prompt_template": "Validate: {{extracted_data}}",
      "output_var": "validation_result"
    },
    {
      "name": "decide",
      "prompt_template": "Given data: {{extracted_data}} and validation: {{validation_result}}, make a decision"
    }
  ]
}

Variable expansion:

{{input}} — the current step input (previous step's output for sequential, or initial input for the first step)
{{var_name}} — any previously stored output_var
Variables persist for the entire workflow run
Undefined variables are left as-is (not replaced)
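The expansion rules above can be sketched as a single-pass substitution. This is an illustration only: expand_template is a hypothetical name, variable names are matched exactly (no whitespace trimming), and substituted values are not rescanned for further placeholders, all of which are assumptions.

```rust
use std::collections::HashMap;

/// Replaces each `{{name}}` placeholder with its value from `vars`.
/// Placeholders with no matching variable are copied through unchanged.
fn expand_template(template: &str, vars: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find("{{") {
        out.push_str(&rest[..start]);
        let after_open = &rest[start + 2..];
        match after_open.find("}}") {
            Some(end) => {
                let name = &after_open[..end];
                match vars.get(name) {
                    Some(value) => out.push_str(value),
                    None => {
                        // Undefined variable: keep the placeholder text as-is.
                        out.push_str("{{");
                        out.push_str(name);
                        out.push_str("}}");
                    }
                }
                rest = &after_open[end + 2..];
            }
            None => {
                // Unclosed "{{": copy the remainder verbatim and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

In this sketch, {{input}} is treated as just another entry in the map, set by the caller before each step.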

PII Boundary Enforcement

At each step boundary, data passes through the unified carrier PII policy (ADR-049), not a per-agent trust policy. There is no agent-level PII configuration; the same operator-owned carrier-pii-policy.toml governs every step boundary in every workflow.

How It Works

Detection: the outbound data is scanned for PII spans (SSNs, policy numbers, names, etc.)
Tokenization: detected spans are replaced with opaque tokens (e.g., [PII_SSN_001]), and a token map records the original values. This is the unconditional substrate, not a configurable choice.
Delivery: the tokenized text is sent to the next step's agent
Hydration: restoring a token to its real value requires an explicit rule in the carrier policy file. On the LLM dimension, tokens are never hydrated for a cloud model; on the tool dimension, hydration requires a named (destination, PII-type) hydrate entry. There is no per-agent override anywhere in this path.

This is a fully implemented, actively-enforced pipeline (detector → tokenizer → hydrator, crates/platform/hoziron-core/src/pii/), not a placeholder — see pii-data-protection.md for the full policy model.
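To make the detect-then-tokenize steps concrete, here is a deliberately simplified sketch. It is not the detector in crates/platform/hoziron-core/src/pii/: it recognizes only SSN-shaped spans (ddd-dd-dddd), does no word-boundary checking, and is_ssn_at and tokenize_ssns are hypothetical names. Only the [PII_SSN_001] token shape is taken from the text above.

```rust
use std::collections::HashMap;

/// Length in bytes of an SSN-shaped span: ddd-dd-dddd.
const SSN_LEN: usize = 11;

/// True when `bytes[i..]` starts with an SSN-shaped span.
fn is_ssn_at(bytes: &[u8], i: usize) -> bool {
    const SHAPE: &[u8; SSN_LEN] = b"ddd-dd-dddd";
    if i + SSN_LEN > bytes.len() {
        return false;
    }
    SHAPE
        .iter()
        .zip(&bytes[i..i + SSN_LEN])
        .all(|(shape, b)| match *shape {
            b'd' => b.is_ascii_digit(),
            _ => *b == b'-',
        })
}

/// Replaces each SSN-shaped span with an opaque token and returns the
/// tokenized text plus a token map holding the original values.
fn tokenize_ssns(text: &str) -> (String, HashMap<String, String>) {
    let bytes = text.as_bytes();
    let mut out = String::with_capacity(text.len());
    let mut token_map = HashMap::new();
    let mut i = 0;
    let mut copied_from = 0;
    while i < bytes.len() {
        if is_ssn_at(bytes, i) {
            // A match starts and ends on ASCII bytes, so slicing is UTF-8 safe.
            out.push_str(&text[copied_from..i]);
            let token = format!("[PII_SSN_{:03}]", token_map.len() + 1);
            token_map.insert(token.clone(), text[i..i + SSN_LEN].to_string());
            out.push_str(&token);
            i += SSN_LEN;
            copied_from = i;
        } else {
            i += 1;
        }
    }
    out.push_str(&text[copied_from..]);
    (out, token_map)
}
```

Only the tokenized string would be delivered to the next agent; the token map stays on the engine side, and hydration remains gated by the carrier policy rules described above.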

Outbound Egress From a Workflow Step (ADR-042, ADR-047)

When a step's agent makes an outbound tool call to a carrier system (not just an inter-agent handoff), that call crosses the same CoreOutboundMediator seam as any other agent invocation. The MediationContext carries workflow_run_id, business_identity, and step_index, so a multi-write value-event (e.g. an endorsement writing both a policy change and a premium adjustment) can be reconciled as one signature rather than double-counted. See ../architecture/data-flow.md for the full sequence, and provider-routing.md and pii-data-protection.md for the mediator itself.

A mediation rejection or a hydration failure mid-workflow does not silently drop the run. Instead, the run moves to RunState::Escalated (see Workflow Run Lifecycle below), which avoids both completing it dishonestly and discarding the work. By the same rule, an escalated run is an incomplete value-event and draws no billing credit until it is completed or explicitly resolved.

Billable Value Events (ADR-043, ADR-046)

A workflow's outbound writes to licence-specified carrier targets are the unit Hoziron charges for — never workflow completion itself, and never a read. Key properties relevant to workflow authors and operators:

Detection is at the mediation seam, not the workflow engine. The engine has no billing logic of its own; classification (write vs. read, which target, what it costs) is licence-declared and core-owned, keyed only on the resolved MCP/REST/SOAP target — never on a workflow's name, step count, or shape. Reshaping a workflow does not change what it bills.
A value-event fires only on a complete write signature. Some billable events (e.g. an endorsement) require multiple writes across targets before they count as a single event; a partial signature draws nothing, the same principle as the escalation behavior above.
Deduplication is per (operation_type, business_identity, billing_window) — a repeat write of the same operation to the same business entity within the same calendar month draws once.
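The deduplication rule in the last item can be sketched as a set keyed on the three fields. This is illustrative only: DedupKey and ValueEventLedger are hypothetical names, and representing the billing window as a "YYYY-MM" string is an assumption.

```rust
use std::collections::HashSet;

/// The deduplication key described above.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct DedupKey {
    operation_type: String,
    business_identity: String,
    /// Calendar month of the write, e.g. "2025-06" (format is an assumption).
    billing_window: String,
}

/// Remembers which keys have already drawn a charge.
#[derive(Default)]
struct ValueEventLedger {
    seen: HashSet<DedupKey>,
}

impl ValueEventLedger {
    /// Returns true if this event draws a charge, false if it repeats a key
    /// already recorded in the same billing window.
    fn record(&mut self, key: DedupKey) -> bool {
        self.seen.insert(key)
    }
}
```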

Observability: Tracing Spans

Workflow execution emits structured tracing spans that are nested by construction; the nesting is verified by span-recorder tests (Issue #665).

The workflow_run span lives on the kernel's WorkflowEngine::execute_run_from, not on hoziron-core's emit_run_started, so it wraps the run's actual duration. workflow_step, agent_invocation, and llm_call spans carry workflow_id, mode, duration_ms, and token-usage fields for per-step performance and cost analysis.

Memory Isolation

Each agent participating in a workflow maintains its own separate memory scope:

The kernel's BoundMemoryHandle is bound to one agent's ID at construction, with no method accepting a free agent_id parameter
Cross-agent memory access is structurally impossible through the handle, not merely rejected at runtime
Data passes between agents only through step outputs (with PII tokenization)
This is enforced at the memory layer, not the workflow layer — even if an agent somehow obtains another agent's ID, there is no operation through which it could use that ID to reach another agent's memory
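The structural guarantee can be illustrated with a handle whose methods never take an agent ID. The sketch below is not the kernel's BoundMemoryHandle: MemoryStore, put, and get are hypothetical, and AgentId is modeled as a plain string wrapper for brevity.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct AgentId(String);

/// Hypothetical backing store keyed by (agent, key).
#[derive(Default)]
struct MemoryStore {
    entries: HashMap<(AgentId, String), String>,
}

/// A handle bound to exactly one agent at construction.
/// In real code the fields would be private to the memory module.
struct BoundMemoryHandle<'a> {
    agent_id: AgentId,
    store: &'a mut MemoryStore,
}

impl<'a> BoundMemoryHandle<'a> {
    fn new(store: &'a mut MemoryStore, agent_id: AgentId) -> Self {
        BoundMemoryHandle { agent_id, store }
    }

    /// Writes into the bound agent's scope; no agent_id parameter exists.
    fn put(&mut self, key: &str, value: &str) {
        self.store
            .entries
            .insert((self.agent_id.clone(), key.to_string()), value.to_string());
    }

    /// Reads from the bound agent's scope only.
    fn get(&self, key: &str) -> Option<&str> {
        self.store
            .entries
            .get(&(self.agent_id.clone(), key.to_string()))
            .map(String::as_str)
    }
}
```

Because neither put nor get accepts an agent ID, a caller holding another agent's ID has nowhere to pass it.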

Validation at Step Boundaries

Before each step executes, the engine validates:

Target agent has its own metadata entry (owns a memory scope)
Source and target agents are distinct (prevents self-referential memory confusion)

Timeout Enforcement

Each step has a configurable timeout (1–3600 seconds, default 120):

The step execution is wrapped in tokio::time::timeout
On timeout: behavior depends on error_mode (Fail, Skip, or Retry)
FanOut steps each have their own individual timeouts running in parallel

Workflow Run Lifecycle

Active (non-terminal) states are Pending, Running, Suspended, and Escalated — all four count toward the drain/admission machinery. Escalated is distinct from Failed: it means a downstream write was stranded mid-flight (not cleanly failed and not silently completed) and needs operator attention or workflow-level resolution before it can be counted as either a billable value-event or a terminal failure.
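The active/terminal split can be sketched as a predicate on the run state. The four active variants come from the paragraph above; Completed as the name of the successful terminal state, and the helper names is_active and active_run_count, are assumptions.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RunState {
    Pending,
    Running,
    Suspended,
    Escalated,
    Completed, // assumed name for the successful terminal state
    Failed,
}

impl RunState {
    /// True for the non-terminal states that count toward drain/admission.
    fn is_active(self) -> bool {
        matches!(
            self,
            RunState::Pending | RunState::Running | RunState::Suspended | RunState::Escalated
        )
    }
}

/// Number of runs still counted as active.
fn active_run_count(runs: &[RunState]) -> usize {
    runs.iter().filter(|state| state.is_active()).count()
}
```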

Workflow `suspend` Means Drain, Not Pause (ADR-056)

Per the unified activation gate, Runnable::suspend on a Workflow is not an ungated instant halt — it stops admitting new runs and lets in-flight runs complete, auto-transitioning the workflow definition itself to Suspended once the active-run count reaches zero. There is no Terminated equivalent for a workflow; its live state is its run count, not an instance identity.

Run Retention

The engine retains up to 200 workflow runs. When that limit is exceeded, the oldest completed or failed runs are evicted first, ordered by started_at.
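A minimal sketch of that eviction policy, under stated assumptions: RunRecord and evict_oldest_finished are hypothetical, started_at is modeled as a plain integer timestamp, and active runs are never evicted, so the list can stay over the cap when nothing is evictable.

```rust
#[derive(Debug, Clone, PartialEq)]
struct RunRecord {
    id: u32,
    /// Hypothetical representation: seconds since some epoch.
    started_at: u64,
    /// True for completed or failed runs; false for active ones.
    finished: bool,
}

/// Removes the oldest finished runs until at most `max_runs` remain
/// or no finished run is left to evict.
fn evict_oldest_finished(runs: &mut Vec<RunRecord>, max_runs: usize) {
    while runs.len() > max_runs {
        let oldest = runs
            .iter()
            .enumerate()
            .filter(|(_, run)| run.finished)
            .min_by_key(|(_, run)| run.started_at)
            .map(|(idx, _)| idx);
        match oldest {
            Some(idx) => {
                runs.remove(idx);
            }
            // Only active runs remain; leave them in place.
            None => break,
        }
    }
}
```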

Agent Resolution

Steps reference agents by ID or name:

{"ById": "550e8400-e29b-41d4-a716-446655440000"}
{"ByName": "claims-intake-agent"}

Resolution happens at run start:

ById — validates the UUID exists in the agent registry
ByName — scans the registry for the first agent matching the name

If an agent is not found, the workflow fails before executing any steps.
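Resolution as described above can be sketched as a lookup pass that runs before any step executes. The types are simplified stand-ins: AgentRef mirrors the JSON shapes shown earlier, while AgentEntry, resolve_agent, and resolve_all are hypothetical, and IDs are kept as strings rather than parsed UUIDs.

```rust
#[derive(Debug, Clone, PartialEq)]
enum AgentRef {
    ById(String),
    ByName(String),
}

#[derive(Debug, Clone, PartialEq)]
struct AgentEntry {
    id: String,
    name: String,
}

/// Resolves one reference against the registry.
/// ByName returns the first entry whose name matches.
fn resolve_agent<'a>(registry: &'a [AgentEntry], agent: &AgentRef) -> Option<&'a AgentEntry> {
    match agent {
        AgentRef::ById(id) => registry.iter().find(|entry| &entry.id == id),
        AgentRef::ByName(name) => registry.iter().find(|entry| &entry.name == name),
    }
}

/// Resolves every step's agent up front; any miss fails the whole run
/// before a single step has executed.
fn resolve_all<'a>(
    registry: &'a [AgentEntry],
    steps: &[AgentRef],
) -> Result<Vec<&'a AgentEntry>, String> {
    steps
        .iter()
        .map(|step| {
            resolve_agent(registry, step).ok_or_else(|| format!("agent not found: {:?}", step))
        })
        .collect()
}
```

Because ByName takes the first match, two agents sharing a name resolve to whichever appears first in the registry, so ById is the unambiguous choice when names can collide.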

Invocation Source Metadata

Each workflow step is invoked through the unified invocation layer with InvocationSource::Workflow metadata:

InvocationSource::Workflow {
    workflow_id: "a1b2c3d4-...",
    step_index: 2,
    upstream_agent_id: Some(AgentId("550e8400-..."))
}

This metadata flows through rate limiting, telemetry, and audit — each step invocation is separately tracked.

Related:

../architecture/data-flow.md — the mediated egress + billing sequence in full
pii-data-protection.md, provider-routing.md, invocation-model.md
docs/decisions/042, 043, 046, 047, 049, 056