Workflow Error Handling

What you'll accomplish: Configure error modes (Fail, Skip, Retry) and timeout behavior for workflow steps.

Error modes

Each step declares an error_mode that controls behavior on failure:

Fail (default)

The step must succeed. On failure or timeout, the entire workflow is marked Failed.

{
  "name": "critical-step",
  "error_mode": "Fail",
  "agent": {"ByName": "processor"},
  "prompt_template": "Process: {{input}}"
}

Skip

On failure or timeout, the step is skipped and execution continues with the next step:

{
  "name": "optional-enrichment",
  "error_mode": "Skip",
  "agent": {"ByName": "enricher"},
  "prompt_template": "Enrich: {{input}}"
}

A warning is logged but the workflow proceeds.

Retry

The step is retried up to max_retries times. If it still fails after the last retry, the workflow fails:

{
  "name": "flaky-api-call",
  "error_mode": {"Retry": {"max_retries": 3}},
  "agent": {"ByName": "api-caller"},
  "prompt_template": "Call API: {{input}}"
}

Retries are immediate (no backoff delay). Each retry sends the same prompt to the same agent.
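The retry behavior described above can be modeled with a short Python sketch. This is an illustration only, not hoziron's implementation: run_with_retry and make_flaky are hypothetical names, and the sketch assumes that max_retries counts retries in addition to the initial attempt (so max_retries: 3 allows up to four attempts in total).

```python
from typing import Callable


def run_with_retry(attempt: Callable[[], str], max_retries: int) -> tuple[bool, int]:
    """Run `attempt` once, then retry immediately up to `max_retries` more times.

    Returns (succeeded, attempts_made). Assumption: the first attempt does not
    count against max_retries.
    """
    attempts = 0
    for _ in range(1 + max_retries):
        attempts += 1
        try:
            attempt()
            return True, attempts
        except Exception:
            # Immediate retry: no sleep and no backoff between attempts.
            continue
    return False, attempts


def make_flaky(failures_before_success: int) -> Callable[[], str]:
    """Hypothetical stand-in for a step that fails a fixed number of times."""
    calls = {"n": 0}

    def call() -> str:
        calls["n"] += 1
        if calls["n"] <= failures_before_success:
            raise RuntimeError("transient failure")
        return "ok"

    return call


if __name__ == "__main__":
    print(run_with_retry(make_flaky(2), max_retries=3))   # succeeds on attempt 3
    print(run_with_retry(make_flaky(10), max_retries=3))  # gives up after 4 attempts
```

Because there is no backoff, a dependency that is down for more than a moment will likely exhaust all retries almost at once. Retry is best suited to brief, transient failures.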

Timeout configuration

Each step has a configurable timeout, set with timeout_secs (1–3600 seconds):

{
  "name": "analysis",
  "timeout_secs": 300,
  "agent": {"ByName": "analyst"},
  "prompt_template": "Analyze: {{input}}"
}

The default step timeout is 120 seconds. Complex LLM operations may need 120–300 seconds.
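To check step definitions before submitting a workflow, a small pre-flight helper along these lines may be useful. It is a sketch under stated assumptions: effective_timeout is a hypothetical name, and the source does not say whether out-of-range values are rejected or clamped, so this version simply rejects them.

```python
DEFAULT_TIMEOUT_SECS = 120  # default stated in this guide
MIN_TIMEOUT_SECS = 1        # lower bound stated in this guide
MAX_TIMEOUT_SECS = 3600     # upper bound stated in this guide


def effective_timeout(step: dict) -> int:
    """Return the timeout a step would run with, or raise if it is out of range.

    Assumption: a step without timeout_secs uses the 120-second default.
    """
    timeout = step.get("timeout_secs", DEFAULT_TIMEOUT_SECS)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int):
        raise ValueError(f"timeout_secs must be an integer, got {timeout!r}")
    if not MIN_TIMEOUT_SECS <= timeout <= MAX_TIMEOUT_SECS:
        raise ValueError(
            f"timeout_secs must be between {MIN_TIMEOUT_SECS} and "
            f"{MAX_TIMEOUT_SECS}, got {timeout}"
        )
    return timeout


if __name__ == "__main__":
    print(effective_timeout({"name": "analysis", "timeout_secs": 300}))
    print(effective_timeout({"name": "lookup"}))
```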

On timeout, behavior follows the step's error_mode:

  • Fail → workflow fails immediately
  • Skip → step skipped, next step runs
  • Retry → counts as a failure, retried if attempts remain
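The mapping above can be written as a small decision function. This Python sketch is a model of the documented behavior, not the engine's code: next_action and the returned action strings are hypothetical, and it assumes max_retries counts retries on top of the first attempt.

```python
def next_action(step: dict, failed_attempts: int) -> str:
    """Decide what happens after a step attempt fails or times out.

    `failed_attempts` is the number of attempts that have failed so far,
    including the one that just failed. The error_mode shapes mirror the
    JSON used in this guide: "Fail", "Skip", or {"Retry": {"max_retries": n}}.
    """
    mode = step.get("error_mode", "Fail")  # Fail is the documented default
    if mode == "Fail":
        return "fail_workflow"
    if mode == "Skip":
        return "skip_step"
    if isinstance(mode, dict) and "Retry" in mode:
        max_retries = mode["Retry"]["max_retries"]
        retries_used = failed_attempts - 1  # the first attempt is not a retry
        return "retry" if retries_used < max_retries else "fail_workflow"
    raise ValueError(f"unknown error_mode: {mode!r}")


if __name__ == "__main__":
    retry_step = {"error_mode": {"Retry": {"max_retries": 3}}}
    for failed in range(1, 5):
        print(failed, next_action(retry_step, failed))
```

A timeout and an agent error take the same path through this function, which matches the statement that a timeout under Retry counts as an ordinary failure.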

FanOut timeout behavior

In a FanOut step, each parallel step has its own timeout. If any parallel step still fails after its error_mode has been applied, the entire workflow fails.
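The per-branch timeout rule can be illustrated with asyncio. This is a simplified model, not hoziron's scheduler: run_branch and run_fanout are hypothetical names, every branch is treated as if it used the Fail mode, and the "Completed" status label is an assumption (this guide only names Failed). Whether the remaining branches are cancelled after the first failure is not stated in the source, so the sketch lets them all finish.

```python
import asyncio


async def run_branch(name: str, work_secs: float, timeout_secs: float) -> str:
    """Run one parallel branch under its own timeout; return 'ok' or 'timeout'."""
    try:
        # asyncio.sleep stands in for the real agent call.
        await asyncio.wait_for(asyncio.sleep(work_secs), timeout=timeout_secs)
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"


async def run_fanout(
    branches: list[tuple[str, float, float]],
) -> tuple[str, dict[str, str]]:
    """Run all branches concurrently; the fan-out fails if any branch fails."""
    results = await asyncio.gather(*(run_branch(*b) for b in branches))
    per_branch = {b[0]: r for b, r in zip(branches, results)}
    status = "Completed" if all(r == "ok" for r in results) else "Failed"
    return status, per_branch


if __name__ == "__main__":
    # (name, simulated work duration, that branch's own timeout)
    demo = [("fast", 0.01, 0.5), ("slow", 0.5, 0.05)]
    print(asyncio.run(run_fanout(demo)))
```

In the demo, the slow branch exceeds its own 0.05-second limit while the fast branch finishes well within its 0.5-second limit, and the single timeout is enough to fail the whole fan-out.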

Diagnosing failures

# Check workflow run status and which step failed
hoziron workflow status <run-id>

The status shows:

  • Which step failed
  • The error message
  • Whether the failure was a timeout or an agent error
  • Retry attempt count (for Retry mode)

Best practices

  • Use Fail for critical steps where partial results are useless
  • Use Skip for enrichment or optional processing
  • Use Retry for steps that call external APIs (transient failures)
  • Set generous timeouts for complex analysis steps (300s+)
  • Keep timeouts tight for simple lookups (30–60s)
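When choosing timeouts for Retry steps, it can help to estimate the worst-case time a step may occupy. The sketch below assumes that each attempt gets the full timeout_secs and that max_retries counts retries beyond the first attempt. Neither is confirmed by this guide, and worst_case_secs is a hypothetical helper. Since retries are immediate, no backoff time is added.

```python
def worst_case_secs(step: dict, default_timeout: int = 120) -> int:
    """Upper bound on a step's wall-clock time if every attempt times out."""
    timeout = step.get("timeout_secs", default_timeout)
    mode = step.get("error_mode", "Fail")
    attempts = 1
    if isinstance(mode, dict) and "Retry" in mode:
        attempts += mode["Retry"]["max_retries"]
    return timeout * attempts


if __name__ == "__main__":
    lookup = {
        "name": "flaky-api-call",
        "timeout_secs": 30,
        "error_mode": {"Retry": {"max_retries": 3}},
    }
    print(worst_case_secs(lookup))  # 30 s x 4 attempts
```

Under these assumptions, a 30-second lookup with three retries can hold the workflow for up to two minutes, which is one reason to keep timeouts tight on retried steps.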
