Workflow Error Handling
What you'll accomplish: Configure error modes (Fail, Skip, Retry) and timeout behavior for workflow steps.
Error modes
Each step declares an error_mode that controls behavior on failure:
Fail (default)
The step must succeed. On failure or timeout, the entire workflow is marked Failed.
{
"name": "critical-step",
"error_mode": "Fail",
"agent": {"ByName": "processor"},
"prompt_template": "Process: {{input}}"
}
Skip
On failure or timeout, the step is skipped and execution continues with the next step:
{
"name": "optional-enrichment",
"error_mode": "Skip",
"agent": {"ByName": "enricher"},
"prompt_template": "Enrich: {{input}}"
}
A warning is logged but the workflow proceeds.
Retry
The step is retried up to max_retries times before the workflow fails:
{
"name": "flaky-api-call",
"error_mode": {"Retry": {"max_retries": 3}},
"agent": {"ByName": "api-caller"},
"prompt_template": "Call API: {{input}}"
}
Retries are immediate (no backoff delay). Each retry sends the same prompt to the same agent.
Timeout configuration
Each step has a configurable timeout (1–3600 seconds):
{
"name": "analysis",
"timeout_secs": 300,
"agent": {"ByName": "analyst"},
"prompt_template": "Analyze: {{input}}"
}
Default step timeout is 120 seconds. Complex LLM operations may need 120–300s.
On timeout, behavior follows the step's error_mode:
- Fail → workflow fails immediately
- Skip → step skipped, next step runs
- Retry → counts as a failure, retried if attempts remain
FanOut timeout behavior
In FanOut steps, each parallel step has its own individual timeout. If any FanOut step fails (after error_mode processing), the entire workflow fails.
Diagnosing failures
# Check workflow run status and which step failed
hoziron workflow status <run-id>
The status shows:
- Which step failed
- The error message
- Whether timeout or agent error
- Retry attempt count (for Retry mode)
Best practices
- Use Fail for critical steps where partial results are useless
- Use Skip for enrichment or optional processing
- Use Retry for steps that call external APIs (transient failures)
- Set generous timeouts for complex analysis steps (300s+)
- Keep timeouts tight for simple lookups (30–60s)
Next steps
Related: