PII & Data Protection
How Hoziron protects personally identifiable information — detection, tokenization, memory isolation, audit trails, and data destruction.
Data Protection Architecture
PII Pipeline
Each agent has its own PII pipeline instance composed of three traits:
Detection
PiiDetector.detect(input, trust_policy) → Vec<PiiSpan>
Scans input text for PII spans and returns their byte offsets and categories:
[
{"start": 45, "end": 56, "category": "ssn"},
{"start": 120, "end": 135, "category": "policy_number"},
{"start": 200, "end": 215, "category": "full_name"}
]
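A minimal sketch of a detector that produces spans in this shape, assuming simple regular-expression rules. The `PATTERNS` table, the `PiiSpan` dataclass layout, and the `detect` signature are illustrative assumptions, not Hoziron's actual detector (which, per the status note further down, is not yet active).

```python
import re
from dataclasses import dataclass

# Hypothetical rules for illustration; a real detector would need far more robust patterns.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

@dataclass(frozen=True)
class PiiSpan:
    start: int     # UTF-8 byte offset of the first byte of the match
    end: int       # UTF-8 byte offset one past the last byte of the match
    category: str

def detect(text: str, enabled_detectors: list[str]) -> list[PiiSpan]:
    """Return a span for every match of every enabled category, sorted by start offset."""
    spans = []
    for category in enabled_detectors:
        pattern = PATTERNS.get(category)
        if pattern is None:
            continue  # this sketch has no rule for the category
        for match in pattern.finditer(text):
            # Python reports character offsets; convert them to byte offsets.
            start = len(text[:match.start()].encode("utf-8"))
            end = start + len(match.group().encode("utf-8"))
            spans.append(PiiSpan(start, end, category))
    return sorted(spans, key=lambda span: span.start)
```

Converting to byte offsets matters once the input contains non-ASCII characters, because character and byte positions then diverge.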
Tokenization
PiiTokenizer.tokenize(input, spans, trust_policy) → (tokenized_text, token_map)
Replaces detected PII with opaque tokens:
- Input: "Customer John Smith (SSN: 123-45-6789) filed claim"
- Output: "Customer [PII_NAME_001] (SSN: [PII_SSN_001]) filed claim"
- TokenMap: {"PII_NAME_001": "John Smith", "PII_SSN_001": "123-45-6789"}
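The replacement step could look roughly like the sketch below. It assumes spans arrive as dictionaries with UTF-8 byte offsets and do not overlap; the `TOKEN_LABELS` mapping (for example `full_name` to `NAME`) is inferred from the example above and is an assumption, as is the per-category counter scheme.

```python
from collections import defaultdict

# Assumed mapping from detector category to the label used inside tokens.
TOKEN_LABELS = {"full_name": "NAME", "ssn": "SSN"}

def tokenize(text: str, spans: list[dict]) -> tuple[str, dict[str, str]]:
    """Replace each span with an opaque token; return (tokenized_text, token_map)."""
    data = text.encode("utf-8")          # spans are byte offsets, so slice bytes
    counters = defaultdict(int)          # one running counter per label
    token_map = {}
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        label = TOKEN_LABELS.get(span["category"], span["category"].upper())
        counters[label] += 1
        token = f"PII_{label}_{counters[label]:03d}"
        pieces.append(data[cursor:span["start"]])        # untouched text before the span
        pieces.append(f"[{token}]".encode("utf-8"))      # opaque placeholder
        token_map[token] = data[span["start"]:span["end"]].decode("utf-8")
        cursor = span["end"]
    pieces.append(data[cursor:])                         # trailing text after the last span
    return b"".join(pieces).decode("utf-8"), token_map
```

Processing spans in ascending order while tracking a cursor avoids the offset drift that in-place string replacement would cause.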
Hydration
PiiHydrator.hydrate(tokenized_text, token_map, trust_policy) → original_text
Restores original PII values from tokens. Used only for trusted destinations.
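As a rough illustration of hydration gated on the destination, the sketch below restores values only when the destination appears in a trusted list. The `hydrate` signature, the bracketed token syntax matched by `TOKEN_RE`, and the choice to leave unknown tokens untouched are all assumptions made for this example.

```python
import re

# Matches bracketed tokens such as [PII_SSN_001]; the exact syntax is assumed.
TOKEN_RE = re.compile(r"\[(PII_[A-Z_]+_\d+)\]")

def hydrate(tokenized_text: str, token_map: dict[str, str],
            destination: str, trusted_destinations: list[str]) -> str:
    """Restore original values for trusted destinations; otherwise return the text as-is."""
    if destination not in trusted_destinations:
        return tokenized_text  # untrusted destinations only ever see tokens
    # A single regex pass means restored values are never re-scanned for tokens.
    return TOKEN_RE.sub(
        lambda m: token_map.get(m.group(1), m.group(0)),  # unknown tokens stay in place
        tokenized_text,
    )
```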
Pipeline Composition
Trust Policies
Each agent's PII behavior is governed by a TrustPolicy:
{
"enabled": true,
"detectors": ["ssn", "email", "full_name", "policy_number", "phone"],
"trusted_destinations": ["claims-system-api", "internal-database"]
}
| Field | Purpose |
|---|---|
| enabled | Master switch for PII processing on this agent |
| detectors | Which PII categories to scan for |
| trusted_destinations | Destinations that receive hydrated (real) data |
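One way to model this policy in code is sketched below. The field names come from the table; the `TrustPolicy` class layout, the helper methods, and the fail-closed defaults (disabled, no detectors, no trusted destinations) are assumptions for illustration.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class TrustPolicy:
    enabled: bool = False                              # assumed default: processing off
    detectors: tuple[str, ...] = ()                    # categories to scan for
    trusted_destinations: frozenset[str] = frozenset() # destinations that get real data

    @classmethod
    def from_json(cls, raw: str) -> "TrustPolicy":
        """Build a policy from a JSON document shaped like the example above."""
        doc = json.loads(raw)
        return cls(
            enabled=bool(doc.get("enabled", False)),
            detectors=tuple(doc.get("detectors", [])),
            trusted_destinations=frozenset(doc.get("trusted_destinations", [])),
        )

    def scans_for(self, category: str) -> bool:
        """A category is scanned only when the master switch is on and it is listed."""
        return self.enabled and category in self.detectors

    def is_trusted(self, destination: str) -> bool:
        """Only listed destinations may receive hydrated data."""
        return destination in self.trusted_destinations
```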
Current Implementation Status
Phase 1 (current): No-op pipeline — data passes through unchanged. The architecture and trait boundaries are in place, but detection is not yet active.
Phase 2 (planned): Real PII detection engine with configurable rules per industry vertical (insurance-specific patterns: policy numbers, claim numbers, NPI, etc.).
PII at Workflow Boundaries
Data flowing between agents in a workflow passes through the target agent's PII pipeline.
This ensures:
- Each agent only sees PII appropriate for its trust level
- An agent processing claims data doesn't expose raw SSNs to a downstream reporting agent
- The token map is held by the engine (not the agents) — agents cannot reverse-tokenize
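The custody rule in the last point can be sketched as follows: the engine tokenizes a payload at the hand-off, keeps the token map in its own private state, and passes only the tokenized string on. `WorkflowEngine`, `hand_off`, the hand-off ID, and the single SSN rule are hypothetical names and simplifications, not Hoziron's API.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # single illustrative rule

class WorkflowEngine:
    def __init__(self):
        # hand-off ID -> {token: original value}; never handed to agents
        self._token_maps = {}

    def hand_off(self, handoff_id: str, payload: str) -> str:
        """Tokenize a payload on its way to the target agent and keep the map privately."""
        token_map = {}

        def replace(match):
            token = f"PII_SSN_{len(token_map) + 1:03d}"
            token_map[token] = match.group(0)
            return f"[{token}]"

        tokenized = SSN_RE.sub(replace, payload)
        self._token_maps[handoff_id] = token_map
        return tokenized  # the only value that reaches the target agent
```

Because the agent receives a plain string and no reference to the map, it has nothing to reverse the tokens with.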
Memory Isolation
Per-Agent Scoping
Every memory operation validates the caller's identity.
Cross-Agent Access Denied
The ScopedMemory wrapper is the enforcement point. Even if an agent somehow obtains another agent's ID (from a workflow context, a tool result, etc.), the memory layer denies access:
{
"error": {
"category": "MemoryViolation",
"message": "Cross-agent memory access denied",
"details": {"caller": "agent-a-id", "scope_owner": "agent-b-id"}
}
}
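A minimal sketch of such an enforcement wrapper is shown below. Only the `ScopedMemory` name and the error shape come from this page; the constructor, the `get`/`set` methods, and the dictionary-backed store are assumptions for illustration.

```python
class MemoryViolation(Exception):
    """Raised when a caller touches a scope it does not own."""

    def __init__(self, caller: str, scope_owner: str):
        super().__init__("Cross-agent memory access denied")
        self.caller = caller
        self.scope_owner = scope_owner

    def to_error(self) -> dict:
        """Render the error in the JSON shape shown above."""
        return {
            "error": {
                "category": "MemoryViolation",
                "message": str(self),
                "details": {"caller": self.caller, "scope_owner": self.scope_owner},
            }
        }

class ScopedMemory:
    def __init__(self, owner_id: str, store: dict):
        self._owner_id = owner_id
        self._store = store  # shared backing store keyed by (agent_id, key)

    def _check(self, caller_id: str) -> None:
        # Knowing another agent's ID is not enough: the caller must BE the owner.
        if caller_id != self._owner_id:
            raise MemoryViolation(caller_id, self._owner_id)

    def get(self, caller_id: str, key: str):
        self._check(caller_id)
        return self._store.get((self._owner_id, key))

    def set(self, caller_id: str, key: str, value) -> None:
        self._check(caller_id)
        self._store[(self._owner_id, key)] = value
```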
Agent Memory Destruction
When an agent is deleted (delete_agent), all its data is irrecoverably destroyed:
- Semantic memory fragments: recalled and forgotten in a loop until none remain
- KV scope: every key in the agent's namespace is deleted
- Knowledge graph entities: all entities sourced from this agent are queried and removed
- Session history: removed when the kernel removes the agent from its registry
This satisfies the 5-second destruction guarantee.
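The four destruction steps can be sketched against plain in-memory stand-ins. The data layouts here (a list of fragment dicts, an `"<agent_id>:"` key prefix for the KV namespace, a `source` field on graph entities) are assumptions chosen for the example, and the sketch does not model the timing guarantee.

```python
def destroy_agent_data(agent_id, fragments, kv, graph, sessions, batch_size=100):
    """Remove every trace of agent_id from the four stand-in stores, in place."""
    # 1. Semantic memory: recall a batch, forget it, repeat until nothing is recalled.
    while True:
        batch = [f for f in fragments if f["agent"] == agent_id][:batch_size]
        if not batch:
            break
        for fragment in batch:
            fragments.remove(fragment)

    # 2. KV scope: delete every key under the agent's (assumed) namespace prefix.
    prefix = f"{agent_id}:"
    for key in [k for k in kv if k.startswith(prefix)]:
        del kv[key]

    # 3. Knowledge graph: drop all entities sourced from this agent.
    graph[:] = [entity for entity in graph if entity["source"] != agent_id]

    # 4. Session history: goes away with the agent's registry entry.
    sessions.pop(agent_id, None)
```

Looping in batches rather than issuing one bulk delete mirrors the "until none remain" wording: the loop terminates only when a recall comes back empty.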
Audit Trail
Merkle Hash Chain
Every audit entry includes a SHA-256 hash that chains to the previous entry:
hash = SHA-256(prev_hash | timestamp | identity | role | action | target | result)
If any entry is modified after the fact, all subsequent hashes become invalid — making tampering detectable.
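A small sketch of the chaining rule, using Python's standard `hashlib`. The all-zero `GENESIS` value, the literal `|` separator between UTF-8 encoded fields, and the `append_entry` helper are assumptions; the page does not specify the exact byte-level encoding.

```python
import hashlib

GENESIS = "0" * 64  # assumed previous-hash value for the very first entry
FIELDS = ("timestamp", "identity", "role", "action", "target", "result")

def entry_hash(prev_hash: str, entry: dict) -> str:
    """SHA-256 over prev_hash and the six audited fields, joined with '|'."""
    material = "|".join([prev_hash] + [entry[name] for name in FIELDS])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

def append_entry(chain: list, entry: dict) -> None:
    """Append an entry whose hash commits to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({**entry, "hash": entry_hash(prev_hash, entry)})
```

One caveat of a plain separator: if a field can itself contain `|`, two different entries could serialize to the same string, so a production encoding would likely length-prefix or escape the fields.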
What Gets Audited
Every authenticated API request is recorded:
| Field | Content |
|---|---|
| timestamp | ISO 8601 time at which the action occurred |
| identity | Caller's key name or OIDC subject |
| role | Caller's role at time of request |
| action | HTTP method + path (e.g., POST /agents) |
| target | Resource acted upon (agent ID, key ID, etc.) |
| result | Outcome: success, forbidden, error |
| hash | SHA-256 chain hash |
Verification
The verify_chain() operation streams entries and recomputes each hash:
For each entry (oldest → newest):
    expected = SHA-256(prev_hash | entry.fields...)
    if expected != entry.hash: CHAIN BROKEN at entry N
    prev_hash = entry.hash
Verification is O(n) time, O(1) memory (streaming).
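The streaming property can be illustrated with a verifier that accepts any iterator and keeps only the previous hash in memory. As before, the `|`-joined encoding and the all-zero genesis value are assumptions, and this `verify_chain` signature (including the `anchor` parameter) is a sketch rather than the actual operation.

```python
import hashlib
from typing import Iterable, Optional

GENESIS = "0" * 64  # assumed starting value when verifying from the first entry
FIELDS = ("timestamp", "identity", "role", "action", "target", "result")

def entry_hash(prev_hash: str, entry: dict) -> str:
    material = "|".join([prev_hash] + [entry[name] for name in FIELDS])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

def verify_chain(entries: Iterable[dict], anchor: str = GENESIS) -> Optional[int]:
    """Return the index of the first broken entry, or None if the chain is intact.

    Entries are consumed one at a time (oldest first), so memory use stays
    constant no matter how long the chain is.
    """
    prev_hash = anchor
    for index, entry in enumerate(entries):
        if entry_hash(prev_hash, entry) != entry["hash"]:
            return index
        prev_hash = entry["hash"]
    return None
```

Passing a database cursor or generator as `entries` keeps the whole run streaming; the `anchor` argument lets verification start from a saved hash instead of genesis.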
Pruning with Integrity
When the number of entries exceeds max_entries (default 100,000), the oldest entries are pruned. To maintain chain verifiability:
- The hash of the last-pruned entry is saved as a "checkpoint" in metadata
- Future verification anchors from this checkpoint instead of genesis
- Entries before the checkpoint are gone but the chain after it remains verifiable
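The checkpoint rule above amounts to remembering one hash per prune. A minimal sketch, assuming entries are held oldest-first in a list and each carries a `hash` field; the `prune` function and its return shape are hypothetical.

```python
def prune(entries: list, max_entries: int, checkpoint=None):
    """Drop the oldest entries beyond max_entries.

    Returns (kept_entries, checkpoint), where checkpoint is the hash of the
    last pruned entry. Verification of the kept entries starts from that hash.
    """
    excess = len(entries) - max_entries
    if excess <= 0:
        return entries, checkpoint  # nothing to prune; keep the existing anchor
    # entries[excess - 1] is the newest entry being discarded.
    return entries[excess:], entries[excess - 1]["hash"]
```

The first kept entry was hashed against the last pruned entry's hash, which is why that single value is enough to keep the remainder verifiable.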
Configuration
[audit]
enabled = true
max_entries = 100000
Storage
- SQLite with WAL mode (concurrent reads without blocking writes)
- Indexed on timestamp and identity for efficient queries
- Audit writes are non-blocking: they are spawned as background tasks to avoid adding latency to API responses
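For orientation, the storage setup could be reproduced with Python's built-in `sqlite3` module roughly as follows. The table name `audit_entries`, the index names, and the column types are assumptions; only WAL mode and the two indexed columns come from the list above.

```python
import sqlite3

def open_audit_db(path: str) -> sqlite3.Connection:
    """Open (or create) an audit database in WAL mode with the two query indexes."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer is appending.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS audit_entries (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            timestamp TEXT NOT NULL,
            identity  TEXT NOT NULL,
            role      TEXT NOT NULL,
            action    TEXT NOT NULL,
            target    TEXT NOT NULL,
            result    TEXT NOT NULL,
            hash      TEXT NOT NULL
        )
        """
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_audit_timestamp ON audit_entries(timestamp)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_audit_identity ON audit_entries(identity)"
    )
    conn.commit()
    return conn
```

WAL mode applies to file-backed databases; an in-memory SQLite database reports a different journal mode.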
Secret Handling Guarantees
| Guarantee | How It's Enforced |
|---|---|
| API keys never stored in plaintext | Only argon2id hashes in the database |
| Provider keys never in config files | api_key_env references env var name, not value |
| Keys never in error messages | Error reports env var name only |
| Keys never in logs | No logging of resolved key values |
| Key store file permissions | 0600 on Unix (owner read/write only) |
| Key shown only once | secret field returned only at creation time |
| Constant-time validation | All keys scanned regardless of match position |
| Memory wiped on delete | All agent data destroyed across all layers |
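The constant-time validation row can be illustrated with a scan that never exits early. This sketch uses SHA-256 purely as a standard-library stand-in for argon2id, which is not in Python's stdlib; the `digest` and `find_key` helpers and the name-to-digest store are assumptions.

```python
import hashlib
import hmac
from typing import Optional

def digest(secret: str) -> bytes:
    # Stand-in only: the table above specifies argon2id for stored keys.
    return hashlib.sha256(secret.encode("utf-8")).digest()

def find_key(presented: str, stored: dict[str, bytes]) -> Optional[str]:
    """Return the name of the matching key, scanning every stored digest."""
    candidate = digest(presented)
    matched = None
    for name, stored_digest in stored.items():
        # compare_digest takes the same time whether or not the bytes match,
        # and there is no early break, so match position does not leak via timing.
        if hmac.compare_digest(candidate, stored_digest):
            matched = name
    return matched
```

A lookup that returned on the first match would respond faster for keys stored earlier, which is the signal this pattern is meant to remove.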
Data Sovereignty
For regulated industries (insurance, healthcare, finance):
| Control | Mechanism |
|---|---|
| Data stays on-premise | Self-hosted deployment, local models (Ollama/vLLM) |
| No data sent to cloud | Air-gapped mode with enabled = false on cloud providers |
| Per-agent PII policy | Trust policies control what each agent can see |
| Cross-agent isolation | ScopedMemory prevents any data leakage between agents |
| Audit trail | Complete, tamper-evident record of all API operations |
| Data deletion | Irrecoverable destruction on agent delete |
| Network control | IP allowlist + TLS + CORS restrict who connects |