PII & Data Protection

How Hoziron protects personally identifiable information — detection, tokenization, the single operator-owned policy authority, the two structurally-enforced egress seams, memory isolation, the unified audit chain, and data destruction. Grounded in crates/platform/hoziron-core/src/pii/, crates/platform/hoziron-core/src/mediation/, crates/platform/hoziron-core/src/audit/, and ADR-009, ADR-027, ADR-042, ADR-047, ADR-049, ADR-039.

This is a fully implemented, actively-enforced pipeline (detect → tokenize → hydrate, PR #195 onward) — not a placeholder. Any claim that PII processing is a Phase-1 no-op reflects a stale, pre-ADR-049 state of the codebase.

Data Protection Architecture

PII Pipeline

The pipeline has three components (crates/platform/hoziron-core/src/pii/detector.rs, tokenizer.rs, hydrator.rs). Each is purely mechanical: trust and policy decisions live entirely in the orchestration layer around them, never inside the components themselves.

Detection

PiiDetector.detect(input) -> Vec<PiiSpan>

Detection is regex-first (regex-lite, sub-millisecond, no backtracking). Secondary checksum validators (Luhn, IBAN mod-97, Modulus-11, German ID check digit, SA ID date+Luhn) run only on regex matches, filtering out false positives without the latency cost of an ML-based NER model. Detection returns a list of spans, for example:

[
  {"start": 45, "end": 56, "category": "sa_id_number"},
  {"start": 120, "end": 135, "category": "policy_number"},
  {"start": 200, "end": 215, "category": "za_phone"}
]
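As an illustration of the checksum tier, here is a minimal sketch of a Luhn validator applied as a post-filter to regex candidates. It is not the code in detector.rs: the function names and the (start, end) candidate tuples are hypothetical, and it assumes candidates are byte ranges into the input.

```rust
/// Returns true if `candidate` consists only of ASCII digits and passes the Luhn check.
/// Empty or non-digit input is rejected rather than skipped.
fn luhn_valid(candidate: &str) -> bool {
    if candidate.is_empty() || !candidate.bytes().all(|b| b.is_ascii_digit()) {
        return false;
    }
    let sum: u32 = candidate
        .bytes()
        .rev()
        .enumerate()
        .map(|(i, b)| {
            let digit = u32::from(b - b'0');
            if i % 2 == 1 {
                // Every second digit from the right is doubled; a two-digit
                // result is folded back to one digit by subtracting 9.
                let doubled = digit * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                digit
            }
        })
        .sum();
    sum % 10 == 0
}

/// Hypothetical post-filter: keep only the regex candidates (byte ranges)
/// whose text passes the checksum.
fn confirm_candidates<'a>(input: &'a str, candidates: &[(usize, usize)]) -> Vec<&'a str> {
    candidates
        .iter()
        .filter_map(|&(start, end)| input.get(start..end))
        .filter(|text| luhn_valid(text))
        .collect()
}
```

Because the validator only ever sees text the regex already matched, its cost is proportional to the number of candidates, not to the length of the input.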

Optional NER Tier: Person Names & Addresses (ADR-078)

Regex has no lexical form to match against for a person's name or a free-form incident address, so a second, optional tier sits behind the same PiiDetector trait: an in-process encoder span model (GLiNER over ONNX Runtime — no Python sidecar, no network call at inference time) detecting person_name and a free-form address category. The regex + checksum tier above is retained in full and always takes precedence where both tiers detect the same span.

This tier ships as a model artifact separate from hoziron-server itself — never baked into the binary, never downloaded at runtime, under any configuration. A deployment that never places the artifact simply runs the regex-only tier — nothing crashes, but this is never a silent gap:

Every boot without the artifact logs an unmissable WARN stating plainly that person-name and address detection are not active, and how to install it. This fires unconditionally — regardless of whether your carrier-pii-policy.toml references person_name at all, or even exists.
The same state is on /health, under pii_ner_tier (active, resolved tokenizer_path/model_path, and — once active — the SHA-256 of each loaded file), so it's checkable without log access.
If your carrier policy names person_name (or any other NER-only category) while the artifact is absent, the server refuses to boot — a policy claiming protection the running server cannot actually provide is a configuration error, not a warning.

Tuning false positives (ADR-078 R6). The tier's raw confidence scores pass through a static, config-declared shaping table — per-category thresholds, trust multipliers, and an outright drop list — before a span reaches the pipeline. This is how a class of error (e.g. a role noun like "assessor" mistagged as a person) gets tuned out on a config reload, with no retraining and no rebuild. See the PII Engine guide for the [pii_policy.ner_scoring] schema.

Install (one command)

The artifact — tokenizer.json and model.onnx — is published at dl.hoziron.com under models/gliner-multi-pii/<version>/, both as loose files and as a single tarball, gliner-multi-pii-v1.tar.gz, whose extraction lands both files at the exact names the server expects — no manual renaming, no risk of a model.onnx.1 half-install.

Default — verify, then extract (two steps, needs the full 1.1 GiB file on disk first):

DEST="${HOZIRON_HOME:-$HOME/.hoziron}/models/ner"
mkdir -p "$DEST"
curl -sSL "https://dl.hoziron.com/models/gliner-multi-pii/v1/gliner-multi-pii-v1.tar.gz" -o /tmp/gliner-multi-pii-v1.tar.gz
curl -sSL "https://dl.hoziron.com/models/gliner-multi-pii/v1/gliner-multi-pii-v1.tar.gz.sha256" -o /tmp/gliner-multi-pii-v1.tar.gz.sha256
(cd /tmp && sha256sum -c gliner-multi-pii-v1.tar.gz.sha256)
tar -xzf /tmp/gliner-multi-pii-v1.tar.gz -C "$DEST"

Fast path — stream, no intermediate file. Skips the 1.1 GiB temp file, but cannot verify the checksum before extracting — only use this over a transport you already trust (e.g. an internal mirror you control):

mkdir -p "${HOZIRON_HOME:-$HOME/.hoziron}/models/ner"
curl -sSL "https://dl.hoziron.com/models/gliner-multi-pii/v1/gliner-multi-pii-v1.tar.gz" \
  | tar -xz -C "${HOZIRON_HOME:-$HOME/.hoziron}/models/ner"

The loose files (tokenizer.json, model.onnx, SHA256SUMS) stay published alongside the tarball too — useful for resuming a flaky download, re-fetching a single file, or inspecting either file without unpacking. Installing only one of the pair is a corrupt install, not an absent one, and is a hard boot error (never silently treated as "tier absent").

Where the files go

The motion is identical everywhere — extract the tarball into the right destination for how you're running the server:

Deployment	Destination
Binary	`$HOZIRON_HOME/models/ner/` (default `~/.hoziron/models/ner/`)
Docker	`/data/models/ner/` (the image sets `HOZIRON_HOME=/data`)

Docker / Compose: the image never bundles the model (see the Dockerfile NER comment for the bake-vs-mount rationale — image size, update cadence, and matching the existing /data/licence.json convention). Extract onto the named volume or bind mount before starting the container:

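# Assumes the verified tarball from the default install step is at /tmp/gliner-multi-pii-v1.tar.gz.
docker run --rm -v hoziron-data:/data \
  -v /tmp/gliner-multi-pii-v1.tar.gz:/tmp/gliner-multi-pii-v1.tar.gz:ro alpine sh -c \
  "mkdir -p /data/models/ner && tar -xzf /tmp/gliner-multi-pii-v1.tar.gz -C /data/models/ner"
# Alternatively, `docker cp` the already-extracted files into a container that mounts the same volume.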

There is no supported Kubernetes/Helm deployment path today (see the deployment guide); the in-tree charts/hoziron/ chart is not an officially supported install method. If you use it anyway, the same motion applies to the /data PVC it already provisions for licence.json and registry state — extract into models/ner/ on it before (or after) the pod starts, budgeting at least 1.5 GiB of headroom above your existing sizing (1.1 GiB for the model artifact itself, plus slack for the extraction step and any loose-file re-fetch).

Licensing. The GLiNER model weights (Apache-2.0) and the statically linked ONNX Runtime binary (MIT, Microsoft) each carry their own third-party attribution, documented in .notices/gliner-model/ and .notices/onnxruntime/ alongside the existing OpenFang attribution.

Tokenization

PiiTokenizer.tokenize(input, spans) -> (tokenized_text, token_map)

Replaces detected spans with opaque UUID-based tokens, format <<PII:{uuid_hex_32}>>:

Input: "Customer John Smith (SA ID: 8501015800081) filed claim"
Output: "Customer <<PII:a1b2c3d4...>> (SA ID: <<PII:e5f67890...>>) filed claim"

Token→value mappings live in a local, ephemeral SQLite vault ($HOZIRON_HOME/data/pii_vault.db, WAL mode). Each token carries a 24-hour TTL by default, and the TTL is enforced at lookup rather than only by a lazy purge, so an expired token cannot be resolved even if its row has not been deleted yet. There is no format-preserving encryption and no external vault dependency: the vault is per-deployment and stays on the same machine, which reinforces data sovereignty.
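The span replacement itself can be sketched in a few lines. This is an illustration, not the tokenizer.rs implementation: `Span` is a hypothetical stand-in for PiiSpan, the token map is an in-memory HashMap rather than the SQLite vault, and the identifier comes from a caller-supplied closure where a real implementation would use a UUID generator.

```rust
use std::collections::HashMap;

/// Half-open byte range of a detected span (hypothetical stand-in for PiiSpan).
struct Span {
    start: usize,
    end: usize,
}

/// Replaces each span with `<<PII:{id}>>` and returns the rewritten text plus
/// the token -> original-value map.
/// Assumes spans are sorted, non-overlapping, and fall on char boundaries.
fn tokenize(
    input: &str,
    spans: &[Span],
    mut next_id: impl FnMut() -> String,
) -> (String, HashMap<String, String>) {
    let mut out = String::with_capacity(input.len());
    let mut map = HashMap::new();
    let mut cursor = 0;
    for span in spans {
        // Copy the untouched text between the previous span and this one.
        out.push_str(&input[cursor..span.start]);
        let token = format!("<<PII:{}>>", next_id());
        map.insert(token.clone(), input[span.start..span.end].to_string());
        out.push_str(&token);
        cursor = span.end;
    }
    out.push_str(&input[cursor..]);
    (out, map)
}
```

Keeping the identifier source outside the function makes the sketch deterministic under test; the opaque, unguessable quality of real tokens comes from the UUID, not from this loop.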

Hydration

PiiHydrator.hydrate(tokenized_text, token_map) -> original_text

The hydrator always restores tokens when invoked — it is trust-unaware by design. Whether it gets called at all is decided entirely by policy, one layer up. This separation is load-bearing: it is what let ADR-027 (deny-by-default hydration) and ADR-049 (unified policy authority) tighten the calling logic repeatedly without ever touching the hydrator itself.
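A minimal sketch of that separation, under the same assumptions as before (an in-memory token map, hypothetical function names): `hydrate` restores whatever it is given, and the decision to call it sits in a separate function that stands in for the policy layer.

```rust
use std::collections::HashMap;

/// Restores every `<<PII:...>>` token present in `map`.
/// Tokens missing from the map are left in place unchanged.
fn hydrate(tokenized: &str, map: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(tokenized.len());
    let mut rest = tokenized;
    while let Some(open) = rest.find("<<PII:") {
        out.push_str(&rest[..open]);
        match rest[open..].find(">>") {
            Some(close_rel) => {
                let end = open + close_rel + 2;
                let token = &rest[open..end];
                out.push_str(map.get(token).map(String::as_str).unwrap_or(token));
                rest = &rest[end..];
            }
            None => {
                // Unterminated marker: copy the remainder verbatim and stop.
                out.push_str(&rest[open..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

/// Hypothetical caller: the trust decision is made here, one layer above the hydrator.
fn respond(tokenized: &str, map: &HashMap<String, String>, caller_trusted: bool) -> String {
    if caller_trusted {
        hydrate(tokenized, map)
    } else {
        tokenized.to_string()
    }
}
```

Tightening the policy then means changing `respond` (or whatever plays its role), while `hydrate` stays untouched.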

The Policy Authority: One File, Two Dimensions (ADR-049)

There is no per-agent PII configuration. PII-egress policy is a single, operator-owned, file-sourced (carrier-pii-policy.toml) authority (ADR-049), reloaded on restart or file-watch — the same mechanism the carrier licence already uses. Agents have zero input into PII egress; there is no PUT /agents/{id}/trust-policy, set_agent_trust_policy, or clear_agent_trust_policy.

# carrier-pii-policy.toml — operator-owned, no runtime mutation API

[pii_policy.llm]
local_only = ["sa_id_number", "bank_account", "medical_record"]
# every other detected type -> cloud_permitted (tokenised)

[[pii_policy.tool_rules]]
destination = "claims-core-sor"          # exact, licence-designated MCP/carrier target
hydrate     = ["sa_id_number", "bank_account"]
reject      = ["medical_record"]
# everything else for this destination -> tokenised (implicit default)

[[pii_policy.tool_rules]]
destination = "notify-sms"
hydrate     = ["za_phone"]

Tokenization is the unconditional substrate — any PII bound for a cloud LLM or an unlisted tool destination is tokenized before egress, always. The file governs only the two escalations off that floor.
LLM dimension is binary per PII type: local_only (may not leave the box even tokenized — for SA ID numbers, bank accounts, medical records) or the default cloud_permitted.
Tool dimension is default-tokenize with explicit per-(destination, type) hydrate/reject escalations, evaluated independently per co-occurring type in one call (an sa_id_number can hydrate to claims-core-sor while a co-present email with no rule stays tokenized in the same call).
Hydrated PII to a cloud LLM is structurally prohibited — there is no hydrate verb in the LLM dimension at all, by design, not omission. A task that genuinely needs a raw value (e.g. matching an SA ID on a document) is a local task by definition; only the non-PII verdict travels.
Deny-by-default hydration to the API caller (ADR-027, generalized by ADR-049): a response is only hydrated back to the calling client if "api" is explicitly in the policy's trusted destinations. An empty/absent policy trusts nothing.
local_only is the sole force-local source; there is no rules[]-keyed evaluation path.
There is no data-class/sensitivity-label subsystem. DataClassBoundary/SensitivityLabel enforcement does not exist in any production path. Agent-level data clearance, if ever needed, would be operator-authored and PII-type-keyed.
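The two dimensions can be sketched as two independent lookups. The types below are a hypothetical in-memory form of the policy file, not the structs in hoziron-core, and checking reject before hydrate when a type appears in both lists is an assumption of this sketch.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq, Clone, Copy)]
enum LlmAction { LocalOnly, CloudPermitted }

#[derive(Debug, PartialEq, Clone, Copy)]
enum ToolAction { Tokenize, Hydrate, Reject }

/// Hypothetical in-memory form of one [[pii_policy.tool_rules]] entry.
#[derive(Default)]
struct ToolRule {
    hydrate: HashSet<String>,
    reject: HashSet<String>,
}

/// Hypothetical in-memory form of the parsed policy file.
#[derive(Default)]
struct PiiPolicy {
    local_only: HashSet<String>,
    tool_rules: HashMap<String, ToolRule>, // keyed by exact destination name
}

/// Small helper for building the sets above from string literals.
fn set(items: &[&str]) -> HashSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}

impl PiiPolicy {
    /// LLM dimension: binary per PII type. Note there is no hydrate variant at all.
    fn llm_action(&self, pii_type: &str) -> LlmAction {
        if self.local_only.contains(pii_type) {
            LlmAction::LocalOnly
        } else {
            LlmAction::CloudPermitted
        }
    }

    /// Tool dimension: tokenize unless an explicit (destination, type) rule escalates.
    fn tool_action(&self, destination: &str, pii_type: &str) -> ToolAction {
        match self.tool_rules.get(destination) {
            Some(rule) if rule.reject.contains(pii_type) => ToolAction::Reject,
            Some(rule) if rule.hydrate.contains(pii_type) => ToolAction::Hydrate,
            _ => ToolAction::Tokenize,
        }
    }
}
```

Calling `tool_action` once per detected type is what lets an sa_id_number hydrate to claims-core-sor while a co-present email stays tokenized in the same call.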

The Two Mediated Egress Seams

Two structurally-enforced chokepoints — not conventions, not "please remember to check this" — cover the two directions data can cross the process boundary.

Outbound: CoreOutboundMediator (ADR-042, generalized by ADR-047)

Every outbound write leaving the process toward a carrier system — MCP tools/call, direct REST, direct SOAP — crosses CoreOutboundMediator.mediate() before it leaves, regardless of transport. OutboundCall{tool_name, target_server, arguments} is transport-agnostic by construction, so the mediator never needs to special-case MCP vs. REST vs. SOAP.

A structural test ("C3") asserts that there is exactly one call site to the underlying transport (transport.execute(...) / conn.call_tool(...)) and that mediate() precedes it. This is enforced at build time, not by code review discipline. There is deliberately no unmediated_call_tool sitting next to the mediated one.
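One way to picture a single chokepoint is a mediator that owns its transport privately, so callers outside the module cannot reach the transport except through mediate(). The sketch below is an assumption-heavy illustration, not CoreOutboundMediator: the field types, the `Transport` trait, the `EchoTransport` stub, and the `rewrite` closure standing in for the policy step are all hypothetical.

```rust
mod egress {
    /// Field names come from the text; the field types are assumptions.
    pub struct OutboundCall {
        pub tool_name: String,
        pub target_server: String,
        pub arguments: String,
    }

    /// Hypothetical transport abstraction (MCP, REST, or SOAP would implement it).
    pub trait Transport {
        fn execute(&mut self, call: &OutboundCall) -> Result<String, String>;
    }

    /// Stub transport that echoes what it was asked to send.
    pub struct EchoTransport;

    impl Transport for EchoTransport {
        fn execute(&mut self, call: &OutboundCall) -> Result<String, String> {
            Ok(format!("{} <- {}", call.target_server, call.arguments))
        }
    }

    /// Holds the transport in a private field: code outside this module can
    /// only reach `execute` through `mediate`.
    pub struct Mediator<T: Transport> {
        transport: T,
    }

    impl<T: Transport> Mediator<T> {
        pub fn new(transport: T) -> Self {
            Self { transport }
        }

        /// Applies the policy step, then makes the one and only transport call.
        pub fn mediate(
            &mut self,
            mut call: OutboundCall,
            rewrite: impl Fn(&OutboundCall) -> Result<String, String>,
        ) -> Result<String, String> {
            call.arguments = rewrite(&call)?; // a rejection stops here, before egress
            self.transport.execute(&call)
        }
    }
}

use egress::{EchoTransport, Mediator, OutboundCall};
```

Module privacy gives the "no second door" property at compile time; a structural test like C3 additionally guards against someone adding a second call site inside the module itself.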

Inbound: Mediated Ingress Authority (ADR-061)

An agent turn may only be originated by hoziron-core, through a mediated ingress path, with a request type the kernel cannot construct on its own. The kernel's raw send_message* family is not a callable turn-origin from outside core; kernel-internal triggers (cron, workflow resumption) originate turns only through a core-authored dispatch closure (drive_run), never a direct kernel-internal send. This closes the one class of gap the outbound seam does not cover: content entering an agent's context from outside the agent domain without crossing the inbound PII/policy check at all. Interior traffic (agent→agent, already-tokenized) is not re-mediated — the trust boundary is the agent domain, not each hop within it.

PII at Workflow Step Boundaries

The token map is held by the engine, not the agents — an agent cannot reverse-tokenize data it did not itself receive real values for.

Memory Isolation

Structural, Not Runtime-Checked

Every memory operation is bound to a single agent's ID at construction: hoziron-memory's BoundMemoryHandle never exposes a method that accepts a free agent_id parameter, so there is no "caller" argument for a workflow, tool result, or another agent to spoof.

This is the one, sole memory access-control layer in production. There is no hoziron-core::ScopedMemory type — the admin memory REST API (/agents/{id}/memory*) is a set of non-functional stub endpoints, not a second access-control layer. BoundMemoryHandle — bound once per session by hoziron-runtime::agent_loop/tool_runner — is the entire cross-agent isolation guarantee.
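The binding pattern can be sketched as follows. This is not the hoziron-memory API: `BoundHandle`, `Store`, and the method names are hypothetical, and the backing store is a plain shared map. The point is only that the agent ID is fixed at construction and never appears as a method parameter.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Shared backing store keyed by (agent_id, key). Hypothetical stand-in for the real store.
type Store = Arc<Mutex<HashMap<(String, String), String>>>;

/// A handle bound to exactly one agent. No method takes an agent_id.
struct BoundHandle {
    agent_id: String,
    store: Store,
}

impl BoundHandle {
    /// The only place an agent ID is ever supplied.
    fn bind(agent_id: &str, store: Store) -> Self {
        Self { agent_id: agent_id.to_string(), store }
    }

    fn put(&self, key: &str, value: &str) {
        let mut guard = self.store.lock().unwrap();
        guard.insert((self.agent_id.clone(), key.to_string()), value.to_string());
    }

    fn get(&self, key: &str) -> Option<String> {
        let guard = self.store.lock().unwrap();
        let scoped_key = (self.agent_id.clone(), key.to_string());
        guard.get(&scoped_key).cloned()
    }
}
```

In a real crate the `agent_id` field would also be private to the defining module, so a holder of the handle cannot rebind it after construction.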

Agent Memory Destruction

When an agent is uninstalled, all its data is irrecoverably destroyed as part of the Installable::uninstall complete-teardown contract (ADR-044):

Semantic memory fragments — loop-forget all recalled fragments until empty
KV scope — delete all keys in the agent's namespace
Knowledge graph entities — query and remove all entities sourced from this agent (the knowledge graph gained explicit agent_id scoping alongside KV — it had none before)
Session history — removed via kernel registry removal

Unified Audit Chain (ADR-039)

Two event sources, one chain: the kernel emits policy-blind MechanicalEvents (dispatch, token usage, tool call, retry, circuit-breaker trip) through a core-provided AuditSink trait; core writes PolicyEvents (PII tokenization, routing decisions, entitlement enforcement, trust violations) directly, in a vocabulary the kernel structurally cannot name.

Hash Preimage (`hoziron-audit-chain-v2`)

entry_hash = sha256(
    "hoziron-audit-chain-v2"                              // domain-separation tag
    || prev_hash
    || seq (u64, 8 bytes LE, strictly monotonic, gap-free)
    || len_prefixed(timestamp) || len_prefixed(identity) || len_prefixed(role)
    || len_prefixed(action) || len_prefixed(target) || len_prefixed(result)
    || metadata_hash                                       // sha256 of canonical serialization
)

Every variable-length field is length-prefixed rather than delimiter-joined, so a byte occurring inside a field value (e.g. target) can never be mistaken for a field separator — distinct field tuples cannot hash identically. There is no migration path for a chain built under a different preimage encoding — such a chain fails verify_chain() from the point the encoding stops matching; a mismatch at the very first entry checked is the signature of an encoding mismatch, not tampering, and verify_chain's output carries that caveat.

metadata is inside the hash boundary (metadata_hash, canonical serialization) — the field carrying the most compliance-critical context (PII types detected, sensitivity labels, routing rationale) cannot be mutated without breaking the chain.

A seq gap (e.g. a deleted entry) breaks verification even if the remaining entries re-chain cleanly content-wise — this is what the monotonic seq in the hash buys over a content-only chain.
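To see why length-prefixing matters, the sketch below builds the preimage bytes two ways and compares them. It stops short of hashing (the standard library has no SHA-256) and leaves out metadata_hash; the u32 little-endian prefix width and all function names are assumptions of this sketch, not the documented wire format.

```rust
/// Appends a field as (u32 little-endian length, bytes). The prefix width is an assumption.
fn push_len_prefixed(buf: &mut Vec<u8>, field: &str) {
    buf.extend_from_slice(&(field.len() as u32).to_le_bytes());
    buf.extend_from_slice(field.as_bytes());
}

/// Builds the bytes that would be fed to sha256: tag, prev_hash, seq, then the fields.
fn preimage_len_prefixed(seq: u64, prev_hash: &[u8; 32], fields: &[&str]) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(b"hoziron-audit-chain-v2"); // domain-separation tag
    buf.extend_from_slice(prev_hash);
    buf.extend_from_slice(&seq.to_le_bytes()); // u64, 8 bytes LE
    for field in fields {
        push_len_prefixed(&mut buf, field);
    }
    buf
}

/// The encoding the chain deliberately avoids: fields joined by a delimiter.
fn preimage_delimited(fields: &[&str]) -> Vec<u8> {
    fields.join("|").into_bytes()
}

/// True if every seq is exactly one more than its predecessor.
fn seq_gap_free(seqs: &[u64]) -> bool {
    seqs.windows(2).all(|pair| pair[1] == pair[0] + 1)
}
```

With the delimiter-joined form, the field tuples ("a|b", "c") and ("a", "b|c") produce identical bytes and would therefore hash identically; with length prefixes they differ, and a change in seq alone also changes the preimage.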

Pruning With Integrity

When entries exceed max_entries, oldest entries are pruned. The hash of the last-pruned entry is saved as an explicit, recorded checkpoint — never a silent reset of prev_hash to genesis — and the verify API exposes the active anchor so pre/post-cutover segments can each be verified independently.

Secret Handling Guarantees

Guarantee	How It's Enforced
API keys never stored in plaintext	Only argon2id hashes in the key store
Provider keys never in config files	`api_key_env` references env var name, not value
Keys never in logs/errors/audit metadata	No logging of resolved key values; keeping keys out of audit metadata matters twice over now that metadata is covered by the chain hash
Key store / vault file permissions	0600 on Unix, from file creation, never a broader default narrowed later
Provider credentials during routing	`Zeroizing`-wrapped end-to-end, manual redacting `Debug` on `ResolvedModelTarget`
Constant-time key validation	All keys scanned regardless of match position
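The last row of the table can be illustrated with a scan that never exits early. This is a sketch under stated assumptions, not the key store's code: it assumes the presented key has already been reduced to a fixed-length digest, and the function names are hypothetical. Hand-written constant-time code can also be undone by compiler optimizations, so production code would normally rely on a vetted constant-time comparison library.

```rust
/// Compares two byte slices without stopping at the first mismatch.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y; // accumulate differences instead of branching on them
    }
    diff == 0
}

/// Scans every stored digest, even after a match, so the work done does not
/// depend on where (or whether) the matching key sits in the list.
fn find_key(stored: &[[u8; 32]], presented: &[u8; 32]) -> Option<usize> {
    let mut found = None;
    for (index, digest) in stored.iter().enumerate() {
        let hit = ct_eq(digest, presented);
        if hit && found.is_none() {
            found = Some(index);
        }
    }
    found
}
```

Returning only after the full scan means a caller cannot learn a key's position in the store from response time.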

Data Sovereignty

Control	Mechanism
Data stays on-premise	Self-hosted deployment; driver-declared sovereignty (`AlwaysLocal`/attested/`AlwaysCloud`) rather than an operator-typed IP heuristic — see provider-routing.md
No data sent to cloud for named types	`local_only` in the carrier PII policy — cannot leave the box even tokenized
One operator-owned PII authority	Carrier policy file, no agent input, no runtime mutation API (ADR-049)
Cross-agent isolation	`BoundMemoryHandle` — the sole, structural memory access-control layer
Mediated egress, any transport	`CoreOutboundMediator` — one chokepoint, structurally enforced (ADR-042/047)
Mediated ingress	Core-only turn origination (ADR-061)
Audit trail	Unified, tamper-evident hash chain covering both mechanical and policy events (ADR-039)
Data deletion	Irrecoverable destruction on agent uninstall, across memory, KV, graph, and session layers

Related:

../architecture/data-flow.md — this pipeline in the context of one full request
provider-routing.md — the LLM-dimension consumer of this policy
workflow-engine.md — billing riding the same outbound seam
docs/decisions/009-pii-engine.md, 027-deny-by-default-pii-hydration.md, 036-hoziron-openfang-boundary.md, 039-subsystem-deduplication.md, 042-outbound-tool-call-mediation-seam.md, 047-unified-outbound-egress.md, 049-unified-pii-policy-authority.md, 061-agent-turn-ingress-authority.md, 078-probabilistic-pii-detection.md