cloverleaf-larry

Bryan Johnson 58e6bf4e03 v0.7.3: automatic PHI detection (tiered detection + blacklist contexts)

Adds automatic PHI tokenization on two surfaces: user input and HL7-shaped
tool results. Supersedes Bryan's reverted `af2ffe8` prototype with a tiered
confidence model, explicit blacklist contexts, structured audit log, and
tool-result coverage.

Bryan's directive: "Err on the side of caution and tokenize anything you
think you may need to as long as it doesn't break the tools."

Priority order: (1) don't break tools (constraint), (2) catch all PHI (goal),
(3) minimize false positives (secondary).

Detection: four-tier model (first match wins per token):

  Tier 1  DEFINITE    SSN (with dashes), email, formatted phone, NPI with
                      explicit "NPI:" prefix. Always tokenize.
  Tier 2  CONTEXTUAL  Numeric value preceded by MRN/Patient/DOB/Account/
                      Visit/Acct/Record/Birth within 20 chars. Always.
  Tier 3  HL7-CTX     Plausibly-PHI-shaped values when line mentions
                      PID.3/5/7/11/13/18, NK1., GT1., IN1.16-20.
                      Aggressive; prompts in confirm mode.
  Tier 4  KNOWN       Value already exists in
                      $LARRY_HOME/sanitize/lookup.tsv.

Tier-4 scans the full set of categories actually present in the table (not a
hardcoded shortlist), so any category Bryan has used before is checked.

Blacklist contexts (NEVER tokenize, even on tier match):

  * Path-like (/, ./, ../, ~/, contains /)
  * HL7 field references like PID.18: the digit after the dot is a field
    index, not an MRN (spec verification scenario #5)
  * Version strings (vN.N.N, semver) and ISO dates (overridden by explicit
    DOB/Birth context so "DOB 1980-01-15" still tokenizes)
  * Port keywords (:NNNN, port NNNN, tcp/udp NNNN, LISTEN/PORT=)
  * Error/status codes (error NNN, code NNN, HTTP NNN, rc=N)
  * JSON key position (value followed by ": or :)
  * Fenced code blocks (``` ... ``` skipped via awk redactor)
  * Timestamps (epoch ms 13+ digits, epoch s 10 digits starting 1)

Tool-result surface, routed through hl7-sanitize.sh:

  * Eligible tools: read_file (.hl7/.HL7/.txt/.TXT only), nc_msgs,
    hl7_field, hl7_diff
  * Eligibility further gated by _auto_phi_looks_like_hl7 shape check
    (segment headers MSH/PID/EVN/PV1 with | delimiter)
  * Generic outputs (list_dir, grep_files, bash_exec, glob_files, ssh_exec,
    web search) NEVER scanned; spec is explicit about this
  * For HL7-shaped content we use the canonical field-aware pipeline rather
    than the prose detector, since segments are pipe-delimited and would
    otherwise be a single whitespace token. Both pipelines share lookup.tsv
    so tokens are stable across surfaces.

Behavior controls:

  * env LARRY_AUTO_PHI: 1/on (default), 0/off, confirm
  * /phi-auto on|off|confirm|status slash command
  * "!nophi " per-turn prefix override
  * Manual @@VALUE / {{phi:VALUE}} markers always win: preprocessed FIRST;
    auto-PHI fills gaps in things Bryan didn't manually mark.
  * After each pass, dim status line summarises:
      phi> auto-tokenized 3 value(s) [user_input]: MRN×1 EMAIL×1 SSN×1

Audit: JSONL log at $LARRY_HOME/log/auto-phi.log:

  { "ts": "...", "value": "...", "category": "...", "token": "...",
    "tier": "definite|contextual|hl7|known|hl7_pipeline",
    "surface": "user_input|tool_result", "context": "..." }

Mode 0600, parent dir 0700. Best-effort write; never fails the host call.

Library changes (lib/hl7-sanitize.sh):

  * normalize_value: re-add EMAIL + PHONE arms + new NPI arm. EMAIL and
    PHONE arms were originally in `af2ffe8` (reverted with v0.7.1), cited
    in the source comments.
  * normalize-value subcommand: exposes canonical normalization so auto-PHI
    can build per-session memory keys. Originally `af2ffe8`.
  * lookup-original subcommand: probes the table for an exact match without
    creating new tokens. Used by Tier-4 "already-known" detection.
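
The tier and blacklist rules above can be sketched for a single whitespace token as follows. This is an illustrative sketch only, not the code in this commit: the function name `classify_token`, its two arguments, and every regex are assumptions, and it covers only part of Tier 1 (SSN, email), Tier 2, and three blacklist contexts (paths, HL7 field references, version strings and ISO dates).

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names and regexes are assumptions, not the shipped code.
# classify_token TOKEN LEFT_CONTEXT
# Prints one of: skip (blacklisted), definite (Tier 1), contextual (Tier 2), none.
classify_token() {
  local tok=$1 left=$2
  local len=${#left} ctx=$left
  # Keep only the 20 characters immediately left of the token.
  if (( len > 20 )); then ctx=${left:$((len - 20))}; fi

  local kw_re='(MRN|Patient|DOB|Account|Visit|Acct|Record|Birth)'
  local dob_re='(DOB|Birth)'
  local ssn_re='^[0-9]{3}-[0-9]{2}-[0-9]{4}$'
  local email_re='^[^@[:space:]]+@[^@[:space:]]+\.[A-Za-z]{2,}$'
  local iso_re='^[0-9]{4}-[0-9]{2}-[0-9]{2}$'
  local ver_re='^v?[0-9]+\.[0-9]+\.[0-9]+$'
  local hl7ref_re='^[A-Z][A-Z0-9]{2}\.[0-9]+$'
  local num_re='^[0-9]+$'

  # Blacklist contexts are checked first: they win even over a tier match.
  case $tok in
    */*|'~'*) echo skip; return ;;             # path-like
  esac
  if [[ $tok =~ $hl7ref_re || $tok =~ $ver_re ]]; then
    echo skip; return                          # e.g. PID.18, v0.7.3
  fi
  if [[ $tok =~ $iso_re ]]; then
    # ISO dates are skipped unless an explicit DOB/Birth context overrides.
    if [[ $ctx =~ $dob_re ]]; then echo contextual; else echo skip; fi
    return
  fi

  # Tier 1: the shape alone is enough.
  if [[ $tok =~ $ssn_re || $tok =~ $email_re ]]; then
    echo definite; return
  fi
  # Tier 2: plain digits with a PHI keyword in the left context.
  if [[ $tok =~ $num_re && $ctx =~ $kw_re ]]; then
    echo contextual; return
  fi
  echo none
}
```

Under these assumptions, `classify_token 12345678 "MRN "` prints `contextual`, while `classify_token PID.18 "check "` prints `skip` because the blacklist check runs before any tier.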

Implementation notes:

  * macOS bash 3.2 compatibility: ${pos: -20} returns empty when len < 20;
    use explicit ${pos:$((len-20))} guarded by length check.
  * Per-session decision cache (accept/decline) uses bash 4 associative
    arrays with a 3.2 fallback to pipe-delimited string membership.
  * Confirm-mode prompts only Tier 3-4; Tier 1-2 hits are high-confidence
    and always tokenize even in confirm mode (Bryan: err on caution).
  * Detection loop iterates line-by-line so fenced-code redaction works and
    so left/right context is meaningful per token.

Verification matrix (18/18 pass):

   1   SSN tokenized
   2   Email tokenized
   3   MRN contextual
   4   bare digits skipped
   5   PID.18 skipped
   6   path skipped
   7   version skipped
   8   port skipped
   9   Tier-4 known catches custom category (EMP)
  10   !nophi skips
  11   existing token left alone
  12   read_file .hl7 sanitizes all PHI fields
  13   .py not HL7-shaped
  14   list_dir not HL7-shaped
  15   mode=off skips
  16a  /phi-auto off skips
  16b  /phi-auto on tokenizes
  17   audit JSONL parseable

No regressions to v0.7.2 origin switching, v0.7.1 status-line position,
v0.7.0 HL7 completion + mouse mode, v0.6.9 status state, v0.6.7 streaming,
or any earlier OAuth/SSH/lessons work. MANIFEST unchanged.

Divergence from `af2ffe8` (cited in source comments):

  * Tiered classifier (vs. flat regex set): enables reasoning about WHY a
    value tokenized; gates confirm-mode behavior.
  * Explicit blacklist contexts: addresses spec false-positive cases that
    `af2ffe8` missed (HL7 field refs, ports, error codes, JSON keys).
  * Tool-result surface: `af2ffe8` only ran on user input.
  * Structured JSONL audit log: `af2ffe8` had no per-tokenization log.
  * /phi-auto semantics: on|off|confirm|status (spec) vs. af2ffe8's
    /auto-phi on|off|aggressive|confirm.
  * Dropped the loose "Title Case Title Case" pair detector and its
    name-allowlist: too high FP rate against narrative prose ("Larry
    Anywhere", "Mac Studio") and Bryan's name-allowlist couldn't keep up
    with the long tail. Name detection now Tier-3 (HL7-context only) and
    Tier-4 (already-known) only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>		2026-05-27 17:37:26 -07:00
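
The two bash 3.2 workarounds in the implementation notes can be sketched as below. This is a hedged illustration, not the commit's code: `last_n`, `seen_add`, `seen_has`, and the `SEEN` variable are hypothetical names, and the membership set assumes stored values never contain a `|` character.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; all names here are assumptions, not the shipped code.

# last_n STRING N
# Print the final N characters of STRING, or the whole string if it is shorter.
# Uses an explicit non-negative offset instead of a negative one such as
# ${s: -20}, which the notes report as returning empty for short strings.
last_n() {
  local s=$1 n=$2 len=${#1}
  if (( len > n )); then
    printf '%s' "${s:$((len - n))}"
  else
    printf '%s' "$s"
  fi
}

# Pipe-delimited membership set: a stand-in for an associative array on
# shells that lack `declare -A`. Every member is stored as |value|.
SEEN='|'

# seen_add VALUE: add VALUE to the set unless it is already present.
seen_add() {
  case $SEEN in
    *"|$1|"*) ;;                 # already a member, nothing to do
    *) SEEN="$SEEN$1|" ;;
  esac
}

# seen_has VALUE: succeed (status 0) if VALUE is a member of the set.
seen_has() {
  case $SEEN in
    *"|$1|"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Because both helpers avoid `declare -A` and negative offsets, the same code path can run on bash 3.2 and on newer bash, at the cost of a linear scan per membership check.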
csv-to-table.sh	v0.4.1: each / each-site / len2nl / csv-to-table / table-to-csv	2026-05-26 11:05:19 -07:00
each-site.sh	v0.4.1: each / each-site / len2nl / csv-to-table / table-to-csv	2026-05-26 11:05:19 -07:00
each.sh	v0.4.1: each / each-site / len2nl / csv-to-table / table-to-csv	2026-05-26 11:05:19 -07:00
hl7-desanitize.sh	v0.3.3: PHI sanitize/desanitize + {{phi:...}} prompt preprocessing	2026-05-26 10:29:20 -07:00
hl7-diff.sh	v0.3.0: initial release of Larry-Anywhere	2026-05-26 09:46:20 -07:00
hl7-field.sh	v0.3.4: field-name aliases, dot/dash syntax, ops (=, !=, ~, !~), new formats	2026-05-26 10:35:46 -07:00
hl7-sanitize.sh	v0.7.3: automatic PHI detection (tiered detection + blacklist contexts)	2026-05-27 17:37:26 -07:00
hl7-schema.sh	v0.7.0: HL7-aware tab completion + REPL mouse mode	2026-05-27 16:15:11 -07:00
journal.sh	v0.3.0: initial release of Larry-Anywhere	2026-05-26 09:46:20 -07:00
len2nl.sh	v0.4.1: each / each-site / len2nl / csv-to-table / table-to-csv	2026-05-26 11:05:19 -07:00
lessons.sh	v0.3.2: lesson capture (local-first learning loop)	2026-05-26 10:00:37 -07:00
nc-create-thread.sh	v0.4.2: operational layer — engine ctrl, tables CRUD, xlate viz, smat-diff, create-thread, tclgen	2026-05-26 11:11:30 -07:00
nc-diff-interface.sh	v0.3.0: initial release of Larry-Anywhere	2026-05-26 09:46:20 -07:00
nc-document.sh	v0.3.0: initial release of Larry-Anywhere	2026-05-26 09:46:20 -07:00
nc-engine.sh	v0.4.2: operational layer — engine ctrl, tables CRUD, xlate viz, smat-diff, create-thread, tclgen	2026-05-26 11:11:30 -07:00
nc-find.sh	v0.3.0: initial release of Larry-Anywhere	2026-05-26 09:46:20 -07:00
nc-inbound.sh	v0.3.0: initial release of Larry-Anywhere	2026-05-26 09:46:20 -07:00
nc-insert-protocol.sh	v0.3.0: initial release of Larry-Anywhere	2026-05-26 09:46:20 -07:00
nc-make-jump.sh	v0.3.0: initial release of Larry-Anywhere	2026-05-26 09:46:20 -07:00
nc-msgs.sh	v0.4.0: chain walk, OR/NOT filter groups, numeric/range ops, smat history	2026-05-26 10:58:16 -07:00
nc-parse.sh	v0.4.0: chain walk, OR/NOT filter groups, numeric/range ops, smat history	2026-05-26 10:58:16 -07:00
nc-regression.sh	v0.6.8: cross-env Cloverleaf workflows over SSH ControlMaster	2026-05-27 15:52:58 -07:00
nc-smat-diff.sh	v0.4.2: operational layer — engine ctrl, tables CRUD, xlate viz, smat-diff, create-thread, tclgen	2026-05-26 11:11:30 -07:00
nc-status.sh	v0.4.2: operational layer — engine ctrl, tables CRUD, xlate viz, smat-diff, create-thread, tclgen	2026-05-26 11:11:30 -07:00
nc-table.sh	v0.4.2: operational layer — engine ctrl, tables CRUD, xlate viz, smat-diff, create-thread, tclgen	2026-05-26 11:11:30 -07:00
nc-tclgen.sh	v0.4.2: operational layer — engine ctrl, tables CRUD, xlate viz, smat-diff, create-thread, tclgen	2026-05-26 11:11:30 -07:00
nc-xlate.sh	v0.4.2: operational layer — engine ctrl, tables CRUD, xlate viz, smat-diff, create-thread, tclgen	2026-05-26 11:11:30 -07:00
oauth.sh	v0.6.6: strip CR from jq output + 0600 oauth file + TAB slash completion	2026-05-27 15:18:51 -07:00
ssh-helper.sh	v0.6.8: cross-env Cloverleaf workflows over SSH ControlMaster	2026-05-27 15:52:58 -07:00
table-to-csv.sh	v0.4.1: each / each-site / len2nl / csv-to-table / table-to-csv	2026-05-26 11:05:19 -07:00