Bryan asked for an easier-to-remember inline PHI marker than {{phi:VALUE}}
and for name forms like SMITH^JOHN / Smith, John / John Smith / JOHN SMITH
to all collapse to the same hash. Both shipped.
INLINE SYNTAX (in addition to the legacy {{phi:VALUE}} which still works):
  @@VALUE      unbracketed: VALUE has no whitespace
               e.g. @@12345  @@SMITH^JOHN  @@V789
  @@VALUE@@    bracketed: VALUE may contain spaces
               e.g. @@John Smith@@  @@Smith, John@@
Parser is 2-pass to disambiguate mixed forms in the same prompt: bracketed
markers are matched first (via grep -oE with a regex that excludes leading/
trailing whitespace inside the brackets), then the unbracketed pass scans
the remaining text. Verified against:
"look for @@12345 in PID.3 for @@John Smith@@ DOB @@01/15/1985 ..."
extracts 4 markers correctly and routes each to its category.
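A minimal sketch of the 2-pass idea, not the shipped parser. The function
name extract_markers and both regexes are assumptions for illustration;
only the use of grep -oE and the bracketed-first ordering come from the
description above. The sample input is the visible part of the verified
prompt (the elided remainder is not reproduced).

```shell
#!/bin/sh
# Hypothetical helper: prints bracketed markers first, then unbracketed ones.
extract_markers() (
  # Assumed regex: the value must start and end with a non-space, non-@ char,
  # so "@@12345 in PID.3 for @@" cannot be mistaken for a bracketed marker.
  bracketed='@@[^@[:space:]]([^@]*[^@[:space:]])?@@'

  # Pass 1: bracketed markers.
  printf '%s\n' "$1" | grep -oE "$bracketed"

  # Pass 2: blank out the bracketed matches, then scan what remains.
  printf '%s\n' "$1" \
    | sed -E "s/$bracketed/ /g" \
    | grep -oE '@@[^@[:space:]]+' || true
)

# extract_markers 'look for @@12345 in PID.3 for @@John Smith@@ DOB @@01/15/1985'
```

One known gap in this sketch: an unbracketed marker immediately followed by
punctuation and another marker (e.g. "@@12345,@@John Smith@@") would be read
as bracketed.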
AUTO-CATEGORY DETECTION (lib/hl7-sanitize.sh: detect_category):
  4-15 digits, nothing else    → MRN
  9 digits with dashes         → SSN
  date-shaped                  → DOB
  contains caret or comma      → NAME
  2+ alpha tokens              → NAME
  anything else                → MANUAL
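The rule order above could be sketched as follows. This is an illustration,
not the body of detect_category in lib/hl7-sanitize.sh: the helper name
matches and every regex are assumptions, in particular what counts as
"date-shaped" (here MM/DD/YYYY or YYYY-MM-DD) and the SSN dash layout
(3-2-4).

```shell
#!/bin/sh
# Hypothetical helper: true if value $1 matches extended regex $2.
matches() { printf '%s\n' "$1" | grep -Eq "$2"; }

# Sketch of the category rules, checked in the order listed in the commit.
detect_category() {
  if   matches "$1" '^[0-9]{4,15}$';               then echo MRN
  elif matches "$1" '^[0-9]{3}-[0-9]{2}-[0-9]{4}$'; then echo SSN
  # Assumed date shapes; the real rule may accept more formats.
  elif matches "$1" '^([0-9]{1,2}/[0-9]{1,2}/[0-9]{4}|[0-9]{4}-[0-9]{2}-[0-9]{2})$'
                                                    then echo DOB
  elif matches "$1" '[,^]';                         then echo NAME
  elif matches "$1" '^[[:alpha:]]+([[:space:]]+[[:alpha:]]+)+$'
                                                    then echo NAME
  else echo MANUAL
  fi
}
```

Under these assumed patterns, @@V789 falls through to MANUAL because it is a
single token that is neither all digits nor all letters.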
CANONICALIZATION (lib/hl7-sanitize.sh: normalize_value):
  NAME:        lowercase, replace ',^/' with spaces, sort unique alpha tokens
               SMITH^JOHN / Smith John / John Smith / JOHN SMITH → "john smith"
  DOB:         parse to YYYY-MM-DD (GNU date, with BSD date fallback)
  SSN:         strip dashes/whitespace
  MRN/MANUAL:  trim outer whitespace only
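A rough sketch of the per-category rules, assuming a normalize_value
CATEGORY VALUE calling convention (the real signature is not shown in this
commit). The GNU-vs-BSD detection via "date --version" and the MM/DD/YYYY
input format on the BSD branch are assumptions too.

```shell
#!/bin/sh
# Sketch only: canonicalize VALUE ($2) according to CATEGORY ($1).
normalize_value() {
  case $1 in
    NAME)
      # lowercase, separators to spaces, one token per line,
      # keep purely alphabetic tokens, sort unique, rejoin with spaces
      printf '%s' "$2" \
        | tr '[:upper:]' '[:lower:]' \
        | tr ',^/' '   ' \
        | tr -s '[:space:]' '\n' \
        | grep -E '^[a-z]+$' \
        | LC_ALL=C sort -u \
        | paste -sd ' ' -
      ;;
    DOB)
      if date --version >/dev/null 2>&1; then
        date -d "$2" +%Y-%m-%d                   # GNU date
      else
        date -j -f '%m/%d/%Y' "$2" +%Y-%m-%d     # BSD date, assumed input format
      fi
      ;;
    SSN)
      printf '%s\n' "$2" | sed -E 's/[-[:space:]]//g'
      ;;
    *)
      # MRN / MANUAL: trim outer whitespace only
      printf '%s\n' "$2" | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//'
      ;;
  esac
}
```

Sorting the tokens is what makes "Smith, John" and "John Smith" collide on
purpose; the cost is that two different people whose names are permutations
of the same tokens also share a canonical form.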
TABLE SCHEMA bumped to 4 columns (token / category / canonical / original).
Legacy 3-column rows still read fine: lookups key on column 3, which holds
the canonical form in new rows and the raw value in legacy rows. A legacy
value that differs from its canonical form simply misses and gets a new
token; nothing is corrupted. Detokenize prefers column 4 and falls back to
column 3 for legacy rows.
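The mixed-schema read path could look roughly like this. The function names
lookup_token and detokenize and the awk approach are assumptions; only the
column layout and the column-4-then-column-3 preference come from the notes
above.

```shell
#!/bin/sh
# Sketch: find the token whose column 3 equals the canonical value $1
# in table file $2. Appending "" forces a string comparison, so an MRN
# such as 0012 never matches 12 numerically.
lookup_token() {
  KEY=$1 awk -F '\t' '($3 "") == (ENVIRON["KEY"] "") { print $1; exit }' "$2"
}

# Sketch: print the original for token $1 from table file $2.
# New rows carry it in column 4; legacy rows only have column 3.
detokenize() {
  TOK=$1 awk -F '\t' '
    ($1 "") == (ENVIRON["TOK"] "") {
      print ((NF >= 4 && $4 != "") ? $4 : $3)
      exit
    }' "$2"
}
```

Passing the key through the environment rather than awk -v avoids awk's
backslash-escape processing on the value.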
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bryan's ask: use Larry on prod data without PHI ever leaving the client box.
Added:
lib/hl7-sanitize.sh — tokenize PHI fields in HL7 messages
lib/hl7-desanitize.sh — reverse op (local view-time unmask)
Tokenization model:
- Replace PHI fields with [[CATEGORY_NNNN]] tokens (MRN, NAME, DOB,
  ADDR, PHONE, ACCT, SSN, PROV, VISIT, etc.)
- Same value → same token across messages (deterministic via local
  lookup table, so analysis can still correlate patients).
- Lookup table at $LARRY_HOME/sanitize/lookup.tsv, mode 0600; it never
  leaves the client.
- Default PHI rule set covers PID, PV1, NK1, GT1, IN1, OBR, OBX,
  DG1, ORC; use --rules-file to extend.
- --strict also tokenizes unknown Z segments wholesale.
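The "same value → same token" property can be sketched as a lookup-or-append
against the TSV table. Everything here is illustrative: the function name
tokenize_value, its argument order, the 3-column row layout, and the
per-category counter used to fill NNNN are assumptions, not the code in
lib/hl7-sanitize.sh.

```shell
#!/bin/sh
# Sketch: print the token for VALUE ($2) in CATEGORY ($1), using table file $3.
# Runs in a subshell so umask and variables do not leak to the caller.
tokenize_value() (
  category=$1 value=$2 table=$3
  umask 077                              # a newly created table gets mode 0600
  [ -e "$table" ] || : > "$table"

  # Reuse the existing token if this value was seen before.
  token=$(VALUE=$value awk -F '\t' \
    '($3 "") == (ENVIRON["VALUE"] "") { print $1; exit }' "$table")

  if [ -z "$token" ]; then
    # Assumed numbering: next sequence number within the category.
    next=$(CAT=$category awk -F '\t' \
      '$2 == ENVIRON["CAT"] { n++ } END { print n + 1 }' "$table")
    token=$(printf '[[%s_%04d]]' "$category" "$next")
    printf '%s\t%s\t%s\n' "$token" "$category" "$value" >> "$table"
  fi
  printf '%s\n' "$token"
)
```

Because the mapping lives only in the local table, the same MRN seen in two
messages yields the same token without any hash of the value being derivable
from the token itself.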
Prompt-side preprocessing in larry.sh:
- {{phi:VALUE}} inline marker, auto-category lookup
- {{phi:CATEGORY:VALUE}} inline marker, explicit category
- Markers are replaced with their tokens BEFORE the user input enters
  conversation history, so the original never reaches the API.
- Local feedback "phi> {{phi:...}} → [[TOKEN]]" is printed to the
  terminal only.
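A sketch of the replace-before-history step, under loud assumptions:
mask_prompt and stub_token are hypothetical names, the stub always issues
sequence 0001 instead of consulting the lookup table, and markers without an
explicit category are labeled MANUAL here where the real code auto-detects.

```shell
#!/bin/sh
# Stand-in for the real tokenizer (hypothetical): always returns NNNN = 0001.
stub_token() { printf '[[%s_0001]]' "$1"; }

# Sketch: replace every {{phi:...}} marker in $1 with a token.
# The masked text goes to stdout; feedback lines go to stderr only.
mask_prompt() (
  text=$1
  while :; do
    marker=$(printf '%s\n' "$text" | grep -oE '\{\{phi:[^{}]*\}\}' | head -n 1)
    [ -n "$marker" ] || break
    body=$(printf '%s\n' "$marker" | sed -E 's/^\{\{phi:(.*)\}\}$/\1/')
    if printf '%s\n' "$body" | grep -Eq '^[A-Z]+:.'; then
      category=${body%%:*}      # explicit {{phi:CATEGORY:VALUE}} form
    else
      category=MANUAL           # placeholder for auto-detection
    fi
    token=$(stub_token "$category")
    printf 'phi> %s → %s\n' "$marker" "$token" >&2
    # Replace the first occurrence of the marker with the token.
    text=${text%%"$marker"*}$token${text#*"$marker"}
  done
  printf '%s\n' "$text"
)
```

Only the stdout of mask_prompt would be appended to the conversation; the
stderr feedback stays on the terminal. A value that itself begins with
uppercase letters and a colon would be misread as an explicit category in
this sketch.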
New REPL slash commands:
  /phi <value>       tokenize a single value, print the token
  /unmask <token>    show original (local terminal only, never API)
  /tokens            show full PHI ↔ token lookup table
New tools in larry.sh schema:
  hl7_sanitize                         agent can sanitize a file before reading PHI
  tokenize-value / detokenize-value    subcommands of hl7-sanitize.sh
Persona update (agents/larry.md):
- Documented PHI mode and rules for proactive sanitize-first behavior
MANUAL.md updated with the full PHI section including limitations.
Brings total native tools to 29.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>