cloverleaf-larry

bryan/cloverleaf-larry


Commit Graph

Author	SHA1	Message	Date


Bryan Johnson

c2bba7be90

v0.5.5: @@VALUE inline PHI syntax + name canonicalization

Bryan asked for an easier-to-remember inline PHI marker than {{phi:VALUE}}
and for name forms like SMITH^JOHN / Smith, John / John Smith / JOHN SMITH
to all resolve to the same token. Both shipped.

INLINE SYNTAX (in addition to the legacy {{phi:VALUE}} which still works):
  @@VALUE         unbracketed — VALUE has no whitespace
                  e.g. @@12345  @@SMITH^JOHN  @@V789
  @@VALUE@@       bracketed   — VALUE may contain spaces
                  e.g. @@John Smith@@  @@Smith, John@@

Parser is 2-pass to disambiguate mixed forms in the same prompt: bracketed
markers are matched first (via grep -oE with a regex that excludes leading/
trailing whitespace inside the brackets), then the unbracketed pass scans
the remaining text. Verified against:
  "look for @@12345 in PID.3 for @@John Smith@@ DOB @@01/15/1985 ..."
extracts 4 markers correctly and routes each to its category.
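
The two-pass idea can be sketched as below. This is an illustration only, not
the code in larry.sh: the function name extract_markers is hypothetical, and
the regex assumes a bracketed VALUE never contains "@".

```shell
# Hypothetical helper; extract_markers is not a name taken from larry.sh.
# Prints one marker value per line: bracketed values first, then unbracketed.
extract_markers() (
  text=$1
  # Bracketed form: @@VALUE@@ where VALUE starts and ends with a
  # non-space character and (assumption) contains no "@".
  bracketed='@@[^@[:space:]]([^@]*[^@[:space:]])?@@'

  # Pass 1: pull out bracketed markers and strip the delimiters.
  printf '%s\n' "$text" | grep -oE "$bracketed" | sed -E 's/^@@//; s/@@$//'

  # Blank out what pass 1 consumed so pass 2 cannot see it.
  rest=$(printf '%s\n' "$text" | sed -E "s/$bracketed/ /g")

  # Pass 2: unbracketed @@VALUE runs up to the next whitespace.
  printf '%s\n' "$rest" | grep -oE '@@[^@[:space:]]+' | sed -E 's/^@@//'
)
```

Because the bracketed pattern refuses whitespace right after the opening @@ and
right before the closing @@, "@@12345 in PID.3 for @@John" cannot be mistaken
for one long bracketed marker.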

AUTO-CATEGORY DETECTION (lib/hl7-sanitize.sh: detect_category):
  pure digits 4-15      → MRN
  9 digits with dashes  → SSN
  date-shaped           → DOB
  caret or comma        → NAME
  2+ alpha tokens       → NAME
  else                  → MANUAL
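
A rough sketch of how such a rule cascade might look, checked top to bottom in
the order listed above. The exact patterns are assumptions (for example, SSN
is taken to be NNN-NN-NNNN and "date-shaped" to mean M/D/YYYY or YYYY-MM-DD);
the real detect_category may differ.

```shell
# Hypothetical re-implementation; the patterns below are assumptions.
detect_category() (
  v=$1
  # Small helper: does the value match the given extended regex?
  matches() { printf '%s\n' "$v" | grep -Eq "$1"; }

  if   matches '^[0-9]{4,15}$'; then echo MRN
  elif matches '^[0-9]{3}-[0-9]{2}-[0-9]{4}$'; then echo SSN
  elif matches '^([0-9]{1,2}/[0-9]{1,2}/[0-9]{4}|[0-9]{4}-[0-9]{2}-[0-9]{2})$'; then echo DOB
  elif matches '[,^]'; then echo NAME
  elif matches '^[[:alpha:]]+([[:space:]]+[[:alpha:]]+)+$'; then echo NAME
  else echo MANUAL
  fi
)
```

With this ordering, an undashed 9-digit value is classified as MRN rather than
SSN, since the pure-digits rule is tested first.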

CANONICALIZATION (lib/hl7-sanitize.sh: normalize_value):
  NAME: lowercase, replace ',^/' with spaces, sort unique alpha tokens
        SMITH^JOHN / Smith, John / John Smith / JOHN SMITH → "john smith"
  DOB:  parse to YYYY-MM-DD (GNU date or BSD date fallback)
  SSN:  strip dashes/whitespace
  MRN/MANUAL: trim outer whitespace only
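
The rules above could be sketched roughly as follows. This is a hedged
illustration, not the shipped normalize_value: the argument order (CATEGORY
then VALUE) and the BSD input format '%m/%d/%Y' are assumptions.

```shell
# Hypothetical sketch: normalize_value CATEGORY VALUE
normalize_value() (
  category=$1 value=$2
  case $category in
    NAME)
      # Lowercase, turn ',^/' into spaces, keep unique alpha tokens, sort.
      printf '%s\n' "$value" |
        tr '[:upper:]' '[:lower:]' |
        tr ',^/' '   ' |
        tr -s '[:space:]' '\n' |
        grep -E '^[[:alpha:]]+$' |
        LC_ALL=C sort -u |
        paste -sd ' ' -
      ;;
    DOB)
      # Try GNU date first, then BSD date; fall back to the raw value.
      date -d "$value" +%F 2>/dev/null ||
        date -j -f '%m/%d/%Y' "$value" +%F 2>/dev/null ||
        printf '%s\n' "$value"
      ;;
    SSN)
      # Strip dashes and whitespace.
      printf '%s\n' "$value" | sed -E 's/[[:space:]-]//g'
      ;;
    *)
      # MRN / MANUAL: trim outer whitespace only.
      printf '%s\n' "$value" | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//'
      ;;
  esac
)
```

Sorting the tokens is what makes word order irrelevant, so "Smith, John" and
"John Smith" end up with the same canonical form.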

TABLE SCHEMA bumped to 4 columns (token / category / canonical / original).
Legacy 3-column rows still read fine: lookups key on column 3, which is
"canonical" in new rows and "value" in legacy rows. If a legacy raw value
does not match its canonical form, the lookup misses and a new token is
created; nothing is corrupted. Detokenize prefers column 4 and falls back
to column 3 for legacy compat.
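
A minimal sketch of the column-4-then-column-3 fallback, assuming a
tab-separated table. The function name detokenize and its arguments are
hypothetical; the real detokenize-value subcommand may be structured
differently.

```shell
# Hypothetical sketch: detokenize TOKEN TABLE_PATH
# Prints the original value for TOKEN from a tab-separated lookup table.
detokenize() (
  # Pass the token through the environment so awk does not interpret
  # backslash escapes in it; "" forces a string comparison.
  TOK=$1 awk -F '\t' '
    ($1 "") == (ENVIRON["TOK"] "") {
      # New rows: column 4 holds the original. Legacy rows: only column 3.
      print (NF >= 4 && $4 != "" ? $4 : $3)
      exit
    }' "$2"
)
```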

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 10:11:18 -07:00

Bryan Johnson

b9415f3b57

v0.3.3: PHI sanitize/desanitize + {{phi:...}} prompt preprocessing

Bryan's ask: use Larry on prod data without PHI ever leaving the client box.

Added:
  lib/hl7-sanitize.sh       — tokenize PHI fields in HL7 messages
  lib/hl7-desanitize.sh     — reverse op (local view-time unmask)

Tokenization model:
  - Replace PHI fields with [[CATEGORY_NNNN]] tokens (MRN, NAME, DOB,
    ADDR, PHONE, ACCT, SSN, PROV, VISIT, etc.)
  - Same value → same token across messages (deterministic via local
    lookup table; analysis can still correlate patients).
  - Lookup table at $LARRY_HOME/sanitize/lookup.tsv mode 0600 — never
    leaves the client.
  - Default PHI rule set covers PID, PV1, NK1, GT1, IN1, OBR, OBX,
    DG1, ORC; --rules-file to extend.
  - --strict also tokenizes unknown Z segments wholesale.
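
The "same value, same token" behaviour can be sketched with a small lookup
function over a 3-column table (token / category / value). This is an
assumption-laden illustration: the name tokenize_value, its argument order,
and the per-category counter scheme are hypothetical, not taken from
lib/hl7-sanitize.sh.

```shell
# Hypothetical sketch: tokenize_value CATEGORY VALUE TABLE_PATH
tokenize_value() (
  category=$1 value=$2 table=$3
  umask 077                       # new files are private to the user
  : >> "$table"                   # create the table if it is missing
  chmod 600 "$table"

  # Reuse an existing token when this category/value pair is known.
  # Appending "" forces string comparison, so 012345 and 12345 stay distinct.
  token=$(C=$category V=$value awk -F '\t' '
    ($2 "") == (ENVIRON["C"] "") && ($3 "") == (ENVIRON["V"] "") { print $1; exit }
  ' "$table")

  if [ -z "$token" ]; then
    # Next sequence number for this category (assumed numbering scheme).
    n=$(C=$category awk -F '\t' '
      ($2 "") == (ENVIRON["C"] "") { n++ } END { print n + 1 }
    ' "$table")
    token=$(printf '[[%s_%04d]]' "$category" "$n")
    printf '%s\t%s\t%s\n' "$token" "$category" "$value" >> "$table"
  fi
  printf '%s\n' "$token"
)
```

Because the table is consulted before a new token is minted, repeated values
map to one token, which is what lets analysis correlate patients across
messages without seeing the underlying value.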

Prompt-side preprocessing in larry.sh:
  - {{phi:VALUE}}             inline marker, auto-category lookup
  - {{phi:CATEGORY:VALUE}}    explicit category
  - Replaced with the token BEFORE the user input enters conversation
    history. The original never reaches the API.
  - Local feedback "phi> {{phi:...}} → [[TOKEN]]" printed to terminal only.
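
One way the two marker forms might be told apart is sketched below. The
helper name parse_phi_marker, the AUTO placeholder, and the rule that a
category is an all-uppercase prefix before the first colon are assumptions
made for illustration; larry.sh may parse markers differently.

```shell
# Hypothetical sketch: parse_phi_marker '{{phi:...}}'
# Prints "CATEGORY<TAB>VALUE"; CATEGORY is AUTO when none was given.
parse_phi_marker() (
  LC_ALL=C                        # make the [A-Z] range ASCII-only
  open='{{phi:' close='}}'
  body=${1#"$open"}               # drop the leading {{phi:
  body=${body%"$close"}           # drop the trailing }}
  category=${body%%:*}            # text before the first colon, if any
  case $body in
    *:*)
      case $category in
        ''|*[!A-Z]*) printf 'AUTO\t%s\n' "$body" ;;       # colon is part of the value
        *)           printf '%s\t%s\n' "$category" "${body#*:}" ;;
      esac
      ;;
    *) printf 'AUTO\t%s\n' "$body" ;;
  esac
)
```

A caller would then resolve AUTO through category detection and swap the
marker for its token before the prompt is stored or sent.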

New REPL slash commands:
  /phi <value>        tokenize a single value, print the token
  /unmask <token>     show original (local terminal only, never API)
  /tokens             show full PHI ↔ token lookup table

New tools in larry.sh schema:
  hl7_sanitize        agent can sanitize a file before reading PHI
  tokenize-value / detokenize-value (subcommands of hl7-sanitize.sh)

Persona update (agents/larry.md):
  - Documented PHI mode and rules for proactive sanitize-first behavior

MANUAL.md updated with the full PHI section including limitations.

Brings total native tools to 29.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-26 10:29:20 -07:00

2 Commits