v0.3.3: PHI sanitize/desanitize + {{phi:...}} prompt preprocessing

Bryan's ask: use Larry on prod data without PHI ever leaving the client box.

Added:
  lib/hl7-sanitize.sh       — tokenize PHI fields in HL7 messages
  lib/hl7-desanitize.sh     — reverse op (local view-time unmask)

Tokenization model:
  - Replace PHI fields with [[CATEGORY_NNNN]] tokens (MRN, NAME, DOB,
    ADDR, PHONE, ACCT, SSN, PROV, VISIT, etc.)
  - Same value → same token across messages (deterministic via local
    lookup table; analysis can still correlate patients).
  - Lookup table at $LARRY_HOME/sanitize/lookup.tsv mode 0600 — never
    leaves the client.
  - Default PHI rule set covers PID, PV1, NK1, GT1, IN1, OBR, OBX,
    DG1, ORC; --rules-file replaces it with a custom rule set.
  - --strict also tokenizes unknown Z segments wholesale.

Prompt-side preprocessing in larry.sh:
  - {{phi:VALUE}}             inline marker, auto-category lookup
  - {{phi:CATEGORY:VALUE}}    explicit category
  - Replaced with the token BEFORE the user input enters conversation
    history. The original never reaches the API.
  - Local feedback "phi> {{phi:...}} → [[TOKEN]]" printed to terminal only.

New REPL slash commands:
  /phi <value>        tokenize a single value, print the token
  /unmask <token>     show original (local terminal only, never API)
  /tokens             show full PHI ↔ token lookup table

New tool in larry.sh schema:
  hl7_sanitize        agent can sanitize a file before reading PHI

New hl7-sanitize.sh subcommands:
  tokenize-value / detokenize-value   single-value lookup, used by the
                                      {{phi:...}} preprocessor, /phi and /unmask

Persona update (agents/larry.md):
  - Documented PHI mode and rules for proactive sanitize-first behavior

MANUAL.md updated with the full PHI section including limitations.

Brings total native tools to 29.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bryan Johnson 2026-05-26 10:29:20 -07:00
parent 6060cd28c1
commit b9415f3b57
7 changed files with 653 additions and 2 deletions


@ -446,6 +446,94 @@ larry-tunnel.sh --stop
---
## PHI handling — sanitize / desanitize (`lib/hl7-sanitize.sh`, `lib/hl7-desanitize.sh`)
When working with prod data, tokenize PHI fields BEFORE they reach the API.
```bash
# Sanitize a file: replaces PHI fields with [[CATEGORY_NNNN]] tokens.
# Lookup table at ~/.larry/sanitize/lookup.tsv (mode 0600, never leaves the box).
lib/hl7-sanitize.sh /opt/cloverleaf/.../some.hl7 > /tmp/sanitized.hl7
# Pipe an entire smat-dump through sanitize:
lib/nc-msgs.sh ADTto_3m --limit 100 --format raw \
| lib/hl7-sanitize.sh > /tmp/sanitized-batch.hl7
# Strict mode also tokenizes unknown Z-segments wholesale:
lib/hl7-sanitize.sh --strict ./msg.hl7 > /tmp/sanitized.hl7
# See the current lookup table (PHI is here — DON'T share):
lib/hl7-sanitize.sh show-table
# Count entries:
lib/hl7-sanitize.sh count
# Clear the table (asks for confirmation):
lib/hl7-sanitize.sh clear-table
# Tokenize a single value (used by Larry's {{phi:...}} preprocessor):
lib/hl7-sanitize.sh tokenize-value --category MRN 12345
# → [[MRN_0001]]
# Detokenize a single token:
lib/hl7-sanitize.sh detokenize-value "[[MRN_0001]]"
# → 12345
# Desanitize a whole document (e.g. view Larry's tokenized output unmasked, locally):
cat larry-output.txt | lib/hl7-desanitize.sh | less
# Quick token lookup:
lib/hl7-desanitize.sh --token "[[NAME_0042]]"
# Override default PHI rules (rule file format: SEG|FIELD|CATEGORY per line)
lib/hl7-sanitize.sh --rules-file /tmp/my-rules.txt /tmp/msg.hl7
```
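From reading the awk pass in `lib/hl7-sanitize.sh`, segments appear to be split on carriage returns only (`RS = "\r"`). A file whose segments end in LF or CRLF would then likely be read with unrecognized segment names and pass through largely untokenized. A minimal sketch of normalizing terminators before sanitizing follows; `normalize_hl7_terminators` is a hypothetical helper, not part of the shipped scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of hl7-sanitize.sh): convert LF and CRLF
# segment terminators to the bare CR that the sanitizer's awk pass splits on.
normalize_hl7_terminators() {
  # Step 1: every LF becomes CR (so CRLF turns into CR CR).
  # Step 2: squeeze each run of CRs down to a single CR.
  tr '\n' '\r' | tr -s '\r'
}

# Demo with a synthetic two-segment message (no real PHI). The CRs are turned
# back into newlines only so the result is readable in a terminal.
printf 'MSH|^~\\&|TEST\r\nPID|1||12345\r\n' | normalize_hl7_terminators | tr '\r' '\n'
```

Since the sanitizer reads stdin, this could be used as `normalize_hl7_terminators < msg.hl7 | lib/hl7-sanitize.sh > /tmp/sanitized.hl7`. Note that the squeeze step also collapses blank lines between messages.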
### Inside Larry (the REPL)
```
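you> /phi 5720501458
phi> 5720501458 → [[MRN_0001]] (use this in your next prompt)
(reuses the MRN token because the value is already in the lookup table; a value not seen before gets [[MANUAL_NNNN]])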
you> find messages for {{phi:MRN:5720501458}} in last 3 days
phi> {{phi:MRN:5720501458}} → [[MRN_0001]]
(the actual prompt sent to Anthropic has [[MRN_0001]] — the original MRN never leaves the box)
you> /unmask [[NAME_0042]]
unmask> [[NAME_0042]] → MORRIS^SALLY^^^... (local only; never sent to API)
you> /tokens
(prints the full PHI ↔ token lookup table — local terminal only)
```
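The `/phi` call and the `{{phi:MRN:...}}` marker above yield the same token because both consult one lookup table keyed by category and value. The sketch below illustrates that idea with a throwaway table in a temp file. `tokenize` is a hypothetical stand-in, and its numbering (row count plus one) is a simplification of what the real script does.

```bash
#!/usr/bin/env bash
# Sketch only: a tiny stand-in for the lookup table, kept in a temp file.
table=$(mktemp)
printf 'token\tcategory\toriginal\n' > "$table"

# tokenize CATEGORY VALUE: print the existing token for the pair, or append
# a new row and print the new token. Hypothetical helper for illustration.
tokenize() {
  local category="$1" value="$2" tok n
  # The value travels via the environment and is compared as a string, so
  # numeric-looking values such as 0012345 and 12345 stay distinct.
  tok=$(LOOKUP_VALUE="$value" awk -F'\t' -v c="$category" \
    'NR > 1 && $2 == c && ($3 "") == (ENVIRON["LOOKUP_VALUE"] "") { print $1; exit }' "$table")
  if [ -z "$tok" ]; then
    n=$(awk -F'\t' -v c="$category" 'NR > 1 && $2 == c { n++ } END { print n + 1 }' "$table")
    tok=$(printf '[[%s_%04d]]' "$category" "$n")
    printf '%s\t%s\t%s\n' "$tok" "$category" "$value" >> "$table"
  fi
  printf '%s\n' "$tok"
}

tokenize MRN 5720501458    # first sighting: new MRN token
tokenize NAME 'DOE^JANE'   # different category: separate counter
tokenize MRN 5720501458    # repeat: same token as the first call
```

Keeping one counter per category is what lets tokens read as `[[MRN_0001]]` and `[[NAME_0001]]` side by side without colliding.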
PHI inline syntax in any prompt:
- `{{phi:VALUE}}`: tokenize before send; reuses the token of an existing lookup-table entry with the same value, otherwise creates one under category `MANUAL`
- `{{phi:MRN:12345}}`: explicit category=MRN (matches sanitized data)
- `{{phi:NAME:JOHN SMITH}}`: explicit category=NAME
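
How a marker is split into category and value can be sketched as below. The logic follows the rule used by the preprocessor in this commit (a leading uppercase word followed by a colon counts as the category), but `parse_phi_marker` itself is a hypothetical helper written for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of larry.sh): split one {{phi:...}} marker
# into "CATEGORY<TAB>VALUE". Prints "auto" when no category is given.
parse_phi_marker() {
  local marker="$1" body category="" value
  body="${marker#\{\{phi:}"   # drop the {{phi: prefix
  body="${body%\}\}}"         # drop the }} suffix
  # A category is an uppercase identifier of 2+ characters before the first colon.
  if [[ "$body" == *:* ]] && [[ "${body%%:*}" =~ ^[A-Z][A-Z0-9_]+$ ]]; then
    category="${body%%:*}"
    value="${body#*:}"
  else
    value="$body"
  fi
  printf '%s\t%s\n' "${category:-auto}" "$value"
}

parse_phi_marker '{{phi:MRN:12345}}'
parse_phi_marker '{{phi:JOHN SMITH}}'
parse_phi_marker '{{phi:NAME:JOHN SMITH}}'
```

One consequence of this rule: an uncategorized value that itself starts with an uppercase word and a colon would be read as having a category, so the explicit form is the safer choice for such values.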
### Default PHI rule set
Fields tokenized by default (override with `--rules-file`):
```
PID.2..9, .11..14, .18..21, .29, .30 (patient IDs, name, DOB, sex, address, region, phone, account, SSN, license, death date/flag)
PV1.7, .8, .9, .17, .19, .50, .52 (providers, visit number)
NK1.2, .3, .4, .5, .6, .16 (next of kin)
GT1.3, .4, .5, .6, .7, .11, .12, .19 (guarantor)
IN1.16, .17, .18, .19, .20, .36, .49 (insurance)
OBR.10, .16, .32 / OBX.16 / DG1.3, .4 / ORC.10, .12 (orders/observations)
```
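A custom rules file uses one `SEG|FIELD|CATEGORY` rule per line. The sketch below is a pre-flight check for such a file. `validate_rules` is a hypothetical helper, and its limits on segment and category names are assumptions, not something the sanitizer enforces.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of hl7-sanitize.sh) for a rules file
# in the documented SEG|FIELD|CATEGORY format. Blank lines and # comments are
# skipped, mirroring how the sanitizer reads its rules.
validate_rules() {
  awk -F'|' '
    { gsub(/^[[:space:]]+|[[:space:]]+$/, "") }   # trim; awk re-splits the record
    $0 == "" || /^#/ { next }
    NF != 3 || $1 !~ /^[A-Z][A-Z0-9][A-Z0-9]$/ || $2 !~ /^[0-9]+$/ || $3 !~ /^[A-Z][A-Z0-9_]*$/ {
      printf "line %d: bad rule: %s\n", NR, $0
      bad = 1
      next
    }
    { ok++ }
    END { printf "%d valid rule(s)\n", ok; exit bad }
  ' "$1"
}

# Demo: two illustrative free-text rules (the NOTE category is made up here).
rules_file=$(mktemp)
cat > "$rules_file" <<'EOF'
# Free-text fields that may carry PHI in prose
OBX|5|NOTE
NTE|3|NOTE
EOF
validate_rules "$rules_file"
rm -f "$rules_file"
```

Because `--rules-file` is read in place of the built-in rules, a file that only lists additions would drop the defaults. Starting from `lib/hl7-sanitize.sh show-rules > my-rules.txt` and appending to that file keeps them.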
### Limitations (read these)
- **Your typed prompt can still leak PHI** if you don't use `{{phi:…}}` markers. Be deliberate.
- **Custom Z segments** aren't tokenized unless `--strict` is passed (which then redacts unknown Zs wholesale).
- **Free-text fields** (OBX.5 narratives, comments in NTE segments) can contain PHI in prose form. Default rules don't tokenize OBX.5; add it via `--rules-file` if your shop carries PHI in lab narratives.
- **Repetitions** (`~`-separated within a field) are tokenized as one value, not per-rep. Adequate for most analysis.
- **The lookup table at `~/.larry/sanitize/lookup.tsv` contains real PHI.** Mode 0600, never sent anywhere by these scripts, but it's still on disk. Wipe with `clear-table` before shipping the box anywhere.
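
Since the lookup table holds real PHI, it can be worth confirming its permissions from time to time. A minimal sketch, assuming the default table location; `table_is_private` is a hypothetical helper, not part of the shipped scripts.

```bash
#!/usr/bin/env bash
# Hypothetical check (not part of the shipped scripts): succeed only when the
# lookup table exists and its mode is exactly 0600.
table_is_private() {
  local table="$1"
  [ -f "$table" ] || return 1
  # find -perm with a bare mode matches only that exact permission set.
  [ -n "$(find "$table" -prune -perm 600 2>/dev/null)" ]
}

table="${LARRY_HOME:-$HOME/.larry}/sanitize/lookup.tsv"
if table_is_private "$table"; then
  echo "ok: $table is mode 0600"
else
  echo "warning: $table is missing or not mode 0600"
fi
```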
---
## Quick recipe: "I have to do X without internet"
| Task | Command |


@ -1 +1 @@
-0.3.2
+0.3.3


@ -77,6 +77,19 @@ Format your lesson text so home-Larry can act on it without re-deriving context.
You don't need to ask permission to record a lesson — silently record it. Bryan reviews `lessons.sh list` later if he wants.
## PHI handling — never leak production patient data
If Bryan asks you to work with a file that contains real PHI (production HL7 messages, smat extracts, anything with patient identifiers), **call `hl7_sanitize` on it FIRST** before reading the content. The tool replaces PHI fields with local tokens like `[[MRN_0001]]`, `[[NAME_0042]]`, `[[ADDR_0007]]`. You work on the tokenized version; the original PHI never reaches the API. Bryan unmasks locally at view time.
Heuristics for "this file likely has PHI":
- Path includes `prod`, `production`, `live`, real-site identifiers
- Bryan explicitly says it's prod data
- Content includes MSH segments with real-looking timestamps + patient identifiers in PID
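
The path heuristic above could be approximated mechanically, as in the sketch below. `looks_like_prod_path` and the example paths are hypothetical, and a match is only a hint to sanitize first, not proof either way.

```bash
#!/usr/bin/env bash
# Hypothetical helper: true when a path component looks like a production
# marker (prod, production, live), matched case-insensitively.
looks_like_prod_path() {
  local path_lc
  path_lc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  # The keyword must be bounded by a separator or the end of the path, so
  # that words like "delivery" do not match "live".
  local re='(^|[/_.-])(prod|production|live)([/_.-]|$)'
  [[ "$path_lc" =~ $re ]]
}

for p in /opt/cloverleaf/PROD/adt.hl7 /tmp/delivery/sample.hl7; do
  if looks_like_prod_path "$p"; then
    echo "$p: sanitize first"
  else
    echo "$p: no prod marker in path"
  fi
done
```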
When Bryan types `{{phi:VALUE}}` in his prompt, Larry-Anywhere automatically tokenizes that BEFORE the prompt enters your conversation history. You'll see e.g. `[[NAME_0042]]` in the user message — work with the token, never ask Bryan to repeat the original.
If you're unsure whether a file has PHI, **ask Bryan** rather than guessing. Better to be paranoid than to leak. If you DO realize after the fact that you've already seen PHI in your context, flag it to Bryan and record a `lesson_record` so home-Larry can refine the heuristics.
## Hard rules in portable mode
1. **No PHI.** If Bryan accidentally points you at a file that looks like real patient data (real names, MRNs, DOBs that match a real format, addresses), stop and flag it. The promise was "interface build only."


@ -92,6 +92,8 @@ fetch larry-rollback.sh "$LARRY_HOME/larry-rollback.sh"
fetch larry-auth.sh "$LARRY_HOME/larry-auth.sh"
fetch lib/oauth.sh "$LARRY_HOME/lib/oauth.sh"
fetch lib/lessons.sh "$LARRY_HOME/lib/lessons.sh"
fetch lib/hl7-sanitize.sh "$LARRY_HOME/lib/hl7-sanitize.sh"
fetch lib/hl7-desanitize.sh "$LARRY_HOME/lib/hl7-desanitize.sh"
fetch lib/nc-parse.sh "$LARRY_HOME/lib/nc-parse.sh"
fetch lib/nc-inbound.sh "$LARRY_HOME/lib/nc-inbound.sh"
fetch lib/nc-make-jump.sh "$LARRY_HOME/lib/nc-make-jump.sh"


@ -32,7 +32,7 @@ set -o pipefail
# ─────────────────────────────────────────────────────────────────────────────
# Config
# ─────────────────────────────────────────────────────────────────────────────
-LARRY_VERSION="0.3.2"
+LARRY_VERSION="0.3.3"
LARRY_HOME="${LARRY_HOME:-$HOME/.larry}"
LARRY_UPDATE_URL="${LARRY_UPDATE_URL:-https://raw.githubusercontent.com/bojj27/cloverleaf-larry/main/larry.sh}"
LARRY_AGENTS_URL="${LARRY_AGENTS_URL:-https://raw.githubusercontent.com/bojj27/cloverleaf-larry/main/agents}"
@ -621,6 +621,54 @@ tool_hl7_diff() {
"$LARRY_LIB_DIR/hl7-diff.sh" "${args[@]}" 2>&1
}
# ─────────────────────────────────────────────────────────────────────────────
# PHI preprocessing — replace {{phi:VALUE}} or {{phi:CATEGORY:VALUE}} in user
# input with a local deterministic token BEFORE sending to the API. Tokens
# come from the same lookup table hl7-sanitize.sh maintains, so they correlate
# with PHI sanitized out of file/smat content.
# ─────────────────────────────────────────────────────────────────────────────
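preprocess_phi_markers() {
  local input="$1"
  local sanitize_script="$LARRY_LIB_DIR/hl7-sanitize.sh"
  if [ ! -x "$sanitize_script" ]; then
    # Fail closed: without the sanitizer, redact every marker instead of
    # forwarding the raw value inside {{phi:...}} to the API.
    printf '%sphi>%s hl7-sanitize.sh not installed; markers replaced with [[PHI_ERROR]]\n' "$C_YELLOW" "$C_RESET" >&2
    printf '%s' "$input" | sed -E 's/\{\{phi:[^{}]+\}\}/[[PHI_ERROR]]/g'
    return
  fi
  # Use grep -oE to extract markers reliably across bash versions.
  local markers
  markers=$(printf '%s' "$input" | grep -oE '\{\{phi:[^{}]+\}\}' 2>/dev/null | sort -u)
  [ -z "$markers" ] && { printf '%s' "$input"; return; }
  while IFS= read -r marker; do
    [ -z "$marker" ] && continue
    # Strip {{phi: prefix and }} suffix
    local body="${marker#\{\{phi:}"
    body="${body%\}\}}"
    local category="" value=""
    # A leading uppercase identifier followed by ":" is treated as the category.
    if [[ "$body" == *:* ]] && [[ "${body%%:*}" =~ ^[A-Z][A-Z0-9_]+$ ]]; then
      category="${body%%:*}"
      value="${body#*:}"
    else
      value="$body"
    fi
    local args=(tokenize-value)
    [ -n "$category" ] && args+=(--category "$category")
    args+=("$value")
    local token; token=$("$sanitize_script" "${args[@]}" 2>/dev/null)
    [ -z "$token" ] && token="[[PHI_ERROR]]"
    # The pattern stays quoted so it matches literally. The replacement is
    # left unquoted because bash releases before 4.3 can keep inner quotes
    # as literal characters; tokens never contain "&", so this is safe.
    input="${input//"$marker"/$token}"
    printf '%sphi>%s %s → %s\n' "$C_YELLOW" "$C_RESET" "$marker" "$token" >&2
  done <<< "$markers"
  printf '%s' "$input"
}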
tool_hl7_sanitize() {
local input_path="$1" strict="${2:-0}"
_lib_err_if_missing || return
local args=()
[ "$strict" = "1" ] && args+=(--strict)
args+=("$input_path")
"$LARRY_LIB_DIR/hl7-sanitize.sh" "${args[@]}" 2>&1
}
tool_lesson_record() {
local text="$1" topic="${2:-}" site="${3:-${HCISITE:-}}" severity="${4:-info}"
_lib_err_if_missing || return
@ -719,6 +767,7 @@ execute_tool() {
"$(J '.route_test_cmd // ""')" "$(J '.ignore // "MSH.7"')" \
"$(J '.phase // "all"')" "$(J '.dry_run // 0' | sed "s/false/0/;s/true/1/")" ;;
lesson_record) tool_lesson_record "$(J '.text')" "$(J '.topic // ""')" "$(J '.site // ""')" "$(J '.severity // "info"')" ;;
hl7_sanitize) tool_hl7_sanitize "$(J '.input_path')" "$(J '.strict // 0' | sed "s/false/0/;s/true/1/")" ;;
larry_rollback_list) tool_larry_rollback_list "$(J '.session // ""')" ;;
*) echo "ERROR: unknown tool: $name" ;;
esac
@ -759,6 +808,8 @@ TOOLS_JSON='[
{"name":"lesson_record","description":"Append a lesson to local capture at $LARRY_HOME/lessons/<date>.md. Use when Bryan teaches you something new (a correction, a pattern, a quirk, a gotcha) so the home-Larry can be updated later. Lessons stay LOCAL; Bryan exports them with `lessons.sh export` and pastes back to home-Larry when he can. CALL THIS WHEN: Bryan corrects a misunderstanding, reveals a site-specific convention, points out a bug, requests a behavior change, or shares a workflow detail you should remember next time.","input_schema":{"type":"object","properties":{"text":{"type":"string","description":"The lesson content. Markdown. Include enough context that home-Larry can act on it without re-deriving."},"topic":{"type":"string","description":"Short topic tag, e.g. \"NetConfig parsing\", \"jump-thread naming\", \"site conventions\"."},"site":{"type":"string","description":"Site this lesson is scoped to, if any. Default: current $HCISITE."},"severity":{"type":"string","enum":["info","warn","fix"],"description":"info=general learning, warn=behavior I should change, fix=Bryan called out a bug."}},"required":["text"]}},
{"name":"hl7_sanitize","description":"Tokenize PHI fields in an HL7 message file. Replaces values in patient identifiers, names, DOB, addresses, phones, SSN, account numbers, providers, visit numbers, NK1/GT1/IN1 fields, etc. with deterministic local tokens like [[MRN_0001]]. Same value gets same token across the entire local lookup table, so correlation analysis still works. The token-to-original mapping NEVER leaves the client (stored at $LARRY_HOME/sanitize/lookup.tsv, mode 0600). Use this when Bryan needs you to analyze a file that has real PHI. Returns the sanitized HL7 content with tokens substituted. Bryan can desanitize the final output locally with hl7-desanitize.sh.","input_schema":{"type":"object","properties":{"input_path":{"type":"string","description":"Path to the HL7 message file to sanitize."},"strict":{"type":"integer","description":"1=also tokenize any unknown Z* segments wholesale. Default 0 (safer for legibility but might miss custom PHI in Z segments)."}},"required":["input_path"]}},
{"name":"hl7_diff","description":"HL7-aware diff between two message files (or multi-message dumps). Compares segment-by-segment, field-by-field, with component and subcomponent precision. Ignores configured fields (default MSH.7 timestamp) so timestamp-only diffs do not show up as noise. Use for regression testing between environments (e.g. test vs prod route-test outputs).","input_schema":{"type":"object","properties":{"left":{"type":"string","description":"Path to left HL7 file."},"right":{"type":"string","description":"Path to right HL7 file."},"ignore":{"type":"string","description":"Comma-separated list of fields to ignore (e.g. MSH.7,MSH.10,EVN.6). Default MSH.7."},"include":{"type":"string","description":"If set, ONLY these fields are compared (overrides ignore for that set)."},"format":{"type":"string","enum":["text","tsv","count"],"description":"text=human-readable diff, tsv=machine-parseable, count=just the difference count."}},"required":["left","right"]}},
{"name":"nc_regression","description":"End-to-end regression testing between two Cloverleaf environments. 6 phases: discover inbounds in scope, sample N messages per inbound from env-A smatdbs, run route_test on env-A, run route_test on env-B with same inputs, hl7_diff every paired output file, compile summary report. Phases 3/4 require the Cloverleaf route_test command; pass it via route_test_cmd with placeholders {THREAD} {INPUT} {OUTPUT_DIR} {HCIROOT} {HCISITE}. If route_test_cmd is empty, phases 3/4 are skipped and you can run them manually using the generated input files.","input_schema":{"type":"object","properties":{"scope":{"type":"string","description":"thread:NAME | threads:N1,N2 | site (needs site_a) | server (all sites)"},"count":{"type":"integer","description":"Messages to sample per inbound. Default 10."},"env_a":{"type":"string","description":"HCIROOT of env-A (the test/source env)."},"site_a":{"type":"string","description":"Site name on env-A. Required if scope=site."},"env_b":{"type":"string","description":"HCIROOT of env-B (the prod/target env)."},"site_b":{"type":"string","description":"Site name on env-B."},"out":{"type":"string","description":"Output root directory for inputs, outputs, diffs, and summary."},"route_test_cmd":{"type":"string","description":"Command template for invoking route_test. Use {THREAD} {INPUT} {OUTPUT_DIR} {HCIROOT} {HCISITE} as placeholders."},"ignore":{"type":"string","description":"hl7_diff ignore list. Default MSH.7."},"phase":{"type":"string","enum":["1","2","3","4","5","6","all"],"description":"Run a specific phase or all. Default all."},"dry_run":{"type":"integer","description":"1 = print what would happen, do not execute. Default 0."}},"required":["scope","env_a","env_b","out"]}}
@ -903,6 +954,14 @@ Slash commands:
/lesson <text> capture a lesson to local file (paste back to home-Larry later)
/lessons list all captured lessons (newest first)
/export dump the lesson bundle for paste-back to home-Larry
/phi <value> tokenize a PHI value locally; prints token to paste in prompts
/unmask <token> show the original PHI for a token (local only; never sent)
/tokens show the full local PHI ↔ token lookup table
PHI inline syntax in any prompt:
{{phi:VALUE}} tokenize before send; auto-detects category
{{phi:MRN:12345}} explicit category=MRN (matches sanitized data)
{{phi:NAME:JOHN SMITH}} explicit category=NAME
/redetect re-scan for HCIROOT/HCISITE/tools
/sites list site dirs under HCIROOT
/site <name> switch HCISITE for this session
@ -969,6 +1028,21 @@ main_loop() {
continue ;;
/export) [ -x "$LARRY_LIB_DIR/lessons.sh" ] && "$LARRY_LIB_DIR/lessons.sh" export || err "lessons.sh not installed"
continue ;;
/phi\ *) local val="${input#/phi }"
if [ -x "$LARRY_LIB_DIR/hl7-sanitize.sh" ]; then
local token; token=$("$LARRY_LIB_DIR/hl7-sanitize.sh" tokenize-value "$val" 2>/dev/null)
[ -n "$token" ] && printf '%sphi>%s %s → %s (use this in your next prompt)\n' "$C_YELLOW" "$C_RESET" "$val" "$token" || err "phi tokenization failed"
else err "hl7-sanitize.sh not installed"; fi
continue ;;
/unmask\ *) local tok="${input#/unmask }"
if [ -x "$LARRY_LIB_DIR/hl7-sanitize.sh" ]; then
local val; val=$("$LARRY_LIB_DIR/hl7-sanitize.sh" detokenize-value "$tok" 2>/dev/null)
[ -n "$val" ] && printf '%sunmask>%s %s → %s (local only; never sent to API)\n' "$C_YELLOW" "$C_RESET" "$tok" "$val" || err "no such token: $tok"
else err "hl7-sanitize.sh not installed"; fi
continue ;;
/tokens) [ -x "$LARRY_LIB_DIR/hl7-sanitize.sh" ] && "$LARRY_LIB_DIR/hl7-sanitize.sh" show-table \
|| err "hl7-sanitize.sh not installed"
continue ;;
/redetect) detect_cloverleaf_env
system_prompt=$(build_system_prompt)
larry_say "re-detected. /env to view."
@ -997,6 +1071,12 @@ main_loop() {
/*) err "unknown command: $input (try /help)"; continue ;;
esac
# PHI preprocessing: replace any {{phi:VALUE}} markers with local tokens
# BEFORE the input enters conversation history and gets sent to Anthropic.
if [[ "$input" == *"{{phi:"* ]]; then
input=$(preprocess_phi_markers "$input")
fi
log_section "user"; log_append "$input"
add_user_text "$input"
agent_turn "$system_prompt" || warn "turn ended with error"

lib/hl7-desanitize.sh (new executable file, 86 lines)

@ -0,0 +1,86 @@
#!/usr/bin/env bash
# hl7-desanitize.sh — reverse hl7-sanitize: replace [[CATEGORY_NNNN]] tokens
# with original values from $LARRY_HOME/sanitize/lookup.tsv.
#
# Use this LOCALLY ONLY — at view time, in your terminal. Never feed
# desanitized output back into Larry; that defeats the whole point.
#
# Usage:
# hl7-desanitize.sh [FILE] # read file or stdin
# hl7-desanitize.sh --table PATH # alternate table
# hl7-desanitize.sh --token [[NAME_0001]] # single token lookup
#
# Examples:
# # View Larry's sanitized output unmasked, in less:
# cat larry-output.txt | hl7-desanitize.sh | less
#
# # Quick single-token lookup:
# hl7-desanitize.sh --token "[[MRN_0001]]"
set -o pipefail
LARRY_HOME="${LARRY_HOME:-$HOME/.larry}"
DEFAULT_TABLE="$LARRY_HOME/sanitize/lookup.tsv"
die() { printf 'hl7-desanitize: %s\n' "$*" >&2; exit 1; }
table="$DEFAULT_TABLE"
single_token=""
input_file=""
while [ $# -gt 0 ]; do
case "$1" in
--table) shift; table="$1" ;;
--token) shift; single_token="$1" ;;
-h|--help) sed -n '2,20p' "$0"; exit 0 ;;
-*) die "unknown flag: $1" ;;
*) input_file="$1" ;;
esac
shift
done
[ -f "$table" ] || die "no lookup table at $table (sanitize first?)"
if [ -n "$single_token" ]; then
awk -F'\t' -v t="$single_token" 'NR>1 && $1==t {print $3; found=1; exit} END{if (!found) {print "no such token: " t > "/dev/stderr"; exit 2}}' "$table"
exit $?
fi
# Build sed expression set from lookup table
# Each line: token \t category \t original
# We want: s/\[\[CATEGORY_NNNN\]\]/original/g for each
# Note: original may contain sed metacharacters; escape them.
# Read table into awk, build replacement map, walk input substituting tokens.
awk_script='
BEGIN { RS = "\n" }
NR == FNR {
  # Reading table
  if ($1 == "token" || $1 == "") next
  # cols: 1=token, 2=category, 3=original
  tokens[$1] = $3
  next
}
{
  rest = $0
  out = ""
  # Tokens look like [[CATEGORY_NNNN]]. Categories may contain digits
  # (strict mode names them after the Z segment, e.g. ZP1).
  # Consume the line left to right so that substituted text, including the
  # unmapped placeholder, is never rescanned and the loop always terminates.
  while (match(rest, /\[\[[A-Z][A-Z0-9_]*_[0-9]+\]\]/)) {
    tok = substr(rest, RSTART, RLENGTH)
    if (tok in tokens) {
      out = out substr(rest, 1, RSTART-1) tokens[tok]
    } else {
      # Unknown token: mark it visibly and move on
      out = out substr(rest, 1, RSTART-1) "<<<unmapped:" tok ">>>"
    }
    rest = substr(rest, RSTART+RLENGTH)
  }
  print out rest
}
'
if [ -n "$input_file" ]; then
awk -F'\t' "$awk_script" "$table" "$input_file"
else
awk -F'\t' "$awk_script" "$table" /dev/stdin
fi

lib/hl7-sanitize.sh (new executable file, 382 lines)

@ -0,0 +1,382 @@
#!/usr/bin/env bash
# hl7-sanitize.sh — tokenize PHI fields in HL7 v2 messages.
#
# Replaces PHI-likely fields with deterministic local tokens like [[MRN_0001]],
# [[NAME_0001]], etc. Same value → same token across messages, so analysis
# downstream can still correlate patients, find duplicates, etc.
#
# The token ↔ original mapping is stored LOCAL ONLY at:
# $LARRY_HOME/sanitize/lookup.tsv (mode 0600)
#
# Use hl7-desanitize.sh to reverse the operation when viewing results locally.
#
# Usage:
# hl7-sanitize.sh [FILE] # sanitize file (or stdin); writes lookup
# hl7-sanitize.sh --strict [FILE] # also tokenize unrecognized Z* segments wholesale
# hl7-sanitize.sh --no-update-table # use existing tokens; do not create new ones
# hl7-sanitize.sh --table PATH # use a different lookup table
# hl7-sanitize.sh --rules-file PATH # override the default PHI rules
#
# PHI rule format (one per line):
# SEGMENT|FIELD_NUM|CATEGORY
# e.g. PID|3|MRN
#
# Subcommands (when first arg is one of these):
# show-rules print the default PHI-field rules
# show-table print the current lookup table (PHI in clear — local only)
# clear-table wipe the table (asks for confirmation)
# count number of token entries in the table
set -o pipefail
NC_SELF="$0"
LARRY_HOME="${LARRY_HOME:-$HOME/.larry}"
TABLE_DIR="$LARRY_HOME/sanitize"
DEFAULT_TABLE="$TABLE_DIR/lookup.tsv"
die() { printf 'hl7-sanitize: %s\n' "$*" >&2; exit 1; }
# ─────────────────────────────────────────────────────────────────────────────
# Default PHI rules — the standard set Bryan can override via --rules-file.
# Each rule: SEGMENT|FIELD|CATEGORY
# ─────────────────────────────────────────────────────────────────────────────
PHI_RULES_DEFAULT='# Patient identification
PID|2|ALTID
PID|3|MRN
PID|4|ALTID
PID|5|NAME
PID|6|NAME
PID|7|DOB
PID|8|SEX
PID|9|NAME
PID|11|ADDR
PID|12|REGION
PID|13|PHONE
PID|14|PHONE
PID|18|ACCT
PID|19|SSN
PID|20|LIC
PID|21|MRN
PID|29|DOD
PID|30|DEATHFLAG
# Patient visit
PV1|7|PROV
PV1|8|PROV
PV1|9|PROV
PV1|17|PROV
PV1|19|VISIT
PV1|50|VISIT
PV1|52|PROV
# Next of kin
NK1|2|NAME
NK1|3|RELATIONSHIP
NK1|4|ADDR
NK1|5|PHONE
NK1|6|PHONE
NK1|16|DOB
# Guarantor
GT1|3|NAME
GT1|4|NAME
GT1|5|ADDR
GT1|6|PHONE
GT1|7|PHONE
GT1|11|DOB
GT1|12|SSN
GT1|19|EMP
# Insurance
IN1|16|NAME
IN1|17|DOB
IN1|18|NAME
IN1|19|ADDR
IN1|20|SSN
IN1|36|INSPOL
IN1|49|INSID
# Observation / orders
OBR|10|PROV
OBR|16|PROV
OBR|32|PROV
OBX|16|PROV
DG1|3|DIAG
DG1|4|DIAG
# Order Common
ORC|10|PROV
ORC|12|PROV'
cmd_show_rules() { printf '%s\n' "$PHI_RULES_DEFAULT"; }
cmd_show_table() {
local table="${1:-$DEFAULT_TABLE}"
[ -f "$table" ] || { echo "no table at $table"; return 0; }
cat "$table"
}
cmd_clear_table() {
local table="${1:-$DEFAULT_TABLE}"
local yes=0; [ "${2:-}" = "--yes" ] && yes=1
[ -f "$table" ] || { echo "no table at $table"; return 0; }
if [ "$yes" != "1" ]; then
printf 'clear lookup table at %s? [y/N]: ' "$table"
read -r ans </dev/tty || ans=""
[[ "$ans" =~ ^[Yy]$ ]] || { echo "aborted"; return 1; }
fi
umask 077
printf 'token\tcategory\toriginal\n' > "$table"
chmod 600 "$table"
echo "cleared $table"
}
cmd_count() {
local table="${1:-$DEFAULT_TABLE}"
[ -f "$table" ] || { echo 0; return 0; }
echo $(($(wc -l < "$table") - 1))
}
# tokenize-value: take a single value (and optional category) and return the
# existing or newly-created token from the lookup table. Used by larry.sh to
# preprocess {{phi:...}} placeholders BEFORE sending to the API.
# --category X force category; default = search all categories for a match,
# fall back to "MANUAL".
cmd_tokenize_value() {
  local table="$DEFAULT_TABLE"
  local category=""
  local value=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --category) shift; category="$1" ;;
      --table)    shift; table="$1" ;;
      -h|--help)  echo "usage: tokenize-value [--category CAT] VALUE" >&2; exit 2 ;;
      *)          value="$1" ;;
    esac
    shift
  done
  [ -n "$value" ] || die "tokenize-value needs a VALUE"
  init_table "$table"
  # The value is handed to awk through the environment: awk -v would
  # interpret backslash escapes (HL7 values can contain sequences like \T\).
  # Appending "" forces a string comparison, so numeric-looking values such
  # as 0012345 and 12345 are not treated as equal.
  # Auto-category: search existing table for an exact value match.
  if [ -z "$category" ]; then
    local existing
    existing=$(PHI_VALUE="$value" awk -F'\t' 'NR>1 && ($3 "") == (ENVIRON["PHI_VALUE"] "") { print $1; exit }' "$table")
    if [ -n "$existing" ]; then
      printf '%s\n' "$existing"
      return 0
    fi
    category="MANUAL"
  fi
  # Look up (category, value) → existing token
  local tok
  tok=$(PHI_VALUE="$value" awk -F'\t' -v cat="$category" 'NR>1 && $2==cat && ($3 "") == (ENVIRON["PHI_VALUE"] "") { print $1; exit }' "$table")
  if [ -n "$tok" ]; then
    printf '%s\n' "$tok"
    return 0
  fi
  # Create new: next number is the highest existing one in this category + 1
  local nextnum
  nextnum=$(awk -F'\t' -v cat="$category" '
    NR>1 && $2==cat && match($1, /_[0-9]+\]\]$/) {
      n = substr($1, RSTART+1, RLENGTH-3) + 0
      if (n > max) max = n
    }
    END { print max+1 }
  ' "$table")
  tok=$(printf '[[%s_%04d]]' "$category" "$nextnum")
  umask 077
  printf '%s\t%s\t%s\n' "$tok" "$category" "$value" >> "$table"
  chmod 600 "$table" 2>/dev/null || true
  printf '%s\n' "$tok"
}
# detokenize-value: reverse of tokenize-value (looks up by token, returns original).
cmd_detokenize_value() {
local table="$DEFAULT_TABLE"
local token=""
while [ $# -gt 0 ]; do
case "$1" in
--table) shift; table="$1" ;;
-h|--help) echo "usage: detokenize-value TOKEN" >&2; exit 2 ;;
*) token="$1" ;;
esac
shift
done
[ -n "$token" ] || die "detokenize-value needs a TOKEN"
[ -f "$table" ] || die "no lookup table at $table"
local val
val=$(awk -F'\t' -v t="$token" 'NR>1 && $1==t { print $3; exit }' "$table")
if [ -z "$val" ]; then
printf 'no such token: %s\n' "$token" >&2
return 2
fi
printf '%s\n' "$val"
}
init_table() {
local table="$1"
mkdir -p "$(dirname "$table")" 2>/dev/null
chmod 700 "$(dirname "$table")" 2>/dev/null || true
if [ ! -f "$table" ]; then
umask 077
printf 'token\tcategory\toriginal\n' > "$table"
chmod 600 "$table"
fi
}
# ─────────────────────────────────────────────────────────────────────────────
# Main sanitize logic — single awk pass over HL7 messages.
# ─────────────────────────────────────────────────────────────────────────────
do_sanitize() {
local input_file="$1"
local rules="$2"
local table="$3"
local strict="$4"
local update_table="$5"
init_table "$table"
# Write rules to a temp file because awk -v can't carry newlines on macOS
local rules_tmp; rules_tmp=$(mktemp)
printf '%s\n' "$rules" > "$rules_tmp"
trap 'rm -f "$rules_tmp"' RETURN
local awk_script
awk_script=$(cat <<'AWK_END'
BEGIN {
ORS = "\r"
# During BEGIN, read auxiliary files with newline separator
RS = "\n"
while ((getline line < RULES_FILE) > 0) {
gsub(/^[[:space:]]+|[[:space:]]+$/, "", line)
if (line == "" || substr(line,1,1) == "#") continue
split(line, p, "|")
if (p[1] == "" || p[2] == "" || p[3] == "") continue
rule_cat[p[1] SUBSEP (p[2]+0)] = p[3]
has_seg[p[1]] = 1
}
close(RULES_FILE)
while ((getline tline < TABLE) > 0) {
if (tline ~ /^token\t/) continue
split(tline, c, "\t")
if (c[1] == "" || c[3] == "") continue
token_for[c[2] SUBSEP c[3]] = c[1]
if (match(c[1], /_[0-9]+\]\]$/)) {
num = substr(c[1], RSTART+1, RLENGTH-3) + 0
if (num > counter[c[2]]) counter[c[2]] = num
}
}
close(TABLE)
# Switch to CR for the main input (HL7 segments)
RS = "\r"
}
function tokenize(val, cat, key, t) {
if (val == "" || val == "\"\"") return val
key = cat SUBSEP val
if (key in token_for) return token_for[key]
counter[cat]++
t = sprintf("[[%s_%04d]]", cat, counter[cat])
token_for[key] = t
if (UPDATE_TABLE == "1") {
new_entries[++n_new] = t "\t" cat "\t" val
}
return t
}
{
# Each record (a segment, separated by \r). Bytes outside segments — like
# 0x1c message separators — get printed through unchanged if we just print.
if (length($0) < 3) { print; next }
seg = substr($0, 1, 3)
# Strict mode: any Z segment we have no rules for → tokenize the whole segment body
if (STRICT == "1" && substr(seg,1,1) == "Z" && !(seg in has_seg)) {
body = substr($0, 5) # skip "Zxx|"
if (body != "") {
tok = tokenize(body, "Z" substr(seg,2,2))
print seg "|" tok
} else {
print
}
next
}
if (!(seg in has_seg)) { print; next }
is_msh = (seg == "MSH")
nf = split($0, fields, "|")
# For MSH, fields[1] is "MSH", fields[2] is the encoding chars (MSH.2).
# MSH.N → fields[N] for N >= 2; MSH.1 is the separator char (skip).
# For others, SEG.N → fields[N+1].
for (k in rule_cat) {
split(k, kp, SUBSEP)
if (kp[1] != seg) continue
fnum = kp[2] + 0
cat = rule_cat[k]
idx = is_msh ? fnum : (fnum + 1)
if (idx < 1 || idx > nf) continue
fields[idx] = tokenize(fields[idx], cat)
}
line = fields[1]
for (j=2; j<=nf; j++) line = line "|" fields[j]
print line
}
END {
if (UPDATE_TABLE == "1" && n_new > 0) {
# Use explicit \n — ORS is \r for the main HL7 input loop, not for the table.
for (i=1; i<=n_new; i++) printf "%s\n", new_entries[i] >> TABLE
close(TABLE)
}
printf "hl7-sanitize: %d new token(s) created; %d total in %s\n", \
n_new, length(token_for), TABLE > "/dev/stderr"
}
AWK_END
)
if [ -n "$input_file" ]; then
awk -v RULES_FILE="$rules_tmp" -v TABLE="$table" -v STRICT="$strict" \
-v UPDATE_TABLE="$update_table" \
"$awk_script" "$input_file"
else
awk -v RULES_FILE="$rules_tmp" -v TABLE="$table" -v STRICT="$strict" \
-v UPDATE_TABLE="$update_table" \
"$awk_script" /dev/stdin
fi
}
# ─────────────────────────────────────────────────────────────────────────────
# Dispatch
# ─────────────────────────────────────────────────────────────────────────────
SUB="${1:-}"
case "$SUB" in
show-rules) shift; cmd_show_rules ;;
show-table) shift; cmd_show_table "$@" ;;
clear-table) shift; cmd_clear_table "$@" ;;
count) shift; cmd_count "$@" ;;
tokenize-value) shift; cmd_tokenize_value "$@" ;;
detokenize-value) shift; cmd_detokenize_value "$@" ;;
-h|--help) sed -n '2,30p' "$NC_SELF"; exit 0 ;;
*)
# Default = sanitize mode
input_file=""
rules="$PHI_RULES_DEFAULT"
table="$DEFAULT_TABLE"
strict=0
update_table=1
while [ $# -gt 0 ]; do
case "$1" in
--strict) strict=1 ;;
--no-update-table) update_table=0 ;;
--table) shift; table="$1" ;;
--rules-file) shift; [ -f "$1" ] || die "no such rules file: $1"; rules=$(cat "$1") ;;
-h|--help) sed -n '2,30p' "$NC_SELF"; exit 0 ;;
-*) die "unknown flag: $1" ;;
*) input_file="$1" ;;
esac
shift
done
do_sanitize "$input_file" "$rules" "$table" "$strict" "$update_table"
;;
esac