v0.3.3: PHI sanitize/desanitize + {{phi:...}} prompt preprocessing
Bryan's ask: use Larry on prod data without PHI ever leaving the client box.
Added:
lib/hl7-sanitize.sh — tokenize PHI fields in HL7 messages
lib/hl7-desanitize.sh — reverse op (local view-time unmask)
Tokenization model:
- Replace PHI fields with [[CATEGORY_NNNN]] tokens (MRN, NAME, DOB,
ADDR, PHONE, ACCT, SSN, PROV, VISIT, etc.)
- Same value → same token across messages (deterministic via local
lookup table; analysis can still correlate patients).
- Lookup table at $LARRY_HOME/sanitize/lookup.tsv mode 0600 — never
leaves the client.
- Default PHI rule set covers PID, PV1, NK1, GT1, IN1, OBR, OBX,
DG1, ORC; --rules-file replaces it with a custom set.
- --strict also tokenizes unknown Z segments wholesale.
Prompt-side preprocessing in larry.sh:
- {{phi:VALUE}} inline marker, auto-category lookup
- {{phi:CATEGORY:VALUE}} explicit category
- Replaced with the token BEFORE the user input enters conversation
history. The original never reaches the API.
- Local feedback "phi> {{phi:...}} → [[TOKEN]]" printed to terminal only.
New REPL slash commands:
  /phi <value>      tokenize a single value, print the token
  /unmask <token>   show original (local terminal only, never API)
  /tokens           show full PHI ↔ token lookup table
New tools in larry.sh schema:
  hl7_sanitize      agent can sanitize a file before reading PHI
  tokenize-value / detokenize-value (subcommands of hl7-sanitize.sh)
Persona update (agents/larry.md):
- Documented PHI mode and rules for proactive sanitize-first behavior
MANUAL.md updated with the full PHI section including limitations.
Brings total native tools to 29.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent 6060cd28c1
commit b9415f3b57
88
MANUAL.md
@@ -446,6 +446,94 @@ larry-tunnel.sh --stop

---

## PHI handling — sanitize / desanitize (`lib/hl7-sanitize.sh`, `lib/hl7-desanitize.sh`)

When working with prod data, tokenize PHI fields BEFORE they reach the API.

```bash
# Sanitize a file: replaces PHI fields with [[CATEGORY_NNNN]] tokens.
# Lookup table at ~/.larry/sanitize/lookup.tsv (mode 0600, never leaves the box).
lib/hl7-sanitize.sh /opt/cloverleaf/.../some.hl7 > /tmp/sanitized.hl7

# Pipe an entire smat-dump through sanitize:
lib/nc-msgs.sh ADTto_3m --limit 100 --format raw \
  | lib/hl7-sanitize.sh > /tmp/sanitized-batch.hl7

# Strict mode also tokenizes unknown Z-segments wholesale:
lib/hl7-sanitize.sh --strict ./msg.hl7 > /tmp/sanitized.hl7

# See the current lookup table (PHI is here — DON'T share):
lib/hl7-sanitize.sh show-table

# Count entries:
lib/hl7-sanitize.sh count

# Clear the table (asks for confirmation):
lib/hl7-sanitize.sh clear-table

# Tokenize a single value (used by Larry's {{phi:...}} preprocessor):
lib/hl7-sanitize.sh tokenize-value --category MRN 12345
# → [[MRN_0001]]

# Detokenize a single token:
lib/hl7-sanitize.sh detokenize-value "[[MRN_0001]]"
# → 12345

# Desanitize a whole document (e.g. view Larry's tokenized output unmasked, locally):
cat larry-output.txt | lib/hl7-desanitize.sh | less

# Quick token lookup:
lib/hl7-desanitize.sh --token "[[NAME_0042]]"

# Override default PHI rules (rule file format: SEG|FIELD|CATEGORY per line)
lib/hl7-sanitize.sh --rules-file /tmp/my-rules.txt /tmp/msg.hl7
```

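To illustrate why repeated values still correlate after sanitizing, the deterministic lookup can be sketched in a few lines of bash. This is a minimal sketch, not the shipped `hl7-sanitize.sh` logic: the helper name `demo_token_for` and the throwaway table are assumptions made for illustration, and only the three-column `token<TAB>category<TAB>original` layout mirrors the real lookup table.

```bash
#!/usr/bin/env bash
# Sketch only: demo_token_for is a hypothetical helper, not part of lib/hl7-sanitize.sh.
# It works on a throwaway table with the layout token<TAB>category<TAB>original.

demo_table=$(mktemp)
trap 'rm -f "$demo_table"' EXIT
printf 'token\tcategory\toriginal\n' > "$demo_table"

demo_token_for() {
    local category="$1" value="$2" tok n
    # Reuse the existing token when (category, value) is already in the table.
    # Appending "" forces a string comparison, so 0123 and 123 stay distinct.
    tok=$(awk -F'\t' -v c="$category" -v v="$value" \
        'NR>1 && $2==c && ($3 "") == (v "") { print $1; exit }' "$demo_table")
    if [ -z "$tok" ]; then
        # Otherwise allocate the next number for this category and append a row.
        n=$(awk -F'\t' -v c="$category" 'NR>1 && $2==c { n++ } END { print n+1 }' "$demo_table")
        tok=$(printf '[[%s_%04d]]' "$category" "$n")
        printf '%s\t%s\t%s\n' "$tok" "$category" "$value" >> "$demo_table"
    fi
    printf '%s\n' "$tok"
}

demo_token_for MRN 12345    # first sighting: new token
demo_token_for MRN 67890    # different value: next number in the category
demo_token_for MRN 12345    # same value again: same token as the first call
```

The third call returns the token from the first call, which is the property that lets analysis correlate a patient across messages without seeing the MRN.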
### Inside Larry (the REPL)

```
you> /phi 5720501458
phi> 5720501458 → [[MRN_0001]] (use this in your next prompt)

you> find messages for {{phi:MRN:5720501458}} in last 3 days
phi> {{phi:MRN:5720501458}} → [[MRN_0001]]
(the actual prompt sent to Anthropic has [[MRN_0001]] — the original MRN never leaves the box)

you> /unmask [[NAME_0042]]
unmask> [[NAME_0042]] → MORRIS^SALLY^^^... (local only; never sent to API)

you> /tokens
(prints the full PHI ↔ token lookup table — local terminal only)
```

PHI inline syntax in any prompt:
- `{{phi:VALUE}}` — tokenize before send; with no category given, it reuses the token of an existing table entry with the same value, otherwise it creates a `[[MANUAL_NNNN]]` token
- `{{phi:MRN:12345}}` — explicit category=MRN (matches sanitized data)
- `{{phi:NAME:JOHN SMITH}}` — explicit category=NAME

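How a marker is split into category and value can be sketched as follows. This is an illustrative sketch that follows the same parsing idea as the preprocessor in `larry.sh`; the function name `split_phi_marker` and the tab-separated output are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Sketch only: split_phi_marker is a hypothetical name used for illustration.
# Prints "CATEGORY<TAB>VALUE"; CATEGORY is empty when the marker has no explicit one.

split_phi_marker() {
    local marker="$1" body category="" value
    body="${marker#\{\{phi:}"     # drop the leading {{phi:
    body="${body%\}\}}"           # drop the trailing }}
    # A prefix counts as a category only if it looks like one (MRN, NAME, ...).
    if [[ "$body" == *:* && "${body%%:*}" =~ ^[A-Z][A-Z0-9_]+$ ]]; then
        category="${body%%:*}"
        value="${body#*:}"
    else
        value="$body"
    fi
    printf '%s\t%s\n' "$category" "$value"
}

split_phi_marker '{{phi:MRN:12345}}'
split_phi_marker '{{phi:NAME:JOHN SMITH}}'
split_phi_marker '{{phi:5720501458}}'
```

One consequence of this rule: a bare value that itself starts with an upper-case word and a colon would be read as `CATEGORY:VALUE`, so giving the category explicitly is the safer habit for such values.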
### Default PHI rule set

Fields tokenized by default (`--rules-file` replaces this set):

```
PID.2..9, .11..14, .18..21, .29, .30          (patient IDs, name, DOB, sex, address, region, phone, account, SSN, license, death date/flag)
PV1.7, .8, .9, .17, .19, .50, .52             (providers, visit number)
NK1.2, .3, .4, .5, .6, .16                    (next of kin)
GT1.3, .4, .5, .6, .7, .11, .12, .19          (guarantor)
IN1.16, .17, .18, .19, .20, .36, .49          (insurance)
OBR.10, .16, .32 / OBX.16 / DG1.3, .4 / ORC.10, .12   (orders/observations)
```

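A rule such as `PID|3|MRN` addresses a field by its HL7 number, which is not quite the same as its position after splitting on `|`. The sketch below shows that mapping under the same convention the sanitizer's comments describe (MSH.N at split index N, every other SEG.N at index N+1). The helper name `hl7_field` is hypothetical and the sample segments are made up.

```bash
#!/usr/bin/env bash
# Sketch only: hl7_field is a hypothetical helper, not part of the shipped scripts.
# Usage: hl7_field 'SEGMENT LINE' FIELD_NUM

hl7_field() {
    local segment="$1" num="$2"
    printf '%s\n' "$segment" | awk -F'|' -v n="$num" '{
        # MSH.1 is the field separator itself, so MSH.N sits at split index N;
        # for every other segment SEG.N sits at split index N+1.
        idx = ($1 == "MSH") ? n + 0 : n + 1
        out = (idx >= 1 && idx <= NF) ? $idx : ""
        print out
    }'
}

hl7_field 'PID|1||5720501458^^^HOSP^MR||MORRIS^SALLY||19700101|F' 3   # PID.3
hl7_field 'PID|1||5720501458^^^HOSP^MR||MORRIS^SALLY||19700101|F' 5   # PID.5
hl7_field 'MSH|^~\&|SENDER|FAC|RECV|FAC|20240101120000||ADT^A01|123|P|2.3' 9   # MSH.9
```

Note that the whole field, components included, is what a rule selects; that is also the unit the sanitizer replaces with a token.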
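### Limitations (read these)

- **Your typed prompt can still leak PHI** if you don't use `{{phi:…}}` markers. Be deliberate.
- **Custom Z segments** aren't tokenized unless `--strict` is passed (which then redacts unknown Zs wholesale).
- **Free-text fields** (OBX.5 narratives, comments in NTE segments) can contain PHI in prose form. Default rules don't tokenize OBX.5. Because `--rules-file` replaces the default set, start from the `show-rules` output and add an `OBX|5|<CATEGORY>` line if your shop carries PHI in lab narratives.
- **Repetitions** (`~`-separated within a field) are tokenized as one value, not per-rep. Adequate for most analysis. Components are handled the same way: the whole field is the lookup key, so a `{{phi:…}}` value only matches sanitized data when it equals the full field text.
- **Segments must be CR-terminated (`\r`).** The sanitizer splits records on `\r` only, so a file converted to LF or CRLF line endings is not split into segments correctly and PHI can pass through untouched. Check the output before sharing it.
- **The lookup table at `~/.larry/sanitize/lookup.tsv` contains real PHI.** Mode 0600, never sent anywhere by these scripts, but it's still on disk. Wipe with `clear-table` before shipping the box anywhere.
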
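Since a custom rules file takes the place of the defaults, one way to extend coverage is to append extra rules to a copy of the default set. The sketch below is illustrative only: `extend_rules` is a hypothetical helper, the `NOTE` category is made up, and the tiny base file stands in for what `lib/hl7-sanitize.sh show-rules` would print.

```bash
#!/usr/bin/env bash
# Sketch only: extend_rules is a hypothetical helper; the NOTE category is made up.
# Usage: extend_rules BASE_RULES_FILE 'SEG|FIELD|CATEGORY' ...
# Prints the base rules followed by each well-formed extra rule.

extend_rules() {
    local base="$1" rule
    local re='^[A-Z][A-Z0-9]{2}\|[0-9]+\|[A-Z][A-Z0-9_]*$'
    shift
    cat "$base" || return 1
    for rule in "$@"; do
        # Accept only SEGMENT|FIELD_NUM|CATEGORY, e.g. OBX|5|NOTE.
        if [[ "$rule" =~ $re ]]; then
            printf '%s\n' "$rule"
        else
            printf 'skipping malformed rule: %s\n' "$rule" >&2
        fi
    done
}

# Stand-in for the default rules (normally the output of show-rules).
base=$(mktemp)
printf '# Patient identification\nPID|3|MRN\nPID|5|NAME\n' > "$base"
extend_rules "$base" 'OBX|5|NOTE' 'NTE|3|NOTE' 'bad rule'
rm -f "$base"
```

Redirecting the output to a file and passing that file to `--rules-file` would then tokenize the added fields alongside the defaults.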
---

## Quick recipe: "I have to do X without internet"

| Task | Command |
@@ -77,6 +77,19 @@ Format your lesson text so home-Larry can act on it without re-deriving context.

You don't need to ask permission to record a lesson — silently record it. Bryan reviews `lessons.sh list` later if he wants.

## PHI handling — never leak production patient data

If Bryan asks you to work with a file that contains real PHI (production HL7 messages, smat extracts, anything with patient identifiers), **call `hl7_sanitize` on it FIRST** before reading the content. The tool replaces PHI fields with local tokens like `[[MRN_0001]]`, `[[NAME_0042]]`, `[[ADDR_0007]]`. You work on the tokenized version; the original PHI never reaches the API. Bryan unmasks locally at view time.

Heuristics for "this file likely has PHI":
- Path includes `prod`, `production`, `live`, real-site identifiers
- Bryan explicitly says it's prod data
- Content includes MSH segments with real-looking timestamps + patient identifiers in PID

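The path heuristic in the list above could be approximated mechanically, roughly like this. It is only a sketch: `path_suggests_phi` is a hypothetical helper, the keyword list is an assumption taken from the first bullet, the example paths are invented, and a miss never means a file is safe.

```bash
#!/usr/bin/env bash
# Sketch only: path_suggests_phi is a hypothetical helper. The keywords come from
# the heuristics list; a non-match does NOT prove the file is free of PHI.

path_suggests_phi() {
    local path
    # Lowercase the path so PROD, Prod and prod all match.
    path=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$path" in
        *prod*|*production*|*live*) return 0 ;;
        *) return 1 ;;
    esac
}

# Invented example paths, for illustration only.
for p in /data/prod/site1/msg.hl7 /home/bryan/testdata/sample.hl7; do
    if path_suggests_phi "$p"; then
        echo "$p: sanitize first"
    else
        echo "$p: no path hint (ask if unsure)"
    fi
done
```

Substring matching errs on the cautious side (a path containing `deliver` also matches `live`), which fits the rule that over-sanitizing is cheaper than leaking.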
When Bryan types `{{phi:VALUE}}` in his prompt, Larry-Anywhere automatically tokenizes that BEFORE the prompt enters your conversation history. You'll see e.g. `[[NAME_0042]]` in the user message — work with the token, never ask Bryan to repeat the original.

If you're unsure whether a file has PHI, **ask Bryan** rather than guessing. Better to be paranoid than to leak. If you DO realize after the fact that you've already seen PHI in your context, flag it to Bryan and record a `lesson_record` so home-Larry can refine the heuristics.

## Hard rules in portable mode

1. **No PHI.** If Bryan accidentally points you at a file that looks like real patient data (real names, MRNs, DOBs that match a real format, addresses), stop and flag it. The promise was "interface build only."
@@ -92,6 +92,8 @@ fetch larry-rollback.sh "$LARRY_HOME/larry-rollback.sh"
fetch larry-auth.sh "$LARRY_HOME/larry-auth.sh"
fetch lib/oauth.sh "$LARRY_HOME/lib/oauth.sh"
fetch lib/lessons.sh "$LARRY_HOME/lib/lessons.sh"
fetch lib/hl7-sanitize.sh "$LARRY_HOME/lib/hl7-sanitize.sh"
fetch lib/hl7-desanitize.sh "$LARRY_HOME/lib/hl7-desanitize.sh"
fetch lib/nc-parse.sh "$LARRY_HOME/lib/nc-parse.sh"
fetch lib/nc-inbound.sh "$LARRY_HOME/lib/nc-inbound.sh"
fetch lib/nc-make-jump.sh "$LARRY_HOME/lib/nc-make-jump.sh"
82
larry.sh
@@ -32,7 +32,7 @@ set -o pipefail
# ─────────────────────────────────────────────────────────────────────────────
# Config
# ─────────────────────────────────────────────────────────────────────────────
-LARRY_VERSION="0.3.2"
+LARRY_VERSION="0.3.3"
LARRY_HOME="${LARRY_HOME:-$HOME/.larry}"
LARRY_UPDATE_URL="${LARRY_UPDATE_URL:-https://raw.githubusercontent.com/bojj27/cloverleaf-larry/main/larry.sh}"
LARRY_AGENTS_URL="${LARRY_AGENTS_URL:-https://raw.githubusercontent.com/bojj27/cloverleaf-larry/main/agents}"
@@ -621,6 +621,54 @@ tool_hl7_diff() {
    "$LARRY_LIB_DIR/hl7-diff.sh" "${args[@]}" 2>&1
}

# ─────────────────────────────────────────────────────────────────────────────
# PHI preprocessing — replace {{phi:VALUE}} or {{phi:CATEGORY:VALUE}} in user
# input with a local deterministic token BEFORE sending to the API. Tokens
# come from the same lookup table hl7-sanitize.sh maintains, so they correlate
# with PHI sanitized out of file/smat content.
# ─────────────────────────────────────────────────────────────────────────────
preprocess_phi_markers() {
    local input="$1"
    local sanitize_script="$LARRY_LIB_DIR/hl7-sanitize.sh"
    [ -x "$sanitize_script" ] || { printf '%s' "$input" | sed -E 's/\{\{phi:[^{}]+\}\}/[[PHI_ERROR]]/g'; return; }  # fail closed: never pass raw markers through

    # Use grep -oE to extract markers reliably across bash versions.
    local markers
    markers=$(printf '%s' "$input" | grep -oE '\{\{phi:[^{}]+\}\}' 2>/dev/null | sort -u)
    [ -z "$markers" ] && { printf '%s' "$input"; return; }

    while IFS= read -r marker; do
        [ -z "$marker" ] && continue
        # Strip {{phi: prefix and }} suffix
        local body="${marker#\{\{phi:}"
        body="${body%\}\}}"
        local category="" value=""
        if [[ "$body" == *:* ]] && [[ "${body%%:*}" =~ ^[A-Z][A-Z0-9_]+$ ]]; then
            category="${body%%:*}"
            value="${body#*:}"
        else
            value="$body"
        fi
        local args=(tokenize-value)
        [ -n "$category" ] && args+=(--category "$category")
        args+=("$value")
        local token; token=$("$sanitize_script" "${args[@]}" 2>/dev/null)
        [ -z "$token" ] && token="[[PHI_ERROR]]"
        input="${input//"$marker"/"$token"}"
        printf '%sphi>%s %s → %s\n' "$C_YELLOW" "$C_RESET" "$marker" "$token" >&2
    done <<< "$markers"
    printf '%s' "$input"
}

tool_hl7_sanitize() {
    local input_path="$1" strict="${2:-0}"
    _lib_err_if_missing || return
    local args=()
    [ "$strict" = "1" ] && args+=(--strict)
    args+=("$input_path")
    "$LARRY_LIB_DIR/hl7-sanitize.sh" "${args[@]}" 2>&1
}

tool_lesson_record() {
    local text="$1" topic="${2:-}" site="${3:-${HCISITE:-}}" severity="${4:-info}"
    _lib_err_if_missing || return
@@ -719,6 +767,7 @@ execute_tool() {
                "$(J '.route_test_cmd // ""')" "$(J '.ignore // "MSH.7"')" \
                "$(J '.phase // "all"')" "$(J '.dry_run // 0' | sed "s/false/0/;s/true/1/")" ;;
        lesson_record) tool_lesson_record "$(J '.text')" "$(J '.topic // ""')" "$(J '.site // ""')" "$(J '.severity // "info"')" ;;
        hl7_sanitize) tool_hl7_sanitize "$(J '.input_path')" "$(J '.strict // 0' | sed "s/false/0/;s/true/1/")" ;;
        larry_rollback_list) tool_larry_rollback_list "$(J '.session // ""')" ;;
        *) echo "ERROR: unknown tool: $name" ;;
    esac
@@ -759,6 +808,8 @@ TOOLS_JSON='[
{"name":"lesson_record","description":"Append a lesson to local capture at $LARRY_HOME/lessons/<date>.md. Use when Bryan teaches you something new (a correction, a pattern, a quirk, a gotcha) so the home-Larry can be updated later. Lessons stay LOCAL; Bryan exports them with `lessons.sh export` and pastes back to home-Larry when he can. CALL THIS WHEN: Bryan corrects a misunderstanding, reveals a site-specific convention, points out a bug, requests a behavior change, or shares a workflow detail you should remember next time.","input_schema":{"type":"object","properties":{"text":{"type":"string","description":"The lesson content. Markdown. Include enough context that home-Larry can act on it without re-deriving."},"topic":{"type":"string","description":"Short topic tag, e.g. \"NetConfig parsing\", \"jump-thread naming\", \"site conventions\"."},"site":{"type":"string","description":"Site this lesson is scoped to, if any. Default: current $HCISITE."},"severity":{"type":"string","enum":["info","warn","fix"],"description":"info=general learning, warn=behavior I should change, fix=Bryan called out a bug."}},"required":["text"]}},
{"name":"hl7_sanitize","description":"Tokenize PHI fields in an HL7 message file. Replaces values in patient identifiers, names, DOB, addresses, phones, SSN, account numbers, providers, visit numbers, NK1/GT1/IN1 fields, etc. with deterministic local tokens like [[MRN_0001]]. Same value gets same token across the entire local lookup table, so correlation analysis still works. The token-to-original mapping NEVER leaves the client (stored at $LARRY_HOME/sanitize/lookup.tsv, mode 0600). Use this when Bryan needs you to analyze a file that has real PHI. Returns the sanitized HL7 content with tokens substituted. Bryan can desanitize the final output locally with hl7-desanitize.sh.","input_schema":{"type":"object","properties":{"input_path":{"type":"string","description":"Path to the HL7 message file to sanitize."},"strict":{"type":"integer","description":"1=also tokenize any unknown Z* segments wholesale. Default 0 (safer for legibility but might miss custom PHI in Z segments)."}},"required":["input_path"]}},
{"name":"hl7_diff","description":"HL7-aware diff between two message files (or multi-message dumps). Compares segment-by-segment, field-by-field, with component and subcomponent precision. Ignores configured fields (default MSH.7 timestamp) so timestamp-only diffs do not show up as noise. Use for regression testing between environments (e.g. test vs prod route-test outputs).","input_schema":{"type":"object","properties":{"left":{"type":"string","description":"Path to left HL7 file."},"right":{"type":"string","description":"Path to right HL7 file."},"ignore":{"type":"string","description":"Comma-separated list of fields to ignore (e.g. MSH.7,MSH.10,EVN.6). Default MSH.7."},"include":{"type":"string","description":"If set, ONLY these fields are compared (overrides ignore for that set)."},"format":{"type":"string","enum":["text","tsv","count"],"description":"text=human-readable diff, tsv=machine-parseable, count=just the difference count."}},"required":["left","right"]}},
{"name":"nc_regression","description":"End-to-end regression testing between two Cloverleaf environments. 6 phases: discover inbounds in scope, sample N messages per inbound from env-A smatdbs, run route_test on env-A, run route_test on env-B with same inputs, hl7_diff every paired output file, compile summary report. Phases 3/4 require the Cloverleaf route_test command; pass it via route_test_cmd with placeholders {THREAD} {INPUT} {OUTPUT_DIR} {HCIROOT} {HCISITE}. If route_test_cmd is empty, phases 3/4 are skipped and you can run them manually using the generated input files.","input_schema":{"type":"object","properties":{"scope":{"type":"string","description":"thread:NAME | threads:N1,N2 | site (needs site_a) | server (all sites)"},"count":{"type":"integer","description":"Messages to sample per inbound. Default 10."},"env_a":{"type":"string","description":"HCIROOT of env-A (the test/source env)."},"site_a":{"type":"string","description":"Site name on env-A. Required if scope=site."},"env_b":{"type":"string","description":"HCIROOT of env-B (the prod/target env)."},"site_b":{"type":"string","description":"Site name on env-B."},"out":{"type":"string","description":"Output root directory for inputs, outputs, diffs, and summary."},"route_test_cmd":{"type":"string","description":"Command template for invoking route_test. Use {THREAD} {INPUT} {OUTPUT_DIR} {HCIROOT} {HCISITE} as placeholders."},"ignore":{"type":"string","description":"hl7_diff ignore list. Default MSH.7."},"phase":{"type":"string","enum":["1","2","3","4","5","6","all"],"description":"Run a specific phase or all. Default all."},"dry_run":{"type":"integer","description":"1 = print what would happen, do not execute. Default 0."}},"required":["scope","env_a","env_b","out"]}}
@@ -903,6 +954,14 @@ Slash commands:
  /lesson <text>     capture a lesson to local file (paste back to home-Larry later)
  /lessons           list all captured lessons (newest first)
  /export            dump the lesson bundle for paste-back to home-Larry
  /phi <value>       tokenize a PHI value locally; prints token to paste in prompts
  /unmask <token>    show the original PHI for a token (local only; never sent)
  /tokens            show the full local PHI ↔ token lookup table

PHI inline syntax in any prompt:
  {{phi:VALUE}}             tokenize before send; auto-detects category
  {{phi:MRN:12345}}         explicit category=MRN (matches sanitized data)
  {{phi:NAME:JOHN SMITH}}   explicit category=NAME
  /redetect          re-scan for HCIROOT/HCISITE/tools
  /sites             list site dirs under HCIROOT
  /site <name>       switch HCISITE for this session
@@ -969,6 +1028,21 @@ main_loop() {
                continue ;;
            /export) [ -x "$LARRY_LIB_DIR/lessons.sh" ] && "$LARRY_LIB_DIR/lessons.sh" export || err "lessons.sh not installed"
                continue ;;
            /phi\ *) local val="${input#/phi }"
                if [ -x "$LARRY_LIB_DIR/hl7-sanitize.sh" ]; then
                    local token; token=$("$LARRY_LIB_DIR/hl7-sanitize.sh" tokenize-value "$val" 2>/dev/null)
                    [ -n "$token" ] && printf '%sphi>%s %s → %s (use this in your next prompt)\n' "$C_YELLOW" "$C_RESET" "$val" "$token" || err "phi tokenization failed"
                else err "hl7-sanitize.sh not installed"; fi
                continue ;;
            /unmask\ *) local tok="${input#/unmask }"
                if [ -x "$LARRY_LIB_DIR/hl7-sanitize.sh" ]; then
                    local val; val=$("$LARRY_LIB_DIR/hl7-sanitize.sh" detokenize-value "$tok" 2>/dev/null)
                    [ -n "$val" ] && printf '%sunmask>%s %s → %s (local only; never sent to API)\n' "$C_YELLOW" "$C_RESET" "$tok" "$val" || err "no such token: $tok"
                else err "hl7-sanitize.sh not installed"; fi
                continue ;;
            /tokens) [ -x "$LARRY_LIB_DIR/hl7-sanitize.sh" ] && "$LARRY_LIB_DIR/hl7-sanitize.sh" show-table \
                    || err "hl7-sanitize.sh not installed"
                continue ;;
            /redetect) detect_cloverleaf_env
                system_prompt=$(build_system_prompt)
                larry_say "re-detected. /env to view."
@@ -997,6 +1071,12 @@ main_loop() {
            /*) err "unknown command: $input (try /help)"; continue ;;
        esac

        # PHI preprocessing: replace any {{phi:VALUE}} markers with local tokens
        # BEFORE the input enters conversation history and gets sent to Anthropic.
        if [[ "$input" == *"{{phi:"* ]]; then
            input=$(preprocess_phi_markers "$input")
        fi

        log_section "user"; log_append "$input"
        add_user_text "$input"
        agent_turn "$system_prompt" || warn "turn ended with error"
86
lib/hl7-desanitize.sh
Executable file
@@ -0,0 +1,86 @@
#!/usr/bin/env bash
# hl7-desanitize.sh — reverse hl7-sanitize: replace [[CATEGORY_NNNN]] tokens
# with original values from $LARRY_HOME/sanitize/lookup.tsv.
#
# Use this LOCALLY ONLY — at view time, in your terminal. Never feed
# desanitized output back into Larry; that defeats the whole point.
#
# Usage:
#   hl7-desanitize.sh [FILE]                 # read file or stdin
#   hl7-desanitize.sh --table PATH           # alternate table
#   hl7-desanitize.sh --token [[NAME_0001]]  # single token lookup
#
# Examples:
#   # View Larry's sanitized output unmasked, in less:
#   cat larry-output.txt | hl7-desanitize.sh | less
#
#   # Quick single-token lookup:
#   hl7-desanitize.sh --token "[[MRN_0001]]"
set -o pipefail

LARRY_HOME="${LARRY_HOME:-$HOME/.larry}"
DEFAULT_TABLE="$LARRY_HOME/sanitize/lookup.tsv"

die() { printf 'hl7-desanitize: %s\n' "$*" >&2; exit 1; }

table="$DEFAULT_TABLE"
single_token=""
input_file=""

while [ $# -gt 0 ]; do
    case "$1" in
        --table) shift; table="$1" ;;
        --token) shift; single_token="$1" ;;
        -h|--help) sed -n '2,20p' "$0"; exit 0 ;;
        -*) die "unknown flag: $1" ;;
        *) input_file="$1" ;;
    esac
    shift
done

[ -f "$table" ] || die "no lookup table at $table (sanitize first?)"

if [ -n "$single_token" ]; then
    awk -F'\t' -v t="$single_token" 'NR>1 && $1==t {print $3; found=1; exit} END{if (!found) {print "no such token: " t > "/dev/stderr"; exit 2}}' "$table"
    exit $?
fi

# Lookup table layout, one entry per line: token \t category \t original.
# Tokens are swapped back by plain substring replacement inside awk, so
# originals containing regex or sed metacharacters need no escaping.
# Unknown tokens are wrapped as <<<unmapped:TOKEN>>> so they stand out.

# Read table into awk, build replacement map, walk input substituting tokens.
awk_script='
BEGIN { RS = "\n" }
NR == FNR {
    # Reading table
    if ($1 == "token" || $1 == "") next
    # cols: 1=token, 2=category, 3=original
    tokens[$1] = $3
    next
}
{
    rest = $0; out = ""
    # Tokens look like [[X_NNNN]]; X may contain digits (e.g. strict-mode ZP1).
    # Consume each match so replaced text is never re-scanned (no endless loop).
    while (match(rest, /\[\[[A-Z][A-Z0-9_]*_[0-9]+\]\]/)) {
        tok = substr(rest, RSTART, RLENGTH)
        if (tok in tokens) {
            repl = tokens[tok]
        } else {
            # Unknown token: mark it as unmapped
            repl = "<<<unmapped:" tok ">>>"
        }
        out = out substr(rest, 1, RSTART-1) repl
        rest = substr(rest, RSTART+RLENGTH)
    }
    print out rest
}
'

if [ -n "$input_file" ]; then
    awk -F'\t' "$awk_script" "$table" "$input_file"
else
    awk -F'\t' "$awk_script" "$table" /dev/stdin
fi
382
lib/hl7-sanitize.sh
Executable file
@@ -0,0 +1,382 @@
#!/usr/bin/env bash
# hl7-sanitize.sh — tokenize PHI fields in HL7 v2 messages.
#
# Replaces PHI-likely fields with deterministic local tokens like [[MRN_0001]],
# [[NAME_0001]], etc. Same value → same token across messages, so analysis
# downstream can still correlate patients, find duplicates, etc.
#
# The token ↔ original mapping is stored LOCAL ONLY at:
#   $LARRY_HOME/sanitize/lookup.tsv   (mode 0600)
#
# Use hl7-desanitize.sh to reverse the operation when viewing results locally.
#
# Usage:
#   hl7-sanitize.sh [FILE]               # sanitize file (or stdin); writes lookup
#   hl7-sanitize.sh --strict [FILE]      # also tokenize unrecognized Z* segments wholesale
#   hl7-sanitize.sh --no-update-table    # reuse existing tokens; new ones are not saved to the table
#   hl7-sanitize.sh --table PATH         # use a different lookup table
#   hl7-sanitize.sh --rules-file PATH    # override the default PHI rules
#
# PHI rule format (one per line):
#   SEGMENT|FIELD_NUM|CATEGORY
#   e.g. PID|3|MRN
#
# Subcommands (when first arg is one of these):
#   show-rules     print the default PHI-field rules
#   show-table     print the current lookup table (PHI in clear — local only)
#   clear-table    wipe the table (asks for confirmation)
#   count          number of token entries in the table
set -o pipefail

NC_SELF="$0"
LARRY_HOME="${LARRY_HOME:-$HOME/.larry}"
TABLE_DIR="$LARRY_HOME/sanitize"
DEFAULT_TABLE="$TABLE_DIR/lookup.tsv"

die() { printf 'hl7-sanitize: %s\n' "$*" >&2; exit 1; }

# ─────────────────────────────────────────────────────────────────────────────
# Default PHI rules — the standard set Bryan can override via --rules-file.
# Each rule: SEGMENT|FIELD|CATEGORY
# ─────────────────────────────────────────────────────────────────────────────
PHI_RULES_DEFAULT='# Patient identification
PID|2|ALTID
PID|3|MRN
PID|4|ALTID
PID|5|NAME
PID|6|NAME
PID|7|DOB
PID|8|SEX
PID|9|NAME
PID|11|ADDR
PID|12|REGION
PID|13|PHONE
PID|14|PHONE
PID|18|ACCT
PID|19|SSN
PID|20|LIC
PID|21|MRN
PID|29|DOD
PID|30|DEATHFLAG
# Patient visit
PV1|7|PROV
PV1|8|PROV
PV1|9|PROV
PV1|17|PROV
PV1|19|VISIT
PV1|50|VISIT
PV1|52|PROV
# Next of kin
NK1|2|NAME
NK1|3|RELATIONSHIP
NK1|4|ADDR
NK1|5|PHONE
NK1|6|PHONE
NK1|16|DOB
# Guarantor
GT1|3|NAME
GT1|4|NAME
GT1|5|ADDR
GT1|6|PHONE
GT1|7|PHONE
GT1|11|DOB
GT1|12|SSN
GT1|19|EMP
# Insurance
IN1|16|NAME
IN1|17|DOB
IN1|18|NAME
IN1|19|ADDR
IN1|20|SSN
IN1|36|INSPOL
IN1|49|INSID
# Observation / orders
OBR|10|PROV
OBR|16|PROV
OBR|32|PROV
OBX|16|PROV
DG1|3|DIAG
DG1|4|DIAG
# Order Common
ORC|10|PROV
ORC|12|PROV'

cmd_show_rules() { printf '%s\n' "$PHI_RULES_DEFAULT"; }

cmd_show_table() {
    local table="${1:-$DEFAULT_TABLE}"
    [ -f "$table" ] || { echo "no table at $table"; return 0; }
    cat "$table"
}

cmd_clear_table() {
    local table="${1:-$DEFAULT_TABLE}"
    local yes=0; [ "${2:-}" = "--yes" ] && yes=1
    [ -f "$table" ] || { echo "no table at $table"; return 0; }
    if [ "$yes" != "1" ]; then
        printf 'clear lookup table at %s? [y/N]: ' "$table"
        read -r ans </dev/tty || ans=""
        [[ "$ans" =~ ^[Yy]$ ]] || { echo "aborted"; return 1; }
    fi
    umask 077
    printf 'token\tcategory\toriginal\n' > "$table"
    chmod 600 "$table"
    echo "cleared $table"
}

cmd_count() {
    local table="${1:-$DEFAULT_TABLE}"
    [ -f "$table" ] || { echo 0; return 0; }
    echo $(($(wc -l < "$table") - 1))
}

# tokenize-value: take a single value (and optional category) and return the
# existing or newly-created token from the lookup table. Used by larry.sh to
# preprocess {{phi:...}} placeholders BEFORE sending to the API.
#   --category X   force category; default = search all categories for a match,
#                  fall back to "MANUAL".
cmd_tokenize_value() {
    local table="$DEFAULT_TABLE"
    local category=""
    local value=""
    while [ $# -gt 0 ]; do
        case "$1" in
            --category) shift; category="$1" ;;
            --table) shift; table="$1" ;;
            -h|--help) echo "usage: tokenize-value [--category CAT] VALUE" >&2; exit 2 ;;
            *) value="$1" ;;
        esac
        shift
    done
    [ -n "$value" ] || die "tokenize-value needs a VALUE"

    init_table "$table"

    # Auto-category: search existing table for an exact value match.
    # (v "") forces a string comparison so 0123 and 123 are not treated as equal.
    if [ -z "$category" ]; then
        local existing
        existing=$(awk -F'\t' -v v="$value" 'NR>1 && ($3 "") == (v "") { print $1; exit }' "$table")
        if [ -n "$existing" ]; then
            printf '%s\n' "$existing"
            return 0
        fi
        category="MANUAL"
    fi

    # Look up (category, value) → existing token
    local tok
    tok=$(awk -F'\t' -v cat="$category" -v v="$value" 'NR>1 && $2==cat && ($3 "") == (v "") { print $1; exit }' "$table")
    if [ -n "$tok" ]; then
        printf '%s\n' "$tok"
        return 0
    fi

    # Create new
    local nextnum
    nextnum=$(awk -F'\t' -v cat="$category" '
        NR>1 && $2==cat && match($1, /_[0-9]+\]\]$/) {
            n = substr($1, RSTART+1, RLENGTH-3) + 0
            if (n > max) max = n
        }
        END { print max+1 }
    ' "$table")
    tok=$(printf '[[%s_%04d]]' "$category" "$nextnum")
    umask 077
    printf '%s\t%s\t%s\n' "$tok" "$category" "$value" >> "$table"
    chmod 600 "$table" 2>/dev/null || true
    printf '%s\n' "$tok"
}

# detokenize-value: reverse of tokenize-value (looks up by token, returns original).
cmd_detokenize_value() {
    local table="$DEFAULT_TABLE"
    local token=""
    while [ $# -gt 0 ]; do
        case "$1" in
            --table) shift; table="$1" ;;
            -h|--help) echo "usage: detokenize-value TOKEN" >&2; exit 2 ;;
            *) token="$1" ;;
        esac
        shift
    done
    [ -n "$token" ] || die "detokenize-value needs a TOKEN"
    [ -f "$table" ] || die "no lookup table at $table"
    local val
    val=$(awk -F'\t' -v t="$token" 'NR>1 && $1==t { print $3; exit }' "$table")
    if [ -z "$val" ]; then
        printf 'no such token: %s\n' "$token" >&2
        return 2
    fi
    printf '%s\n' "$val"
}

init_table() {
    local table="$1"
    mkdir -p "$(dirname "$table")" 2>/dev/null
    chmod 700 "$(dirname "$table")" 2>/dev/null || true
    if [ ! -f "$table" ]; then
        umask 077
        printf 'token\tcategory\toriginal\n' > "$table"
        chmod 600 "$table"
    fi
}

# ─────────────────────────────────────────────────────────────────────────────
# Main sanitize logic — single awk pass over HL7 messages.
# ─────────────────────────────────────────────────────────────────────────────
do_sanitize() {
    local input_file="$1"
    local rules="$2"
    local table="$3"
    local strict="$4"
    local update_table="$5"

    init_table "$table"

    # Write rules to a temp file because awk -v can't carry newlines on macOS
    local rules_tmp; rules_tmp=$(mktemp)
    printf '%s\n' "$rules" > "$rules_tmp"
    trap 'rm -f "$rules_tmp"' RETURN

    local awk_script
    awk_script=$(cat <<'AWK_END'
BEGIN {
    ORS = "\r"
    # During BEGIN, read auxiliary files with newline separator
    RS = "\n"
    while ((getline line < RULES_FILE) > 0) {
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", line)
        if (line == "" || substr(line,1,1) == "#") continue
        split(line, p, "|")
        if (p[1] == "" || p[2] == "" || p[3] == "") continue
        rule_cat[p[1] SUBSEP (p[2]+0)] = p[3]
        has_seg[p[1]] = 1
    }
    close(RULES_FILE)

    while ((getline tline < TABLE) > 0) {
        if (tline ~ /^token\t/) continue
        split(tline, c, "\t")
        if (c[1] == "" || c[3] == "") continue
        token_for[c[2] SUBSEP c[3]] = c[1]
        if (match(c[1], /_[0-9]+\]\]$/)) {
            num = substr(c[1], RSTART+1, RLENGTH-3) + 0
            if (num > counter[c[2]]) counter[c[2]] = num
        }
    }
    close(TABLE)

    # Switch to CR for the main input (HL7 segments)
    RS = "\r"
}

function tokenize(val, cat,   key, t) {
    if (val == "" || val == "\"\"") return val
    key = cat SUBSEP val
    if (key in token_for) return token_for[key]
    counter[cat]++
    t = sprintf("[[%s_%04d]]", cat, counter[cat])
    token_for[key] = t
    if (UPDATE_TABLE == "1") {
        new_entries[++n_new] = t "\t" cat "\t" val
    }
    return t
}

{
    # Each record (a segment, separated by \r). Bytes outside segments — like
    # 0x1c message separators — get printed through unchanged if we just print.
    if (length($0) < 3) { print; next }
    seg = substr($0, 1, 3)

    # Strict mode: any Z segment we have no rules for → tokenize the whole segment body
    if (STRICT == "1" && substr(seg,1,1) == "Z" && !(seg in has_seg)) {
        body = substr($0, 5)  # skip "Zxx|"
        if (body != "") {
            tok = tokenize(body, "Z" substr(seg,2,2))
            print seg "|" tok
        } else {
            print
        }
        next
    }

    if (!(seg in has_seg)) { print; next }

    is_msh = (seg == "MSH")
    nf = split($0, fields, "|")

    # For MSH, fields[1] is "MSH", fields[2] is the encoding chars (MSH.2).
    # MSH.N → fields[N] for N >= 2; MSH.1 is the separator char (skip).
    # For others, SEG.N → fields[N+1].
    for (k in rule_cat) {
        split(k, kp, SUBSEP)
        if (kp[1] != seg) continue
        fnum = kp[2] + 0
        cat = rule_cat[k]
        idx = is_msh ? fnum : (fnum + 1)
        if (idx < 1 || idx > nf) continue
        fields[idx] = tokenize(fields[idx], cat)
    }
    line = fields[1]
    for (j=2; j<=nf; j++) line = line "|" fields[j]
    print line
}

END {
    if (UPDATE_TABLE == "1" && n_new > 0) {
        # Use explicit \n — ORS is \r for the main HL7 input loop, not for the table.
        for (i=1; i<=n_new; i++) printf "%s\n", new_entries[i] >> TABLE
        close(TABLE)
    }
    printf "hl7-sanitize: %d new token(s) created; %d total in %s\n", \
        n_new, length(token_for), TABLE > "/dev/stderr"
}
AWK_END
)

    if [ -n "$input_file" ]; then
        awk -v RULES_FILE="$rules_tmp" -v TABLE="$table" -v STRICT="$strict" \
            -v UPDATE_TABLE="$update_table" \
            "$awk_script" "$input_file"
    else
        awk -v RULES_FILE="$rules_tmp" -v TABLE="$table" -v STRICT="$strict" \
            -v UPDATE_TABLE="$update_table" \
            "$awk_script" /dev/stdin
    fi
}

# ─────────────────────────────────────────────────────────────────────────────
# Dispatch
# ─────────────────────────────────────────────────────────────────────────────
SUB="${1:-}"
case "$SUB" in
    show-rules) shift; cmd_show_rules ;;
    show-table) shift; cmd_show_table "$@" ;;
    clear-table) shift; cmd_clear_table "$@" ;;
    count) shift; cmd_count "$@" ;;
    tokenize-value) shift; cmd_tokenize_value "$@" ;;
    detokenize-value) shift; cmd_detokenize_value "$@" ;;
    -h|--help) sed -n '2,30p' "$NC_SELF"; exit 0 ;;
    *)
        # Default = sanitize mode
        input_file=""
        rules="$PHI_RULES_DEFAULT"
        table="$DEFAULT_TABLE"
        strict=0
        update_table=1
        while [ $# -gt 0 ]; do
            case "$1" in
                --strict) strict=1 ;;
                --no-update-table) update_table=0 ;;
                --table) shift; table="$1" ;;
                --rules-file) shift; [ -f "$1" ] || die "no such rules file: $1"; rules=$(cat "$1") ;;
                -h|--help) sed -n '2,30p' "$NC_SELF"; exit 0 ;;
                -*) die "unknown flag: $1" ;;
                *) input_file="$1" ;;
            esac
            shift
        done
        do_sanitize "$input_file" "$rules" "$table" "$strict" "$update_table"
        ;;
esac