v0.7.3: automatic PHI detection (tiered detection + blacklist contexts)

Adds automatic PHI tokenization on two surfaces: user input and HL7-shaped
tool results. Supersedes Bryan's reverted af2ffe8 prototype with a tiered
confidence model, explicit blacklist contexts, structured audit log, and
tool-result coverage.

Bryan's directive: "Err on the side of caution and tokenize anything you
think you may need to as long as it doesn't break the tools." Priority
order: (1) don't break tools (constraint), (2) catch all PHI (goal),
(3) minimize false positives (secondary).

Detection — four-tier model (first match wins per token):

  Tier 1 DEFINITE   SSN (with dashes), email, formatted phone, NPI with
                    explicit "NPI:" prefix. Always tokenize.
  Tier 2 CONTEXTUAL Numeric value preceded by MRN/Patient/DOB/Account/
                    Visit/Acct/Record/Birth within 20 chars. Always.
  Tier 3 HL7-CTX    Plausibly-PHI-shaped values when line mentions
                    PID.3/5/7/11/13/18, NK1.*, GT1.*, IN1.16-20.
                    Aggressive — prompts in confirm mode.
  Tier 4 KNOWN      Value already exists in $LARRY_HOME/sanitize/lookup.tsv.
                    Tier-4 scans the full set of categories actually present
                    in the table (not a hardcoded shortlist), so any
                    category Bryan has used before is checked.

Blacklist contexts (NEVER tokenize, even on tier match):
  * Path-like (/, ./, ../, ~/, contains /)
  * HL7 field references like PID.18 — the digit after the dot is a
    field index, not an MRN (spec verification scenario #5)
  * Version strings (vN.N.N, semver) and ISO dates (overridden by
    explicit DOB/Birth context so "DOB 1980-01-15" still tokenizes)
  * Port keywords (:NNNN, port NNNN, tcp/udp NNNN, LISTEN/PORT=)
  * Error/status codes (error NNN, code NNN, HTTP NNN, rc=N)
  * JSON key position (value followed by ": or :)
  * Fenced code blocks (``` ... ``` skipped via awk redactor)
  * Timestamps (epoch ms 13+ digits, epoch s 10 digits starting 1)

Tool-result surface — routed through hl7-sanitize.sh:
  * Eligible tools: read_file (.hl7/.HL7/.txt/.TXT only), nc_msgs,
    hl7_field, hl7_diff
  * Eligibility further gated by _auto_phi_looks_like_hl7 shape check
    (segment headers MSH/PID/EVN/PV1 with | delimiter)
  * Generic outputs (list_dir, grep_files, bash_exec, glob_files, ssh_exec,
    web search) NEVER scanned — spec is explicit about this
  * For HL7-shaped content we use the canonical field-aware pipeline
    rather than the prose detector, since segments are pipe-delimited
    and would otherwise be a single whitespace token. Both pipelines
    share lookup.tsv so tokens are stable across surfaces.

Behavior controls:
  * env LARRY_AUTO_PHI: 1/on (default), 0/off, confirm
  * /phi-auto on|off|confirm|status slash command
  * "!nophi " per-turn prefix override
  * Manual @@VALUE / {{phi:VALUE}} markers always win — preprocessed
    FIRST; auto-PHI fills gaps in things Bryan didn't manually mark.
  * After each pass, dim status line summarises:
      phi> auto-tokenized 3 value(s) [user_input]: MRN×1 EMAIL×1 SSN×1

Audit — JSONL log at $LARRY_HOME/log/auto-phi.log:
  { "ts": "...", "value": "...", "category": "...", "token": "...",
    "tier": "definite|contextual|hl7|known|hl7_pipeline",
    "surface": "user_input|tool_result", "context": "..." }
  Mode 0600, parent dir 0700. Best-effort write; never fails the host call.

Library changes (lib/hl7-sanitize.sh):
  * normalize_value: re-add EMAIL + PHONE arms + new NPI arm. EMAIL and
    PHONE arms were originally in af2ffe8 (reverted with v0.7.1) — cited
    in the source comments.
  * normalize-value subcommand: exposes canonical normalization so auto-PHI
    can build per-session memory keys. Originally af2ffe8.
  * lookup-original subcommand: probes the table for an exact match without
    creating new tokens. Used by Tier-4 "already-known" detection.

Implementation notes:
  * macOS bash 3.2 compatibility: ${pos: -20} returns empty when len < 20;
    use explicit ${pos:$((len-20))} guarded by length check.
  * Per-session decision cache (accept/decline) uses bash 4 associative
    arrays with a 3.2 fallback to pipe-delimited string membership.
  * Confirm-mode prompts only Tier 3-4 — Tier 1-2 hits are high-confidence
    and always tokenize even in confirm mode (Bryan: err on caution).
  * Detection loop iterates line-by-line so fenced-code redaction works
    and so left/right context is meaningful per token.

Verification matrix (18/18 pass):
  1 SSN tokenized, 2 Email tokenized, 3 MRN contextual,
  4 bare digits skipped, 5 PID.18 skipped, 6 path skipped,
  7 version skipped, 8 port skipped, 9 Tier-4 known catches custom
  category (EMP), 10 !nophi skips, 11 existing token left alone,
  12 read_file .hl7 sanitizes all PHI fields, 13 .py not HL7-shaped,
  14 list_dir not HL7-shaped, 15 mode=off skips, 16a /phi-auto off
  skips, 16b /phi-auto on tokenizes, 17 audit JSONL parseable.

No regressions to v0.7.2 origin switching, v0.7.1 status-line position,
v0.7.0 HL7 completion + mouse mode, v0.6.9 status state, v0.6.7 streaming,
or any earlier OAuth/SSH/lessons work. MANIFEST unchanged.

Divergence from af2ffe8 (cited in source comments):
  * Tiered classifier (vs. flat regex set) — enables reasoning about WHY
    a value tokenized; gates confirm-mode behavior.
  * Explicit blacklist contexts — addresses spec false-positive cases
    that af2ffe8 missed (HL7 field refs, ports, error codes, JSON keys).
  * Tool-result surface — af2ffe8 only ran on user input.
  * Structured JSONL audit log — af2ffe8 had no per-tokenization log.
  * /phi-auto semantics: on|off|confirm|status (spec) vs. af2ffe8's
    /auto-phi on|off|aggressive|confirm.
  * Dropped the loose "Title Case Title Case" pair detector and its
    name-allowlist — too high FP rate against narrative prose
    ("Larry Anywhere", "Mac Studio") and Bryan's name-allowlist couldn't
    keep up with the long tail. Name detection now Tier-3 (HL7-context
    only) and Tier-4 (already-known) only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Bryan Johnson 2026-05-27 17:37:26 -07:00
parent 81c4875ecf
commit 58e6bf4e03
3 changed files with 755 additions and 2 deletions

View File

@ -1 +1 @@
0.7.2 0.7.3

706
larry.sh
View File

@ -54,7 +54,7 @@ set -o pipefail
# ───────────────────────────────────────────────────────────────────────────── # ─────────────────────────────────────────────────────────────────────────────
# Config # Config
# ───────────────────────────────────────────────────────────────────────────── # ─────────────────────────────────────────────────────────────────────────────
LARRY_VERSION="0.7.2" LARRY_VERSION="0.7.3"
LARRY_HOME="${LARRY_HOME:-$HOME/.larry}" LARRY_HOME="${LARRY_HOME:-$HOME/.larry}"
# ───────────────────────────────────────────────────────────────────────────── # ─────────────────────────────────────────────────────────────────────────────
@ -980,6 +980,597 @@ preprocess_phi_markers() {
printf '%s' "$input" printf '%s' "$input"
} }
# ─────────────────────────────────────────────────────────────────────────────
# v0.7.3 — Automatic PHI detection (supersedes the af2ffe8 prototype).
#
# Background
# ----------
# Bryan's directive: "Err on the side of caution and tokenize anything you
# think you may need to as long as it doesn't break the tools." Priorities,
# in order:
# 1. DO NOT break tools (constraint)
# 2. Catch all PHI (goal)
# 3. Minimize false positives (nice-to-have, secondary)
#
# Reference implementation: commit af2ffe8 (reverted with v0.7.1) — Bryan's
# own first pass. v0.7.3 supersedes it with:
# * Four-tier confidence model (vs. af2ffe8's flat regex set) so we can
# reason about WHY a value tokenizes and gate behavior accordingly.
# * Explicit blacklist contexts (path-like, HL7 field refs like PID.18,
# version strings, port keywords, error/status codes, JSON keys, fenced
# code) — addresses the false-positive failure modes the spec calls out.
# * Tool-result surface: HL7-shaped tool outputs (read_file of .hl7, .txt
# with segments; nc_msgs output) get sanitized BEFORE entering message
# history. Generic outputs (list_dir, grep_files, web search) are NOT
# touched — the spec is explicit about this.
# * Structured JSONL audit log at $LARRY_HOME/log/auto-phi.log.
# * `/phi-auto on|off|confirm|status` slash command + LARRY_AUTO_PHI env.
# * Per-turn override `!nophi `.
#
# Surfaces
# --------
# 1. USER INPUT: invoked AFTER preprocess_phi_markers in main_loop, so any
# explicit @@VALUE / {{phi:V}} markers are already tokenized. Auto-PHI
# fills the gaps Bryan didn't manually mark.
# 2. TOOL RESULTS: invoked from the tool dispatch path (agent_turn) when the
# result text looks like HL7 (contains `\rMSH|` or starts with `MSH|`).
#
# Detection tiers (first match wins per token)
# --------------------------------------------
# Tier 1 DEFINITE SSN, email, phone (formatted), NPI (with context).
# High confidence. Always tokenize.
# Tier 2 CONTEXTUAL Numeric value immediately preceded by an MRN/Patient/
# DOB/Account/Visit/Acct/Record/Birth keyword (within 5
# chars). Always tokenize.
# Tier 3 HL7-CTX When the line/paragraph mentions a known-PHI HL7 field
# ref (PID.3, PID.5, PID.7, PID.11, PID.13, PID.18,
# NK1.*, GT1.*), be aggressive about plausibly-PHI-shaped
# values in the same line. Tokenize.
# Tier 4 KNOWN Value already exists in $LARRY_HOME/sanitize/lookup.tsv
# (Bryan has tokenized this exact value before). Always
# tokenize, reusing the existing token.
#
# Blacklist contexts (NEVER tokenize, even if a tier matches)
# -----------------------------------------------------------
# * Path-like: starts with /, ./, ../, ~/, contains /
# * HL7 field references THEMSELVES: [A-Z]{3}\.\d+ — the digit after the
# dot is a field index, not PHI. (Critical: spec verification #5.)
# * Version strings: vN.N.N, semver, ISO dates YYYY-MM-DD
# * Port numbers: :NNNN, port NNNN, tcp:NNNN, PROTOCOL.PORT=
# * Error / status codes: error NNN, code NNN, rc=N, HTTP NNN, status NNN
# * JSON key position: token immediately followed by ":" with a string
# value to the right (don't break tool argument JSON)
# * Fenced code: anything inside ``` ... ``` is skipped wholesale
#
# Behavior controls
# -----------------
# env LARRY_AUTO_PHI 1 (default, ON) | 0 (off) | confirm (prompt on
# Tier 3-4 matches)
# /phi-auto on|off|confirm|status
# !nophi <prompt> per-turn override (strip prefix, skip auto-PHI)
#
# After each pass, a dim status line summarises what was caught:
# phi> auto-tokenized 3 values: MRN×1 NAME×1 DOB×1
#
# Audit
# -----
# Every tokenization writes a JSONL line to $LARRY_HOME/log/auto-phi.log:
# { "ts": "...", "value": "<orig>", "category": "MRN", "token": "[[MRN_0042]]",
# "tier": "contextual", "surface": "user_input"|"tool_result", "context": "..." }
# ─────────────────────────────────────────────────────────────────────────────
# Mode resolution. Env default per spec: ON unless 0 / off.
# Accepted env values: "1" / "on" / "" → on ; "0" / "off" → off ; "confirm" → confirm.
# (aggressive accepted as an alias for "on" to preserve af2ffe8 muscle memory.)
_resolve_auto_phi_mode() {
local v="${LARRY_AUTO_PHI:-1}"
case "$v" in
0|off|OFF) printf 'off' ;;
confirm|CONFIRM) printf 'confirm' ;;
1|on|ON|aggressive|"") printf 'on' ;;
*) printf 'on' ;;
esac
}
AUTO_PHI_MODE="$(_resolve_auto_phi_mode)"
AUTO_PHI_SESSION_COUNT=0
AUTO_PHI_LOG="${LARRY_HOME}/log/auto-phi.log"
# Per-session confirm cache. Keyed on canonical-normalized form so
# "John Smith" / "JOHN SMITH" share a single decision.
if (( BASH_VERSINFO[0] >= 4 )); then
declare -A _AUTO_PHI_ACCEPTED 2>/dev/null
declare -A _AUTO_PHI_DECLINED 2>/dev/null
else
_AUTO_PHI_ACCEPTED_LIST=""
_AUTO_PHI_DECLINED_LIST=""
fi
_auto_phi_accept_check() {
local key="$1"
if (( BASH_VERSINFO[0] >= 4 )); then
[ -n "${_AUTO_PHI_ACCEPTED[$key]:-}" ]
else
[[ "|$_AUTO_PHI_ACCEPTED_LIST|" == *"|$key|"* ]]
fi
}
_auto_phi_decline_check() {
local key="$1"
if (( BASH_VERSINFO[0] >= 4 )); then
[ -n "${_AUTO_PHI_DECLINED[$key]:-}" ]
else
[[ "|$_AUTO_PHI_DECLINED_LIST|" == *"|$key|"* ]]
fi
}
_auto_phi_mark_accept() {
local key="$1"
if (( BASH_VERSINFO[0] >= 4 )); then
_AUTO_PHI_ACCEPTED[$key]=1
else
_AUTO_PHI_ACCEPTED_LIST="${_AUTO_PHI_ACCEPTED_LIST}|$key"
fi
}
_auto_phi_mark_decline() {
local key="$1"
if (( BASH_VERSINFO[0] >= 4 )); then
_AUTO_PHI_DECLINED[$key]=1
else
_AUTO_PHI_DECLINED_LIST="${_AUTO_PHI_DECLINED_LIST}|$key"
fi
}
# Emit one JSONL audit entry. All fields jq-quoted to handle PHI characters
# safely. Best-effort — never fail the host call.
_auto_phi_log() {
local value="$1" category="$2" token="$3" tier="$4" surface="$5" context="$6"
local logdir; logdir="$(dirname "$AUTO_PHI_LOG")"
mkdir -p "$logdir" 2>/dev/null
chmod 700 "$logdir" 2>/dev/null || true
local ts; ts="$(date -Iseconds 2>/dev/null || date)"
# Truncate context to ~40 chars around the hit so we don't bloat the log.
local ctx_short="${context:0:80}"
jq -cn \
--arg ts "$ts" --arg value "$value" --arg category "$category" \
--arg token "$token" --arg tier "$tier" --arg surface "$surface" \
--arg context "$ctx_short" \
'{ts:$ts,value:$value,category:$category,token:$token,tier:$tier,surface:$surface,context:$context}' \
>> "$AUTO_PHI_LOG" 2>/dev/null || true
chmod 600 "$AUTO_PHI_LOG" 2>/dev/null || true
}
# ── Blacklist guards. Return 0 (true) when the token should be SKIPPED. ───
# Most guards take (token, left_context, full_line) so we can look at the
# surroundings without re-tokenizing the whole input.
_auto_phi_skip_path_like() {
local v="$1"
case "$v" in
/*|./*|../*|~/*) return 0 ;;
[A-Z]:\\*) return 0 ;;
*/*) return 0 ;;
esac
return 1
}
# HL7 field-reference guard. The DIGIT in "PID.18" is a field index, NOT
# an MRN. We need to spot this BEFORE the contextual MRN/NPI checks fire.
# Pattern: if left_context ends with [A-Z]{3}\. immediately before our digit
# token, skip.
_auto_phi_skip_hl7_fieldref() {
local v="$1" left="$2"
[[ "$v" =~ ^[0-9]+$ ]] || return 1
[[ "$left" =~ [A-Z]{3}\.$ ]] && return 0
return 1
}
_auto_phi_skip_version() {
local v="$1"
# vN.N.N, vN.N
[[ "$v" =~ ^v[0-9]+(\.[0-9]+){1,3}$ ]] && return 0
# Bare semver N.N.N
[[ "$v" =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]] && return 0
# ISO date YYYY-MM-DD
[[ "$v" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]] && return 0
# Date-or-time-ish 8 pure digits that look like YYYYMMDD in the 1900-2099
# range. (Conservative — only block if it parses as a plausible date.)
if [[ "$v" =~ ^(19|20)[0-9]{2}(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])$ ]]; then
return 0
fi
return 1
}
_auto_phi_skip_port() {
local v="$1" left="$2"
[[ "$v" =~ ^[0-9]+$ ]] || return 1
# :NNNN
[[ "$left" =~ :$ ]] && return 0
# "port NNNN" / "tcp:NNNN" / "PROTOCOL.PORT=NNNN" / "TCP PORT"
if [[ "$left" =~ ([Pp]ort|PORT|tcp|TCP|udp|UDP|listen|LISTEN)[[:space:]:=]*$ ]]; then
return 0
fi
return 1
}
_auto_phi_skip_errcode() {
local v="$1" left="$2"
[[ "$v" =~ ^[0-9]+$ ]] || return 1
if [[ "$left" =~ ([Ee]rror|ERROR|[Cc]ode|CODE|HTTP|http|[Ss]tatus|STATUS|rc=|RC=)[[:space:]:=]*$ ]]; then
return 0
fi
return 1
}
# JSON key position: token immediately followed by `":` (or `: ` with a
# JSON-quoted value to the right). We test using the SURROUNDING line.
_auto_phi_skip_json_key() {
local v="$1" right="$2"
case "$right" in
\"*) return 0 ;;
:*) return 0 ;;
esac
return 1
}
# Timestamp guard — 13+ digit epoch milliseconds, or 10 digits starting with
# '1' (epoch seconds in the 2001-2286 range). Carry-over from af2ffe8's
# detector.
_auto_phi_skip_timestamp() {
local v="$1"
[[ "$v" =~ ^[0-9]+$ ]] || return 1
local n="${#v}"
[ "$n" -ge 13 ] && return 0
if [ "$n" -eq 10 ] && [[ "$v" == 1* ]]; then return 0; fi
return 1
}
# Already-tokenized form: [[CAT_NNNN]]. Skip — we don't double-tokenize.
_auto_phi_skip_already_token() {
[[ "$1" =~ ^\[\[[A-Z][A-Z0-9_]*_[0-9]+\]\]$ ]]
}
# Already inside a manual marker (@@…@@, @@…, {{phi:…}}). Caller can detect
# by checking that the marker has already been substituted to a token by
# preprocess_phi_markers — by the time auto_detect_phi runs, those are gone.
# This guard is defensive: skip leftover marker-shaped tokens.
_auto_phi_skip_marker_residue() {
local v="$1"
[[ "$v" == @@* ]] && return 0
[[ "$v" == *@@ ]] && return 0
[[ "$v" == \{\{phi:* ]] && return 0
return 1
}
# ── Tier classifier. Returns "TIER|CATEGORY" or empty. ──
# Args: value, left_context, right_context, full_line.
_auto_phi_classify_tiered() {
local v="$1" left="$2" right="$3" line="$4"
[ -z "$v" ] && return 0
# Universal skips (apply before any tier).
_auto_phi_skip_already_token "$v" && return 0
_auto_phi_skip_marker_residue "$v" && return 0
_auto_phi_skip_path_like "$v" && return 0
_auto_phi_skip_hl7_fieldref "$v" "$left" && return 0
_auto_phi_skip_json_key "$v" "$right" && return 0
# Version skip is overridden when a DOB/Birth keyword precedes — an ISO
# date in DOB context IS the DOB, not a version string. Err on caution.
if ! [[ "$left" =~ ([Dd][Oo][Bb]|[Bb]irth|BIRTH)[[:space:]:#=]*$ ]]; then
_auto_phi_skip_version "$v" && return 0
fi
# Strip one trailing sentence-grammar char for classification.
local trimmed="$v"
case "$trimmed" in
*[.,\;:\!\?\)]) trimmed="${trimmed%?}" ;;
esac
# URL — leave alone.
case "$trimmed" in
http://*|https://*|ssh://*|ftp://*|sftp://*|file://*|ws://*|wss://*) return 0 ;;
esac
# ── TIER 1: DEFINITE ──
# Email
if [[ "$trimmed" =~ ^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$ ]]; then
printf 'definite|EMAIL'; return
fi
# SSN with dashes (definite shape)
if [[ "$trimmed" =~ ^[0-9]{3}-[0-9]{2}-[0-9]{4}$ ]]; then
printf 'definite|SSN'; return
fi
# Phone — formatted variants ((NNN) NNN-NNNN, NNN-NNN-NNNN)
if [[ "$trimmed" =~ ^\([0-9]{3}\)[[:space:]]?[0-9]{3}-[0-9]{4}$ ]] \
|| [[ "$trimmed" =~ ^[0-9]{3}-[0-9]{3}-[0-9]{4}$ ]] \
|| [[ "$trimmed" =~ ^[0-9]{3}\.[0-9]{3}\.[0-9]{4}$ ]]; then
printf 'definite|PHONE'; return
fi
# NPI with explicit context (NPI: NNNNNNNNNN, or in PV1.7 field context)
if [[ "$trimmed" =~ ^[0-9]{10}$ ]]; then
if [[ "$left" =~ (NPI|npi)[[:space:]:=]*$ ]]; then
printf 'definite|NPI'; return
fi
fi
# Timestamp/port/errcode rejection (applies before contextual MRN check).
_auto_phi_skip_timestamp "$trimmed" && return 0
_auto_phi_skip_port "$trimmed" "$left" && return 0
_auto_phi_skip_errcode "$trimmed" "$left" && return 0
# ── TIER 2: CONTEXTUAL ──
# Numeric value preceded by an MRN/Patient/DOB/Account/Visit/Acct/
# Record/Birth keyword within 5 chars (i.e. immediately).
if [[ "$trimmed" =~ ^[0-9]+$ ]]; then
if [[ "$left" =~ ([Mm][Rr][Nn]|[Pp]atient|PATIENT|[Dd][Oo][Bb]|[Aa]ccount|ACCOUNT|[Vv]isit|VISIT|[Aa]cct|ACCT|[Rr]ecord|RECORD|[Bb]irth|BIRTH)[[:space:]:#=]*$ ]]; then
local cat
case "${BASH_REMATCH[1]}" in
DOB|dob|Dob|Birth|birth|BIRTH) cat="DOB" ;;
Account|account|ACCOUNT|Acct|acct|ACCT|Visit|visit|VISIT) cat="ACCT" ;;
*) cat="MRN" ;;
esac
printf 'contextual|%s' "$cat"; return
fi
fi
# Date-shaped value preceded by DOB/Birth → DOB tier 2.
if [[ "$trimmed" =~ ^[0-9]{1,4}[/-][0-9]{1,2}[/-][0-9]{1,4}$ ]]; then
if [[ "$left" =~ ([Dd][Oo][Bb]|[Bb]irth|BIRTH)[[:space:]:#=]*$ ]]; then
printf 'contextual|DOB'; return
fi
# Bare M/D/Y outside of DOB context is still date-shaped; treat as Tier 3
# only if HL7 PHI fields are mentioned in the line.
fi
# SSN without dashes (9 raw digits) needs context too — too easy to confuse
# with other 9-digit IDs.
if [[ "$trimmed" =~ ^[0-9]{9}$ ]]; then
if [[ "$left" =~ ([Ss][Ss][Nn]|SSN)[[:space:]:#=]*$ ]]; then
printf 'contextual|SSN'; return
fi
fi
# ── TIER 3: HL7-CONTEXT ──
# Only kicks in if the surrounding line mentions a known-PHI HL7 field ref.
local hl7_ctx=0
if [[ "$line" =~ (PID\.(3|5|7|11|13|18)|NK1\.[0-9]+|GT1\.[0-9]+|IN1\.(16|17|18|19|20)) ]]; then
hl7_ctx=1
fi
if [ "$hl7_ctx" = "1" ]; then
# HL7 caret-name (FAMILY^GIVEN^MIDDLE…)
if [[ "$trimmed" =~ ^[A-Za-z][A-Za-z\'-]*\^[A-Za-z][A-Za-z\'-]* ]]; then
printf 'hl7|NAME'; return
fi
# Date-shaped → DOB
if [[ "$trimmed" =~ ^[0-9]{1,4}[/-][0-9]{1,2}[/-][0-9]{1,4}$ ]]; then
printf 'hl7|DOB'; return
fi
# 6-12 pure digits in HL7 context → MRN
if [[ "$trimmed" =~ ^[0-9]{6,12}$ ]]; then
printf 'hl7|MRN'; return
fi
fi
return 0
}
# Tier-4 ("already-known") check: is the value already in lookup.tsv?
# Returns 0 (true) + emits "CATEGORY|TOKEN" on hit; 1 + empty on miss.
# Reads the full set of categories actually present in lookup.tsv at the
# time of the call, so user-added categories (EMP, INSPOL, etc. from the
# default PHI rules) are all considered, not just a hardcoded shortlist.
_auto_phi_check_known() {
local v="$1"
local sanitize_script="$LARRY_LIB_DIR/hl7-sanitize.sh"
[ -x "$sanitize_script" ] || return 1
local table="${LARRY_HOME}/sanitize/lookup.tsv"
[ -f "$table" ] || return 1
# Discover categories present in the table (column 2, skip header).
local cats
cats=$(awk -F'\t' 'NR>1 && $2 != "" { print $2 }' "$table" 2>/dev/null | sort -u)
[ -z "$cats" ] && return 1
local cat token
while IFS= read -r cat; do
[ -z "$cat" ] && continue
token=$("$sanitize_script" lookup-original "$v" "$cat" 2>/dev/null)
if [ -n "$token" ]; then
printf '%s|%s' "$cat" "$token"
return 0
fi
done <<< "$cats"
return 1
}
# Confirm-mode interactive Y/n. Returns 0 to proceed with tokenization, 1 to
# skip. Caches decision per-session under the normalized key.
_auto_phi_confirm() {
local value="$1" category="$2" tier="$3"
local sanitize_script="$LARRY_LIB_DIR/hl7-sanitize.sh"
local mem_key
mem_key=$("$sanitize_script" normalize-value "$value" "$category" 2>/dev/null) || mem_key="$value"
[ -z "$mem_key" ] && mem_key="$value"
if _auto_phi_decline_check "$mem_key"; then return 1; fi
if _auto_phi_accept_check "$mem_key"; then return 0; fi
# confirm mode only prompts on Tier 3 and Tier 4 per spec — Tier 1 & 2
# are always tokenized even in confirm mode.
if [ "$tier" != "hl7" ] && [ "$tier" != "known" ]; then return 0; fi
printf '%sphi auto>%s tokenize "%s" as %s? [Y/n] ' \
"$C_YELLOW" "$C_RESET" "$value" "$category" >&2
local ans=""; IFS= read -r ans </dev/tty 2>/dev/null || ans=""
case "$ans" in
n|N|no|NO|No) _auto_phi_mark_decline "$mem_key"; return 1 ;;
*) _auto_phi_mark_accept "$mem_key"; return 0 ;;
esac
}
# Strip fenced code blocks (``` … ```) from a copy of the input before
# token scanning. The caller scans the redacted version; substitutions are
# applied against the ORIGINAL input via literal string replace, so any
# values inside a fenced block remain untouched.
_auto_phi_redact_fences() {
local s="$1"
# Replace anything between triple-backticks with spaces of same length.
printf '%s' "$s" | awk '
BEGIN { in_fence=0 }
/^[[:space:]]*```/ { in_fence = !in_fence; print ""; next }
{ if (in_fence) print ""; else print }
'
}
# Detect HL7 shape — used to gate the tool-result surface.
_auto_phi_looks_like_hl7() {
local s="$1"
# Strong signal: contains a CR-separated MSH segment, or starts with MSH|.
case "$s" in
MSH\|*) return 0 ;;
esac
case "$s" in
*$'\r'MSH\|*) return 0 ;;
*$'\n'MSH\|*) return 0 ;;
esac
# Also accept "MSH|" near start (some tool outputs add a header line).
printf '%s' "$s" | head -c 4096 | grep -qE '(^|[\r\n])(MSH|PID|EVN|PV1)\|' && return 0
return 1
}
# Main detector. Args: surface ("user_input"|"tool_result"), input text.
# Echoes the rewritten input. Status message goes to stderr.
auto_detect_phi() {
local surface="$1"
local input="$2"
local sanitize_script="$LARRY_LIB_DIR/hl7-sanitize.sh"
[ -x "$sanitize_script" ] || { printf '%s' "$input"; return; }
# Per-turn override (user-input surface only).
if [ "$surface" = "user_input" ] && [[ "$input" == '!nophi '* ]]; then
printf '%s' "${input#!nophi }"
return 0
fi
if [ "$AUTO_PHI_MODE" = "off" ]; then
printf '%s' "$input"
return 0
fi
# Build a fence-redacted scan copy. Substitutions still happen on $input.
local scan; scan=$(_auto_phi_redact_fences "$input")
# Collect hits as TIER|CAT|VALUE rows. Use newline-separated for safety.
local hits=""
local line left_context right_context tok token_cat token_tier
local i ch
# Iterate line by line; tokenize by whitespace within each line so we can
# compute left/right context. This is bash-only — no awk dependency for the
# detection loop, only for fence redaction.
while IFS= read -r line; do
# Pass over whitespace-delimited tokens. We track an offset within the
# line to compute left/right context.
local offset=0 trimmed_line="$line"
local -a words
# shellcheck disable=SC2206
words=( $line )
local w widx=0
for w in "${words[@]}"; do
[ -z "$w" ] && continue
# Compute left context: ~20 chars before the word. Manual slice for
# macOS bash 3.2 — `${pos: -20}` returns empty when len(pos) < 20.
local pos="${line%%"$w"*}"
local _plen=${#pos}
local left
if [ "$_plen" -le 20 ]; then
left="$pos"
else
left="${pos:$((_plen-20))}"
fi
local right_pos="${line#*"$w"}"
local right="${right_pos:0:20}"
# Try comma-split sub-tokens too (e.g. "a@b.com,c@d.com").
local sub
local IFS=','
for sub in $w; do
[ -z "$sub" ] && continue
unset IFS
local result
result=$(_auto_phi_classify_tiered "$sub" "$left" "$right" "$line")
if [ -n "$result" ]; then
token_tier="${result%%|*}"
token_cat="${result##*|}"
hits+="${token_tier}|${token_cat}|${sub}"$'\n'
else
# Try Tier-4 already-known.
local known
known=$(_auto_phi_check_known "$sub" 2>/dev/null)
if [ -n "$known" ]; then
token_cat="${known%%|*}"
hits+="known|${token_cat}|${sub}"$'\n'
fi
fi
IFS=','
done
unset IFS
widx=$((widx+1))
done
done <<< "$scan"
[ -z "$hits" ] && { printf '%s' "$input"; return 0; }
# Dedupe hits (preserving first-seen order).
local seen_hash=""
local uniq_hits=""
local h
while IFS= read -r h; do
[ -z "$h" ] && continue
case $'\n'"$seen_hash"$'\n' in
*$'\n'"$h"$'\n'*) continue ;;
esac
seen_hash+="$h"$'\n'
uniq_hits+="$h"$'\n'
done <<< "$hits"
# Per-category counters for the status summary.
local -A cat_count=()
while IFS= read -r h; do
[ -z "$h" ] && continue
local tier="${h%%|*}"; local rest="${h#*|}"
local cat="${rest%%|*}"; local orig="${rest#*|}"
# Confirm mode gating (Tier 3-4 only).
if [ "$AUTO_PHI_MODE" = "confirm" ]; then
_auto_phi_confirm "$orig" "$cat" "$tier" || continue
fi
# Tokenize via the canonical pipeline.
local token
token=$("$sanitize_script" tokenize-value --category "$cat" "$orig" 2>/dev/null)
[ -z "$token" ] && continue
# Substitute. Literal string replace catches all occurrences.
input="${input//"$orig"/"$token"}"
# Bookkeeping.
cat_count[$cat]=$(( ${cat_count[$cat]:-0} + 1 ))
AUTO_PHI_SESSION_COUNT=$(( AUTO_PHI_SESSION_COUNT + 1 ))
# Audit. Context: short slice around the value from the ORIGINAL input.
local ctx; ctx=$(printf '%s' "$scan" | grep -F -- "$orig" | head -1 | head -c 80)
_auto_phi_log "$orig" "$cat" "$token" "$tier" "$surface" "$ctx"
done <<< "$uniq_hits"
# Emit a single status summary if anything was tokenized.
if [ ${#cat_count[@]} -gt 0 ]; then
local summary="" total=0 k
for k in "${!cat_count[@]}"; do
summary+="${k}×${cat_count[$k]} "
total=$(( total + cat_count[$k] ))
done
summary="${summary% }"
printf '%sphi>%s auto-tokenized %d value(s) [%s]: %s\n' \
"$C_DIM" "$C_RESET" "$total" "$surface" "$summary" >&2
fi
printf '%s' "$input"
}
tool_hl7_sanitize() { tool_hl7_sanitize() {
local input_path="$1" strict="${2:-0}" local input_path="$1" strict="${2:-0}"
_lib_err_if_missing || return _lib_err_if_missing || return
@ -2498,6 +3089,51 @@ agent_turn() {
result="Tool returned malformed JSON; raw body: $(printf '%s' "$result" | head -c 200)" result="Tool returned malformed JSON; raw body: $(printf '%s' "$result" | head -c 200)"
;; ;;
esac esac
# v0.7.3 — auto-PHI on tool results.
# Gating per spec: only HL7-shaped output gets sanitized. We allow-list
# the tool names that can return HL7 (read_file of an .hl7/.txt, hl7_*
# tools, nc_msgs). Generic outputs (list_dir, grep_files, bash_exec,
# glob_files, web search, etc.) are NEVER touched — the spec is
# explicit: false positives there would break legitimate text.
#
# For HL7-shaped results we route through hl7-sanitize.sh (the
# canonical field-aware pipeline) — NOT auto_detect_phi (which is
# designed for prose, not pipe-delimited segment data). The two
# share lookup.tsv so tokens are stable across surfaces.
if [ "$AUTO_PHI_MODE" != "off" ]; then
local _ap_eligible=0
case "$name" in
read_file)
local _ap_path
_ap_path=$(printf '%s' "$input_json" | jq -r '.path // ""' 2>/dev/null)
case "$_ap_path" in
*.hl7|*.HL7|*.txt|*.TXT) _ap_eligible=1 ;;
esac
;;
nc_msgs|hl7_field|hl7_diff) _ap_eligible=1 ;;
esac
if [ "$_ap_eligible" = "1" ] && _auto_phi_looks_like_hl7 "$result"; then
local _ap_tmp _ap_sanitized _ap_before _ap_after
_ap_tmp=$(mktemp)
printf '%s' "$result" > "$_ap_tmp"
_ap_before=$(bash "$LARRY_LIB_DIR/hl7-sanitize.sh" count 2>/dev/null || echo 0)
_ap_sanitized=$(bash "$LARRY_LIB_DIR/hl7-sanitize.sh" "$_ap_tmp" 2>/dev/null)
if [ -n "$_ap_sanitized" ]; then
_ap_after=$(bash "$LARRY_LIB_DIR/hl7-sanitize.sh" count 2>/dev/null || echo 0)
result="$_ap_sanitized"
local _ap_new=$((_ap_after - _ap_before))
if [ "$_ap_new" -gt 0 ]; then
printf '%sphi>%s auto-tokenized %d HL7 field(s) in %s result [tool_result]\n' \
"$C_DIM" "$C_RESET" "$_ap_new" "$name" >&2
AUTO_PHI_SESSION_COUNT=$(( AUTO_PHI_SESSION_COUNT + _ap_new ))
_auto_phi_log "(hl7-sanitize batch)" "BATCH" "(+${_ap_new} tokens)" "hl7_pipeline" "tool_result" "$name"
fi
fi
rm -f "$_ap_tmp"
fi
fi
_LARRY_LAST_TOOL_RESULT="$result" _LARRY_LAST_TOOL_RESULT="$result"
log_append '```'; log_append "$result"; log_append '```' log_append '```'; log_append "$result"; log_append '```'
@ -2607,6 +3243,41 @@ Slash commands:
all collapse to the same token. all collapse to the same token.
Category is auto-detected from value shape (MRN/SSN/DOB/NAME/MANUAL). Category is auto-detected from value shape (MRN/SSN/DOB/NAME/MANUAL).
{{phi:VALUE}} / {{phi:CAT:VALUE}} legacy syntax (still works) {{phi:VALUE}} / {{phi:CAT:VALUE}} legacy syntax (still works)
Automatic PHI detection (v0.7.3, supersedes the af2ffe8 prototype):
Larry scans every prompt AND every HL7-shaped tool result for PHI-
shaped values and tokenizes them BEFORE conversation history is sent
to Anthropic. Four-tier confidence model with explicit blacklists for
paths, HL7 field refs (PID.18 is not an MRN), version strings, port
numbers, error codes, JSON keys, and fenced code blocks.
Tiers (first-match wins, all tokenize unless mode=off):
1 DEFINITE SSN with dashes, email, formatted phone, NPI (with
"NPI:" prefix). Always.
2 CONTEXTUAL Numeric value after MRN/Patient/DOB/Account/Visit/
Acct/Record/Birth keyword. Always.
3 HL7-CONTEXT Plausible-PHI values when PID.3/PID.5/PID.7/PID.11/
PID.13/PID.18, NK1.*, GT1.*, IN1.16-20 mentioned in
the same line. Aggressive — prompts in confirm mode.
4 KNOWN Value matches an existing lookup.tsv entry — Bryan
has seen this value before. Always.
Tool-result scan applies ONLY to read_file (.hl7/.txt), hl7_*, and
nc_msgs outputs, and ONLY if content is HL7-shaped. Generic outputs
(list_dir, grep_files, bash_exec, web search) are NEVER scanned — the
risk of breaking legitimate text outweighs the catch rate.
Modes (env LARRY_AUTO_PHI or /phi-auto):
on default — all four tiers always tokenize (caution-first)
confirm Tier 3-4 prompts Y/n once per session per canonical value
off disable auto-detection entirely (manual markers still work)
Per-turn override: prefix any prompt with "!nophi " to skip the scan
for that turn only. Explicit @@VALUE / {{phi:VALUE}} markers always
win — they are processed first; auto-PHI fills only the gaps.
Audit: every tokenization writes a JSONL entry to
\$LARRY_HOME/log/auto-phi.log (ts/value/category/token/tier/surface/context).
/redetect re-scan for HCIROOT/HCISITE/tools /redetect re-scan for HCIROOT/HCISITE/tools
/sites list site dirs under HCIROOT /sites list site dirs under HCIROOT
/site <name> switch HCISITE for this session /site <name> switch HCISITE for this session
@ -2742,6 +3413,7 @@ _LARRY_SLASH_CMDS=(
/hl7-fields /hl7-fields
/mouse /mouse
/origin /origin
/phi-auto
) )
# _LARRY_SLASH_CMDS_DESC — one-line descriptions for each slash command. # _LARRY_SLASH_CMDS_DESC — one-line descriptions for each slash command.
@ -2793,6 +3465,7 @@ _LARRY_SLASH_CMDS_DESC=(
[/hl7-fields]="<SEG.FIELD> print component breakdown (e.g. /hl7-fields PID.5)" [/hl7-fields]="<SEG.FIELD> print component breakdown (e.g. /hl7-fields PID.5)"
[/mouse]="on|off toggle xterm mouse mode for this session" [/mouse]="on|off toggle xterm mouse mode for this session"
[/origin]="show/pin auto-update origin (gitea|github|auto|<https URL>)" [/origin]="show/pin auto-update origin (gitea|github|auto|<https URL>)"
[/phi-auto]="on|off|confirm|status — runtime control for v0.7.3 auto PHI detection"
) )
# __larry_complete_slash — bound to TAB via `bind -x` (see _install_readline_tab). # __larry_complete_slash — bound to TAB via `bind -x` (see _install_readline_tab).
@ -3593,6 +4266,30 @@ main_loop() {
;; ;;
esac esac
continue ;; continue ;;
# v0.7.3 — runtime control for automatic PHI detection.
/phi-auto|/phi-auto\ *)
local _arg; _arg=$(_slash_args "/phi-auto" "$input")
case "${_arg:-status}" in
on)
AUTO_PHI_MODE="on"
larry_say "auto-PHI: on (default — Tier 1-4 detections tokenized; err on caution)"
;;
off)
AUTO_PHI_MODE="off"
larry_say "auto-PHI: off (explicit markers @@VALUE / {{phi:VALUE}} still work)"
;;
confirm)
AUTO_PHI_MODE="confirm"
larry_say "auto-PHI: confirm (Tier 3-4 matches prompt Y/n; Tier 1-2 still always tokenize)"
;;
status)
larry_say "auto-PHI: $AUTO_PHI_MODE (this session tokenized: $AUTO_PHI_SESSION_COUNT) log: $AUTO_PHI_LOG"
;;
*)
err "usage: /phi-auto on|off|confirm (no arg → status)"
;;
esac
continue ;;
/mouse|/mouse\ *) /mouse|/mouse\ *)
local _arg; _arg=$(_slash_args "/mouse" "$input") local _arg; _arg=$(_slash_args "/mouse" "$input")
case "${_arg:-status}" in case "${_arg:-status}" in
@ -3851,6 +4548,13 @@ EOF
input=$(preprocess_phi_markers "$input") input=$(preprocess_phi_markers "$input")
fi fi
# v0.7.3 — Automatic PHI detection. Runs AFTER explicit-marker handling so
# @@VALUE / {{phi:VALUE}} hits take precedence; auto-PHI fills gaps in
# things Bryan didn't manually mark. Per-turn "!nophi " prefix override
# is consumed inside auto_detect_phi. Bypassed entirely when mode=off.
# Supersedes af2ffe8 (reverted with v0.7.1).
input=$(auto_detect_phi user_input "$input")
log_section "user"; log_append "$input" log_section "user"; log_append "$input"
# v0.7.1: render the persistent status line BETWEEN turns — after the # v0.7.1: render the persistent status line BETWEEN turns — after the
# user has submitted real (non-slash, non-empty) input and after all # user has submitted real (non-slash, non-empty) input and after all

View File

@ -202,6 +202,26 @@ normalize_value() {
# it as an option flag. Same caveat applies for any future tr -d call. # it as an option flag. Same caveat applies for any future tr -d call.
printf '%s' "$value" | tr -d '[:space:]-' printf '%s' "$value" | tr -d '[:space:]-'
;; ;;
PHONE)
# v0.7.3: digits-only canonical so "(555) 123-4567", "5551234567",
# "555.123.4567", "555-123-4567" all collapse to one token.
# (Originally added in af2ffe8 — reverted with v0.7.1; re-added per
# v0.7.3 auto-PHI spec.)
printf '%s' "$value" | tr -cd '[:digit:]'
;;
EMAIL)
# v0.7.3: lowercase + outer-trim. RFC 5321 says the local-part is
# technically case-sensitive but in practice every mainstream provider
# treats it case-insensitively, so collapsing for PHI purposes is safe.
# (Originally added in af2ffe8 — reverted with v0.7.1; re-added per
# v0.7.3 auto-PHI spec.)
printf '%s' "$value" | tr '[:upper:]' '[:lower:]' | awk '{$1=$1; print}'
;;
NPI)
# v0.7.3: National Provider Identifier — 10 digits. Strip whitespace
# to canonical 10-digit form.
printf '%s' "$value" | tr -d '[:space:]-'
;;
*) *)
printf '%s' "$value" | awk '{$1=$1; print}' printf '%s' "$value" | awk '{$1=$1; print}'
;; ;;
@ -437,6 +457,35 @@ case "$SUB" in
count) shift; cmd_count "$@" ;; count) shift; cmd_count "$@" ;;
tokenize-value) shift; cmd_tokenize_value "$@" ;; tokenize-value) shift; cmd_tokenize_value "$@" ;;
detokenize-value) shift; cmd_detokenize_value "$@" ;; detokenize-value) shift; cmd_detokenize_value "$@" ;;
normalize-value)
# v0.7.3: normalize-value VALUE [CATEGORY] — emit canonical form without
# tokenizing or touching the table. Used by larry.sh's v0.7.3 auto-PHI
# detection to build per-session memory keys so "John Smith" and
# "JOHN SMITH" share one accept/decline decision.
# (Originally added in af2ffe8 — reverted with v0.7.1; re-added per
# v0.7.3 auto-PHI spec.)
shift
nv_val="${1:-}"; nv_cat="${2:-}"
[ -n "$nv_val" ] || die "normalize-value needs a VALUE"
[ -n "$nv_cat" ] || nv_cat=$(detect_category "$nv_val")
normalize_value "$nv_val" "$nv_cat"
printf '\n'
;;
lookup-original)
# v0.7.3: lookup-original VALUE CATEGORY — return token if value is
# already present in the lookup table (Tier-4 "already-known" check).
# Empty stdout + exit 0 if not found; non-empty token + exit 0 if found.
shift
lo_val="${1:-}"; lo_cat="${2:-}"
[ -n "$lo_val" ] || die "lookup-original needs a VALUE"
[ -n "$lo_cat" ] || lo_cat=$(detect_category "$lo_val")
lo_canonical=$(normalize_value "$lo_val" "$lo_cat")
[ -z "$lo_canonical" ] && lo_canonical="$lo_val"
[ -f "$DEFAULT_TABLE" ] || { printf ''; exit 0; }
awk -F'\t' -v cat="$lo_cat" -v c="$lo_canonical" -v orig="$lo_val" '
NR>1 && $2==cat && ($3==c || $4==orig) { print $1; exit }
' "$DEFAULT_TABLE"
;;
-h|--help) sed -n '2,30p' "$NC_SELF"; exit 0 ;; -h|--help) sed -n '2,30p' "$NC_SELF"; exit 0 ;;
*) *)
# Default = sanitize mode # Default = sanitize mode