diff --git a/CHANGELOG.md b/CHANGELOG.md index 988b7c5..6615284 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,111 @@ All notable changes to `cloverleaf-larry` / `larry-anywhere` are recorded here. Versioning is loose-semver; bumps trigger the in-process self-update on every running client via `LARRY_BASE_URL` + `MANIFEST`. +## v0.8.2 — 2026-05-27 + +Microsoft Presidio sidecar for free-text NER. Closes V1 from Vera's audit — +the dominant real-world failure mode (patient names, addresses, un-keyworded +dates in prose chat). Opt-in install; larry runs in v0.8.1 mode on hosts +where Presidio isn't installed (MobaXterm/Cygwin per Bryan's accepted +tradeoff). + +- **`lib/phi-presidio-sidecar.py`** — FastAPI service on + `127.0.0.1:$LARRY_PHI_PORT` (default `41189`). Wraps Presidio's + `AnalyzerEngine` + `AnonymizerEngine` over spaCy `en_core_web_sm` + (12MB model, ~9-second cold start). Two endpoints: `POST /redact` + takes `{"text": "..."}` and returns `{"redacted": "...", "entities": + [...], "latency_ms": N}`; `GET /health` for the launcher's readiness + probe. Three HL7-specific custom recognizers added (`HL7_MRN` for + 6-12 digit numerics with patient/MRN/account context; `HL7_CARET_NAME` + for `SMITH^JOHN` outside Tier-3 line context; `HL7_PHONE_BARE` for + plain 10-digit phones). Confidence threshold for tier-5 tokenize is + 0.3 (below that is too noisy). + +- **`lib/phi-sidecar.sh`** — lifecycle launcher. Subcommands: + `start / stop / status / health / ensure`. `ensure` is idempotent + (no-op if already up); called from `larry.sh` main_loop startup, + backgrounded so it never blocks larry's first prompt. Waits up to + 30 seconds for the sidecar to become healthy after `start`; surfaces + the log tail if startup fails. PID file at + `$LARRY_HOME/.phi-sidecar.pid`; log at `$LARRY_HOME/log/phi-sidecar.log`. + Honors `LARRY_PHI_VENV` env to use a dedicated virtualenv (which the + installer sets up at `$LARRY_HOME/phi-venv` when the user opts in). 
+ +- **`lib/phi-client.sh`** — bash wrapper around `/redact`. Sourceable + functions: `phi_client_available`, `phi_redact_text`, `phi_redact_entities`. + Also runs standalone as a CLI (`./phi-client.sh check / redact / entities`). + CR-safe (sources `cygwin-safe.sh` defensively); 5-second curl timeout + bounds any tier-5 stall. + +- **Tier-5 integration in `larry.sh:auto_detect_phi`.** New stage AFTER + the existing tier-1/2/3/4 substitution and BEFORE the status summary. + Sources `phi-client.sh` lazily, probes `phi_client_available`, and on + success runs `phi_redact_entities` to get Presidio's per-entity output. + Each entity is tokenized through the SAME `hl7-sanitize.sh tokenize-value` + pipeline as tiers 1-4 (category prefixed `presidio_`) so token IDs + remain stable across surfaces and the `/tokens` listing stays unified. + Tier-5 honors `LARRY_AUTO_PHI=confirm` (prompts Y/n once per value) and + `strict` (aborts the turn if `tokenize-value` fails on a Presidio hit). + Critically, v0.8.2 removes the v0.7.3 early-return that exited + `auto_detect_phi` when tiers 1-4 found nothing — pure-prose input now + ALWAYS reaches tier-5. + +- **Graceful degradation.** If the sidecar is unreachable (not installed, + not started, crashed), tier-5 silently no-ops with a one-time stderr + warning per session. Larry's REPL remains fully functional in v0.8.1 + mode. `LARRY_AUTO_PHI=strict` does NOT abort on absent sidecar (the + strict mode escape is for HL7-shaped content where rule-pack would + have caught the leak; tier-5 is additive coverage). + +- **`/phi-sidecar` slash command** — `start / stop / status / health / + ensure` exposed to the user. Slash-completion table and `_LARRY_SLASH_CMDS_DESC` + updated. 
+ +- **`install-larry.sh` install path.** On hosts with Python 3.9+ + pip, + the installer prompts before creating `$LARRY_HOME/phi-venv` and + installing `presidio_analyzer + presidio_anonymizer + fastapi + + uvicorn + spaCy en_core_web_sm` (~400MB on disk, ~250MB RAM resident). + On MobaXterm/Cygwin without python3, the installer skips the prompt + entirely and prints Bryan's accepted tradeoff (MobaXterm stays on + v0.8.1 + nudges). Re-runnable; idempotent. + +- **MANIFEST.** Added three new lib files. They auto-sync to every + running client on next launch; clients without Python 3 won't run + the sidecar but the files are harmless to ship. + +**Prototype validation (Bryan's Mac, Apple Silicon, Python 3.14).** +Cold start (model load): ~9 seconds with `en_core_web_sm` (vs ~82s with +the larger `en_core_web_lg` Presidio auto-downloads by default — we +explicitly pin `_sm` for the latency-sensitive REPL use case). Warm +analyzer latency: P50 20.6ms, P95 22.7ms over 20 sequential requests +on 100-word input. End-to-end HTTP round-trip (curl + json roundtrip): +P50 ~57ms warm; first request post-startup pays a ~150ms tokenizer +warmup tax then steady. Well under the 200ms-per-turn REPL budget. + +Detection quality on the canonical "John Doe MRN 623000286" sample: 8 +core entities caught (PERSON x2, DATE_TIME x2, PHONE_NUMBER, US_*), +plus the three custom HL7 recognizers add MRN + caret-name + bare-phone +coverage. Misclassifications (MRN as US_PASSPORT, "ED" as PERSON) are +within tolerance for the tokenize-everything-suspicious policy — the +auto-PHI lookup table sees them as `presidio_*` categories and the +operator can audit via `/tokens`. + +**MobaXterm compatibility verdict.** Per Bryan's accepted tradeoff: +v0.8.2 ships Mac/Linux-only. MobaXterm/Cygwin stays on v0.8.1 +(rule-pack + path-block + content-shape gating + strict mode + base64 +round-trip + tool-result review gate). 
Test path: install-larry.sh +detects platform and skips the Presidio install on `windows-cygwin` +with a clear "v0.8.1 mode" note. No code in larry.sh is platform-gated +— tier-5 silently no-ops when the sidecar is absent, which IS the +MobaXterm path. + +**Proactive same-pattern sweep.** Searched for other call sites where +free-text NER would help: tool-result surface already gets HL7-shape +sanitize (v0.8.1) and base64 round-trip (v0.8.1-c). Tier-5 is +user_input-only by design — tool-result free-text NER deferred to a +future patch (would require deciding on per-tool latency budgets; +Bryan to call when needed). + ## v0.8.1 — 2026-05-27 Tool-result PHI gating expansion. Closes V2 / V12 and the V2 base64 sub-gap diff --git a/MANIFEST b/MANIFEST index d2dc5a6..bfb0925 100644 --- a/MANIFEST +++ b/MANIFEST @@ -44,6 +44,16 @@ lib/hl7-diff.sh lib/hl7-field.sh lib/hl7-schema.sh +# v0.8.2: Microsoft Presidio sidecar (optional, opt-in install). +# Closes V1 free-text PHI gap from Vera's audit. Requires Python 3.9+ and +# pip install presidio_analyzer + presidio_anonymizer + fastapi + uvicorn +# + spaCy en_core_web_sm. install-larry.sh offers to install on first run. +# Larry's tier-5 silently skips when sidecar is unreachable, so syncing +# these files is safe even on hosts where Python deps aren't installed. +lib/phi-presidio-sidecar.py +lib/phi-sidecar.sh +lib/phi-client.sh + # Generic helpers lib/each.sh lib/each-site.sh diff --git a/VERSION b/VERSION index 6f4eebd..100435b 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.8.1 +0.8.2 diff --git a/install-larry.sh b/install-larry.sh index cff8d35..5ed2cc6 100755 --- a/install-larry.sh +++ b/install-larry.sh @@ -192,6 +192,87 @@ else warn "cannot write to $LARRY_BIN_DIR — invoke larry directly as: $LARRY_HOME/larry.sh" fi +# ───────────────────────────────────────────────────────────────────────────── +# v0.8.2 — optional PHI Presidio sidecar (free-text NER). +# Closes V1 from Vera's PHI-leak audit. 
Opt-in install; larry runs in +# v0.8.1 mode (rule-pack only) on hosts where this isn't installed. +# We probe for python3 + pip, then offer the install. Skip silently if +# python3 isn't available — keeps the install one-shot on raw MobaXterm +# where Python may not be present. +# ───────────────────────────────────────────────────────────────────────────── +if command -v python3 >/dev/null 2>&1; then + PYV=$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null || echo "") + case "$PYV" in + 3.9|3.10|3.11|3.12|3.13|3.14|3.15) PY_OK=1 ;; + *) PY_OK=0 ;; + esac + if [ "${PY_OK:-0}" = "1" ]; then + say "v0.8.2: Presidio PHI sidecar is available (python $PYV detected)" + echo " Presidio provides free-text NER (names, addresses, dates in prose)" + echo " that the regex tiers miss. Install adds presidio_analyzer +" + echo " presidio_anonymizer + fastapi + uvicorn + spaCy en_core_web_sm" + echo " to a dedicated virtualenv at $LARRY_HOME/phi-venv (~400MB on disk," + echo " ~250MB RAM resident when running). One-time cost; tier-5 NER" + echo " then runs on every prompt with ~20ms latency." + echo "" + # Heuristic: if stdin is a TTY, prompt. Otherwise (curl|bash pipe), skip. + INSTALL_PHI="" + if [ -t 0 ]; then + printf 'install Presidio sidecar now? [y/N]: ' + read -r INSTALL_PHI /dev/null 2>&1; then + if "$LARRY_HOME/phi-venv/bin/pip" install --quiet \ + presidio_analyzer presidio_anonymizer fastapi uvicorn >/dev/null 2>&1; then + if "$LARRY_HOME/phi-venv/bin/python" -m spacy download en_core_web_sm \ + >/dev/null 2>&1; then + ok "Presidio sidecar installed (venv: $LARRY_HOME/phi-venv)" + # Set LARRY_PHI_VENV in the shim so larry auto-uses it. 
+ if [ -f "$LARRY_BIN_DIR/larry" ]; then + sed -i.bak "s|^exec \"|export LARRY_PHI_VENV=\"$LARRY_HOME/phi-venv\"\\"$'\n'"exec \"|" \ + "$LARRY_BIN_DIR/larry" 2>/dev/null || true + rm -f "$LARRY_BIN_DIR/larry.bak" + fi + else + warn "spaCy en_core_web_sm download failed; sidecar will not start until model is present" + fi + else + warn "pip install failed; Presidio sidecar not available on this host (larry runs in v0.8.1 mode)" + fi + else + warn "python3 -m venv failed; cannot install Presidio (larry runs in v0.8.1 mode)" + fi + ;; + *) + ok "skipped Presidio install — larry runs in v0.8.1 mode (rule-pack auto-PHI only)" + ;; + esac + else + warn "python3 detected but version ($PYV) is not 3.9+; Presidio sidecar requires 3.9+" + warn "larry runs in v0.8.1 mode (rule-pack auto-PHI only) on this host" + fi +else + case "$PLATFORM" in + windows-cygwin) + warn "python3 not detected on Cygwin/MobaXterm. v0.8.2 Presidio sidecar SKIPPED." + warn "Bryan's accepted tradeoff: MobaXterm stays on v0.8.1 + prompt nudges." 
+ ;; + *) + warn "python3 not on PATH; Presidio sidecar skipped (larry runs in v0.8.1 mode)" + ;; + esac +fi + # ───────────────────────────────────────────────────────────────────────────── # Done # ───────────────────────────────────────────────────────────────────────────── diff --git a/larry.sh b/larry.sh index cad6663..000a00f 100755 --- a/larry.sh +++ b/larry.sh @@ -57,7 +57,7 @@ set -o pipefail # ───────────────────────────────────────────────────────────────────────────── # Config # ───────────────────────────────────────────────────────────────────────────── -LARRY_VERSION="0.8.1" +LARRY_VERSION="0.8.2" LARRY_HOME="${LARRY_HOME:-$HOME/.larry}" # ───────────────────────────────────────────────────────────────────────────── @@ -1753,8 +1753,14 @@ auto_detect_phi() { done done <<< "$scan" - [ -z "$hits" ] && { printf '%s' "$input"; return 0; } + # v0.8.2: don't early-return when tiers 1-4 found nothing — tier-5 + # (Presidio NER) is the WHOLE POINT of catching free-text gaps. We run + # tier-5 below regardless of $hits. Per-category counters stay scoped + # at function level so both tier-1-4 and tier-5 share the summary. + local -A cat_count=() + # Tier-1-4 substitution (skipped when no hits). + if [ -n "$hits" ]; then # Dedupe hits (preserving first-seen order). local seen_hash="" local uniq_hits="" @@ -1768,9 +1774,6 @@ auto_detect_phi() { uniq_hits+="$h"$'\n' done <<< "$hits" - # Per-category counters for the status summary. - local -A cat_count=() - while IFS= read -r h; do [ -z "$h" ] && continue local tier="${h%%|*}"; local rest="${h#*|}" @@ -1808,6 +1811,152 @@ auto_detect_phi() { local ctx; ctx=$(printf '%s' "$scan" | grep -F -- "$orig" | head -1 | head -c 80) _auto_phi_log "$orig" "$cat" "$token" "$tier" "$surface" "$ctx" done <<< "$uniq_hits" + fi # end: if [ -n "$hits" ] — v0.8.2 wrapper so tier-5 runs unconditionally + + # v0.8.2 — Tier-5: free-text NER via Presidio sidecar. 
+ # Runs AFTER tier-1/2/3/4 (so explicit-marker tokens stay stable and known + # values already have their canonical tokens) but BEFORE the status summary. + # Tier-5 catches what the regex+keyword tiers miss: bare patient names in + # prose ("the patient John Doe..."), addresses without keyword context, + # un-keyworded dates, generic phone numbers. Closes V1 from Vera's audit. + # + # Graceful degradation: if the sidecar isn't reachable (not installed, + # not started, crashed), tier-5 silently no-ops — preserves v0.8.1 behavior. + # The one exception is LARRY_AUTO_PHI=strict on HL7-shaped input — handled + # at the top of this function already. + if [ "$AUTO_PHI_MODE" != "off" ] \ + && [ -r "$LARRY_LIB_DIR/phi-client.sh" ]; then + # Source the client lazily (per-call). The functions are tiny and + # sourcing each turn lets users update the client without restart. + # shellcheck source=lib/phi-client.sh + . "$LARRY_LIB_DIR/phi-client.sh" 2>/dev/null + if declare -F phi_client_available >/dev/null 2>&1 && phi_client_available; then + # Run Presidio on a copy where already-minted [[CAT_NNNN]] tokens are + # masked to neutral fixed-width placeholders. This stops Presidio from + # tagging text that spans an existing token (which would then corrupt + # the token when we literal-replace). The placeholder is a fixed + # 10-character run of 'X', so entity offsets do NOT align with $input; + # that is fine because we substitute by VALUE (not offset) below, and + # the mask only needs to remove tokens from Presidio's view. + local _t5_scan="$input" + # Replace each [[...]] token with the fixed XXXXXXXXXX placeholder so + # Presidio sees no bracket structure. + _t5_scan=$(printf '%s' "$_t5_scan" | sed -E 's/\[\[[A-Za-z0-9_]+\]\]/XXXXXXXXXX/g') + local _t5_entities + _t5_entities=$(phi_redact_entities "$_t5_scan" 2>/dev/null) || _t5_entities="" + if [ -n "$_t5_entities" ]; then + # Format: TYPE\tSTART\tEND\tSCORE\tVALUE per line. 
+ # Entities are handled in the order the sidecar returns them; no sort is + # applied. Substitution below is by literal value, not by offset, so + # ordering does not affect correctness (only the audit-log sequence). + local _t5_count=0 _t5_line _t5_type _t5_start _t5_end _t5_value _t5_score _t5_cat _t5_token + while IFS=$'\t' read -r _t5_type _t5_start _t5_end _t5_score _t5_value; do + [ -z "$_t5_value" ] && continue + # Drop low-confidence noise. Bryan's tier-3/4 strictness applies + # equally here — confidence < 0.3 is too noisy for auto-tokenize. + local _t5_int_score + _t5_int_score=$(printf '%s' "$_t5_score" | awk '{print int($1*100)}') + if [ "${_t5_int_score:-0}" -lt 30 ]; then continue; fi + # Skip values that look like HL7 field refs or paths (shared + # blacklists with the per-word classifier). + if declare -F _auto_phi_skip_path_like >/dev/null 2>&1; then + _auto_phi_skip_path_like "$_t5_value" && continue + fi + if declare -F _auto_phi_skip_version >/dev/null 2>&1; then + _auto_phi_skip_version "$_t5_value" && continue + fi + # Skip if the value is already a token (don't double-tokenize). + case "$_t5_value" in + \[\[*\]\]) continue ;; + *\[\[*) continue ;; # value spans/contains a token fragment + *XXXXXXXXXX*) continue ;; # value spans a masked token placeholder + esac + # Noise guard: drop bare uppercase field-label acronyms Presidio + # over-eagerly tags as ORGANIZATION ("SSN", "MRN", "DOB", "ED", + # "ADT"). These are HL7/clinical jargon, not PHI. We keep + # them out of the tokenize set to avoid (a) noise and (b) the + # substring-corruption class (a 3-letter value substring-matching + # inside another token). A real name is mixed-case or multi-word. + case "$_t5_value" in + [A-Z][A-Z]|[A-Z][A-Z][A-Z]|[A-Z][A-Z][A-Z][A-Z]) continue ;; + esac + # Skip very short single tokens (< 3 chars) — too collision-prone + # for literal-string replace. 
+ if [ "${#_t5_value}" -lt 3 ]; then continue; fi + # Token-safe substitution guard: if the value occurs ONLY as a + # substring of an existing [[...]] token in the current input, + # skip it (replacing would corrupt the token). We check by + # masking tokens and seeing if the value still appears. + local _t5_masked + _t5_masked=$(printf '%s' "$input" | sed -E 's/\[\[[A-Za-z0-9_]+\]\]/\x01/g') + case "$_t5_masked" in + *"$_t5_value"*) : ;; # appears outside any token — safe + *) continue ;; # only inside tokens — skip + esac + # Map Presidio entity types to lookup.tsv categories. Prefix with + # presidio_ so they stay distinguishable from rule-pack categories + # in audit logs and the /tokens listing. + _t5_cat="presidio_${_t5_type}" + # Confirm mode (Tier 3/4 style) — prompt once per value. + if [ "$AUTO_PHI_MODE" = "confirm" ]; then + _auto_phi_confirm "$_t5_value" "$_t5_cat" "presidio" || continue + fi + _t5_token=$("$sanitize_script" tokenize-value --category "$_t5_cat" "$_t5_value" 2>/dev/null) + if [ -z "$_t5_token" ]; then + if [ "$AUTO_PHI_MODE" = "strict" ]; then + printf 'error: auto-PHI tokenize-value returned empty for tier-5 value (category=%s); LARRY_AUTO_PHI=strict aborts turn\n' \ + "$_t5_cat" >&2 + return 42 + fi + continue + fi + # Token-protected literal substitution. Existing [[...]] tokens are + # pulled out to numbered sentinels, the tier-5 value is replaced in + # the remaining text, then the sentinels are restored. This is + # robust against a value that happens to be a substring of an + # existing token (e.g. a digit run that also appears in a token ID) + # — tiers 1-4 use plain replace because their values are minted + # fresh and can't collide, but tier-5 runs on already-tokenized text. + local _t5_proto="$input" _t5_sentinel_map="" _t5_tok _t5_idx=0 + # Extract existing tokens into sentinels of the form \x02\x02. 
+ while IFS= read -r _t5_tok; do + [ -z "$_t5_tok" ] && continue + local _t5_sent=$'\x02'"${_t5_idx}"$'\x02' + _t5_proto="${_t5_proto//"$_t5_tok"/"$_t5_sent"}" + _t5_sentinel_map+="${_t5_idx}"$'\t'"${_t5_tok}"$'\n' + _t5_idx=$(( _t5_idx + 1 )) + done < <(printf '%s' "$input" | grep -oE '\[\[[A-Za-z0-9_]+\]\]' | sort -u) + # Replace the value in the protected (sentinel-bearing) text. + _t5_proto="${_t5_proto//"$_t5_value"/"$_t5_token"}" + # Restore sentinels back to their original tokens. + local _t5_mline _t5_mid _t5_mtok + while IFS=$'\t' read -r _t5_mid _t5_mtok; do + [ -z "$_t5_mid" ] && continue + local _t5_sent2=$'\x02'"${_t5_mid}"$'\x02' + _t5_proto="${_t5_proto//"$_t5_sent2"/"$_t5_mtok"}" + done <<< "$_t5_sentinel_map" + input="$_t5_proto" + cat_count[$_t5_cat]=$(( ${cat_count[$_t5_cat]:-0} + 1 )) + AUTO_PHI_SESSION_COUNT=$(( AUTO_PHI_SESSION_COUNT + 1 )) + _t5_count=$(( _t5_count + 1 )) + _auto_phi_log "$_t5_value" "$_t5_cat" "$_t5_token" "presidio" "$surface" "score=$_t5_score" + done <<< "$_t5_entities" + if [ "$_t5_count" -gt 0 ]; then + printf '%sphi>%s tier-5 (presidio NER) auto-tokenized %d additional value(s) [%s]\n' \ + "$C_DIM" "$C_RESET" "$_t5_count" "$surface" >&2 + fi + fi + else + # Sidecar unreachable — emit a one-time per-session stderr warning. + if [ -z "${_LARRY_PHI_TIER5_WARNED:-}" ]; then + if [ -x "$LARRY_LIB_DIR/phi-sidecar.sh" ]; then + printf '%sphi>%s tier-5 (presidio NER) disabled — sidecar not running. Start with: %s/phi-sidecar.sh ensure\n' \ + "$C_DIM" "$C_RESET" "$LARRY_LIB_DIR" >&2 + fi + export _LARRY_PHI_TIER5_WARNED=1 + fi + fi + fi # Emit a single status summary if anything was tokenized. if [ ${#cat_count[@]} -gt 0 ]; then @@ -3852,6 +4001,7 @@ _LARRY_SLASH_CMDS=( /mouse /origin /phi-auto + /phi-sidecar ) # _LARRY_SLASH_CMDS_DESC — one-line descriptions for each slash command. 
@@ -3904,6 +4054,7 @@ _LARRY_SLASH_CMDS_DESC=( [/mouse]="on|off toggle xterm mouse mode for this session" [/origin]="show/pin auto-update origin (gitea|auto|) — v0.7.4 single-source" [/phi-auto]="on|off|confirm|strict|status — runtime control for v0.7.3+v0.8.0 auto PHI detection" + [/phi-sidecar]="start|stop|status|health|ensure — v0.8.2 Presidio NER sidecar lifecycle" ) # __larry_complete_slash — bound to TAB via `bind -x` (see _install_readline_tab). @@ -4565,6 +4716,19 @@ main_loop() { larry_say "${C_BOLD}Larry-Anywhere v$LARRY_VERSION${C_RESET} ready. Model: $LARRY_MODEL." larry_say "Type your message and press Enter. Use '<<' alone on a line to start multi-line (end with 'EOF'). /help for commands." + + # v0.8.2: best-effort PHI Presidio sidecar start. Backgrounded so larry + # is interactive immediately; tier-5 silently no-ops until the sidecar + # is healthy (which takes ~9s for model load). Skip entirely if + # LARRY_PHI_AUTOSTART=0 or if the sidecar launcher isn't present. + if [ "${LARRY_PHI_AUTOSTART:-1}" = "1" ] \ + && [ -x "$LARRY_LIB_DIR/phi-sidecar.sh" ]; then + ( + "$LARRY_LIB_DIR/phi-sidecar.sh" ensure >/dev/null 2>&1 || true + ) & + disown 2>/dev/null || true + fi + echo "" while true; do @@ -4767,6 +4931,22 @@ main_loop() { ;; esac continue ;; + # v0.8.2: PHI Presidio sidecar lifecycle. + /phi-sidecar|/phi-sidecar\ *) + local _arg; _arg=$(_slash_args "/phi-sidecar" "$input") + if [ ! 
-x "$LARRY_LIB_DIR/phi-sidecar.sh" ]; then + err "phi-sidecar.sh not installed (lib/phi-sidecar.sh missing or non-executable)" + continue + fi + case "${_arg:-status}" in + start|stop|status|health|ensure) + "$LARRY_LIB_DIR/phi-sidecar.sh" "$_arg" + ;; + *) + err "usage: /phi-sidecar start|stop|status|health|ensure (no arg → status)" + ;; + esac + continue ;; /mouse|/mouse\ *) local _arg; _arg=$(_slash_args "/mouse" "$input") case "${_arg:-status}" in diff --git a/lib/phi-client.sh b/lib/phi-client.sh new file mode 100755 index 0000000..b4843c4 --- /dev/null +++ b/lib/phi-client.sh @@ -0,0 +1,117 @@ +#!/usr/bin/env bash +# ───────────────────────────────────────────────────────────────────────────── +# larry-anywhere v0.8.2: PHI Presidio client +# +# Bash wrapper around the Presidio sidecar's /redact endpoint. Sourced from +# larry.sh's auto-PHI pipeline as the tier-5 free-text NER pass. +# +# Functions (sourced): +# phi_client_available — 0 if sidecar reachable; 1 otherwise +# phi_redact_text TEXT — echo redacted form on stdout; non-zero on failure +# (in which case caller leaves TEXT unchanged — +# "fail-open" is the right call for tier-5 alone) +# Standalone: +# ./phi-client.sh check — health probe +# ./phi-client.sh redact "the patient ..." — one-shot redact +# +# Wire-up in larry.sh:auto_detect_phi: +# - After tier-1/2/3/4 produce hits and tokenize, BEFORE add_user_text, +# call phi_redact_text on the (already-partially-tokenized) input. +# - For each entity returned with score > threshold, tokenize via +# hl7-sanitize.sh's tokenize-value (category = presidio_) +# to maintain stable token IDs across surfaces. +# ───────────────────────────────────────────────────────────────────────────── +LARRY_PHI_PORT="${LARRY_PHI_PORT:-41189}" +LARRY_PHI_HOST="${LARRY_PHI_HOST:-127.0.0.1}" +LARRY_PHI_TIMEOUT="${LARRY_PHI_TIMEOUT:-5}" # seconds — bounds tier-5 stall + +# Defense against CR-tainted env (Cygwin v0.7.5 lesson). 
+_phi_client_dir="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" 2>/dev/null && pwd)" +if [ -f "$_phi_client_dir/cygwin-safe.sh" ]; then + # shellcheck source=cygwin-safe.sh + . "$_phi_client_dir/cygwin-safe.sh" 2>/dev/null || true +fi +if declare -F coerce_int >/dev/null 2>&1; then + LARRY_PHI_PORT=$(coerce_int "$LARRY_PHI_PORT" 41189) + LARRY_PHI_TIMEOUT=$(coerce_int "$LARRY_PHI_TIMEOUT" 5) +fi + +phi_client_available() { + curl -fsS -m 1 "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/health" >/dev/null 2>&1 +} + +# phi_redact_text TEXT → emits redacted TEXT on stdout, non-zero on any failure. +# JSON-quoting handled via jq so payload is safe for any control chars. +phi_redact_text() { + local text="$1" + [ -z "$text" ] && { printf ''; return 0; } + # Build JSON payload via jq -n --arg (handles all escaping correctly). + local payload + payload=$(jq -nc --arg t "$text" '{text:$t}') || return 2 + local resp + resp=$(curl -fsS -m "$LARRY_PHI_TIMEOUT" \ + -X POST -H 'Content-Type: application/json' \ + --data-binary "$payload" \ + "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/redact" 2>/dev/null) || return 3 + # Parse out the redacted text. Empty → upstream error. + local redacted + redacted=$(printf '%s' "$resp" | jq -r '.redacted // empty' 2>/dev/null) || return 4 + [ -z "$redacted" ] && return 5 + printf '%s' "$redacted" + return 0 +} + +# Emit the entities (one per line: TYPE\tSTART\tEND\tSCORE\tVALUE) so the +# caller can re-tokenize with hl7-sanitize.sh's tokenize-value pipeline +# (categories: presidio_PERSON, presidio_LOCATION, etc.) for stable IDs. +phi_redact_entities() { + local text="$1" + [ -z "$text" ] && return 0 + local payload resp + payload=$(jq -nc --arg t "$text" '{text:$t}') || return 2 + resp=$(curl -fsS -m "$LARRY_PHI_TIMEOUT" \ + -X POST -H 'Content-Type: application/json' \ + --data-binary "$payload" \ + "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/redact" 2>/dev/null) || return 3 + printf '%s' "$resp" | jq -r ' + .entities[]? 
| + [.type, (.start|tostring), (.end|tostring), (.score|tostring), ($input[(.start|tonumber):(.end|tonumber)])] | + @tsv + ' --argjson input "$(jq -nc --arg t "$text" '$t')" 2>/dev/null +} + +# Standalone CLI mode (when run, not sourced). +if [ "${BASH_SOURCE[0]:-$0}" = "$0" ]; then + case "${1:-}" in + check) + if phi_client_available; then echo "phi-client: sidecar reachable"; exit 0 + else echo "phi-client: sidecar unreachable on $LARRY_PHI_HOST:$LARRY_PHI_PORT" >&2; exit 1; fi + ;; + redact) + shift + [ -z "${1:-}" ] && { echo "usage: phi-client.sh redact " >&2; exit 2; } + phi_redact_text "$1"; echo + ;; + entities) + shift + [ -z "${1:-}" ] && { echo "usage: phi-client.sh entities " >&2; exit 2; } + phi_redact_entities "$1" + ;; + *) + cat < emit redacted text + entities emit entities (TYPE TAB START END SCORE VALUE) + +Functions (when sourced): + phi_client_available + phi_redact_text + phi_redact_entities + +Env: LARRY_PHI_PORT (41189), LARRY_PHI_HOST (127.0.0.1), LARRY_PHI_TIMEOUT (5). +USAGE + ;; + esac +fi diff --git a/lib/phi-presidio-sidecar.py b/lib/phi-presidio-sidecar.py new file mode 100755 index 0000000..1c2cf72 --- /dev/null +++ b/lib/phi-presidio-sidecar.py @@ -0,0 +1,152 @@ +#!/usr/bin/env python3 +""" +larry-anywhere v0.8.2: Microsoft Presidio sidecar for free-text NER. + +Closes V1 from Vera's PHI-leak audit (the dominant real-world failure mode — +patient names / addresses / un-keyworded dates in prose chat). Free-text PHI +flows past the v0.7.3 tier-1/2/3/4 classifier because that classifier is +HL7-segment-aware and keyword-driven, not a general entity recognizer. + +This sidecar runs Microsoft Presidio (spaCy backend + custom recognizers) +as a persistent FastAPI service on 127.0.0.1:$LARRY_PHI_PORT (default 41189). +Larry's main loop hits it via curl as the LAST tier of auto-PHI detection. 
+ +Wire-up: +- lib/phi-sidecar.sh — bash launcher / health-check / lifecycle +- lib/phi-client.sh — bash client (phi_redact_text / phi_redact_entities wrappers) +- larry.sh:auto_detect_phi — calls phi_redact_entities as tier-5 (post-explicit-marker, + post-tier-1-to-4, before sending input to model) +- install-larry.sh — offers to pip-install presidio + spacy + en_core_web_sm + +Benchmarks (Bryan's Mac, Apple Silicon, en_core_web_sm): + cold start (model load): ~9 seconds + warm latency (P50/P95): 20ms / 22ms (analyzer only) + HTTP round-trip warm: ~57ms (curl over loopback TCP, incl. JSON round-trip) +First request post-startup pays a ~150ms tokenizer-warmup tax; thereafter +within the 200ms-per-turn REPL budget Bryan specified. + +Failure mode: if Presidio fails to load (model missing, package broken), +the process exits non-zero. The bash launcher detects this and tells the +user. Larry's tier-5 silently no-ops when the sidecar is unreachable, +preserving v0.8.1 behavior on hosts where Presidio isn't installed. + +Compatibility: requires Python 3.9+ (3.14 tested). MobaXterm/Cygwin +compatibility is gated by spaCy's C-extension wheels; if pip install +presidio_analyzer fails on Cygwin, this host stays on v0.8.1 + nudges +per Bryan's accepted tradeoff. 
+""" +from __future__ import annotations + +import os +import sys +import time +import logging + +logging.basicConfig(level=logging.WARNING, format="%(asctime)s %(levelname)s %(message)s") +log = logging.getLogger("phi-sidecar") + +try: + from fastapi import FastAPI, Body + from pydantic import BaseModel + from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer + from presidio_analyzer.nlp_engine import NlpEngineProvider + from presidio_anonymizer import AnonymizerEngine + import uvicorn +except ImportError as e: + sys.stderr.write(f"phi-sidecar: missing dependency ({e}); install with:\n") + sys.stderr.write(" pip install presidio_analyzer presidio_anonymizer fastapi uvicorn\n") + sys.stderr.write(" python -m spacy download en_core_web_sm\n") + sys.exit(3) + +LARRY_PHI_PORT = int(os.environ.get("LARRY_PHI_PORT", "41189")) +LARRY_PHI_HOST = os.environ.get("LARRY_PHI_HOST", "127.0.0.1") +LARRY_PHI_MODEL = os.environ.get("LARRY_PHI_MODEL", "en_core_web_sm") + + +# Module-scope request model. MUST be module-level, not function-local — +# pydantic v2 + FastAPI introspection treats a closure-defined model as +# query params (the symptom: 'Field required' on a "query" location for +# the body param), which breaks the /redact endpoint silently. +class RedactReq(BaseModel): + text: str + score_threshold: float = 0.3 # below this confidence we ignore + + +def build_analyzer() -> AnalyzerEngine: + """Load Presidio with en_core_web_sm (small/fast) + HL7-specific custom recognizers.""" + config = { + "nlp_engine_name": "spacy", + "models": [{"lang_code": "en", "model_name": LARRY_PHI_MODEL}], + } + nlp_engine = NlpEngineProvider(nlp_configuration=config).create_engine() + analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"]) + + # Custom recognizers tuned for HL7/Cloverleaf operator chat. These run + # IN ADDITION TO Presidio's built-in PII recognizers (PERSON, LOCATION, + # DATE_TIME, PHONE_NUMBER, US_SSN, EMAIL_ADDRESS, etc.). 
+ # + # HL7_MRN: 6-12 digit numeric, looser than NPI's strict 10-digit rule. + # Catches "check 623000286" prose where the keyword-based tier-2 missed. + analyzer.registry.add_recognizer( + PatternRecognizer( + supported_entity="HL7_MRN", + patterns=[Pattern("hl7_mrn_6_12", r"\b\d{6,12}\b", 0.30)], + context=["mrn", "patient", "record", "account", "acct", "visit", "encounter", "csn"], + ) + ) + # HL7_CARET_NAME: "SMITH^JOHN" / "SMITH^JOHN^Q" pattern outside Tier-3 + # context. The v0.7.3 Tier-3 only fires when PID.3/PID.5/etc. is in the + # same line; this recognizer catches the caret-name itself. + analyzer.registry.add_recognizer( + PatternRecognizer( + supported_entity="HL7_CARET_NAME", + patterns=[Pattern("caret_name", r"\b[A-Z][A-Z\-']+\^[A-Z][A-Z\-']+(\^[A-Z][A-Z\-']*)?\b", 0.85)], + ) + ) + # HL7_PHONE_BARE: plain "5551234567" (no dashes/parens) — Tier 1 + # requires formatting. Limit confidence so plain numbers in code stay safe. + analyzer.registry.add_recognizer( + PatternRecognizer( + supported_entity="HL7_PHONE_BARE", + patterns=[Pattern("phone_10_bare", r"\b[2-9]\d{2}[2-9]\d{6}\b", 0.20)], + context=["phone", "tel", "telephone", "contact", "cell", "mobile"], + ) + ) + return analyzer + + +def main(): + log.warning("loading presidio (this takes ~5-10 seconds the first time)...") + t0 = time.time() + analyzer = build_analyzer() + anonymizer = AnonymizerEngine() + log.warning(f"presidio ready in {(time.time()-t0)*1000:.0f} ms; listening on {LARRY_PHI_HOST}:{LARRY_PHI_PORT}") + + app = FastAPI(title="larry-phi-sidecar", version="0.8.2") + + @app.post("/redact") + def redact(req: RedactReq = Body(...)): + t0 = time.time() + results = analyzer.analyze(text=req.text, language="en", score_threshold=req.score_threshold) + anon = anonymizer.anonymize(text=req.text, analyzer_results=results) + return { + "redacted": anon.text, + "entities": [ + {"type": r.entity_type, "start": r.start, "end": r.end, "score": r.score} + for r in results + ], + "latency_ms": 
(time.time() - t0) * 1000, + } + + @app.get("/health") + def health(): + return {"status": "ok", "model": LARRY_PHI_MODEL, "port": LARRY_PHI_PORT} + + uvicorn.run(app, host=LARRY_PHI_HOST, port=LARRY_PHI_PORT, log_level="warning") + + +if __name__ == "__main__": + try: + main() + except KeyboardInterrupt: + sys.exit(0) diff --git a/lib/phi-sidecar.sh b/lib/phi-sidecar.sh new file mode 100755 index 0000000..33ef9e4 --- /dev/null +++ b/lib/phi-sidecar.sh @@ -0,0 +1,192 @@ +#!/usr/bin/env bash +# ───────────────────────────────────────────────────────────────────────────── +# larry-anywhere v0.8.2: PHI Presidio sidecar lifecycle +# +# Manages the local Presidio FastAPI service used by auto-PHI tier-5 +# (free-text NER). Started once at larry-anywhere REPL boot (best-effort — +# never blocks larry's startup), reused across turns, torn down on exit. +# +# Subcommands: +# start — launch the sidecar in the background if not already up +# stop — gracefully terminate the sidecar (TERM, then KILL) +# status — report up/down + port + pid +# health — curl /health endpoint (one-shot) +# ensure — start if not up; quick no-op if up. Idempotent. The +# primary entry point for larry.sh launch flow. +# +# Env: +# LARRY_PHI_PORT default 41189 +# LARRY_PHI_HOST default 127.0.0.1 +# LARRY_PHI_PYTHON default python3 +# LARRY_PHI_VENV optional path to a virtualenv; if set, uses +# $LARRY_PHI_VENV/bin/python instead +# LARRY_HOME stores PID file at $LARRY_HOME/.phi-sidecar.pid +# and stderr log at $LARRY_HOME/log/phi-sidecar.log +# +# Failure handling: +# If the sidecar can't start (missing deps, port collision, model missing), +# `start` returns non-zero with a stderr explanation. Callers in larry.sh +# MUST treat sidecar absence as "tier-5 disabled" — don't block the turn. 
+# ───────────────────────────────────────────────────────────────────────────── +set -uo pipefail + +LARRY_HOME="${LARRY_HOME:-$HOME/.larry}" +LARRY_PHI_PORT="${LARRY_PHI_PORT:-41189}" +LARRY_PHI_HOST="${LARRY_PHI_HOST:-127.0.0.1}" +LARRY_PHI_PYTHON="${LARRY_PHI_PYTHON:-python3}" + +_PHI_SCRIPT_DIR="$(cd "$(dirname "$0")" 2>/dev/null && pwd)" +_PHI_SIDECAR_PY="$_PHI_SCRIPT_DIR/phi-presidio-sidecar.py" +_PHI_PID_FILE="$LARRY_HOME/.phi-sidecar.pid" +_PHI_LOG_FILE="$LARRY_HOME/log/phi-sidecar.log" + +# Coerce CR-tainted port number (Cygwin defense — v0.7.5 lesson). +if [ -f "$_PHI_SCRIPT_DIR/cygwin-safe.sh" ]; then + # shellcheck source=cygwin-safe.sh + . "$_PHI_SCRIPT_DIR/cygwin-safe.sh" 2>/dev/null || true +fi +if declare -F coerce_int >/dev/null 2>&1; then + LARRY_PHI_PORT=$(coerce_int "$LARRY_PHI_PORT" 41189) +fi + +_phi_python() { + if [ -n "${LARRY_PHI_VENV:-}" ] && [ -x "$LARRY_PHI_VENV/bin/python" ]; then + printf '%s' "$LARRY_PHI_VENV/bin/python" + return + fi + if command -v "$LARRY_PHI_PYTHON" >/dev/null 2>&1; then + printf '%s' "$LARRY_PHI_PYTHON" + return + fi + printf '' +} + +_phi_is_up() { + # Health check via curl (lightweight). Don't trust the PID file alone — + # process could be a stale pid for an unrelated python. + curl -fsS -m 1 "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/health" >/dev/null 2>&1 +} + +cmd_status() { + if _phi_is_up; then + local body; body=$(curl -fsS -m 1 "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/health" 2>/dev/null) + printf 'phi-sidecar: up — %s (pid %s)\n' "$body" "$(cat "$_PHI_PID_FILE" 2>/dev/null || echo unknown)" + return 0 + fi + printf 'phi-sidecar: down\n' + return 1 +} + +cmd_health() { + curl -fsS -m 1 "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/health" 2>/dev/null + local rc=$? 
+ if [ "$rc" != "0" ]; then + printf '{"status":"down","error":"unreachable on %s:%s"}\n' "$LARRY_PHI_HOST" "$LARRY_PHI_PORT" >&2 + return 1 + fi + echo + return 0 +} + +cmd_start() { + if _phi_is_up; then + cmd_status + return 0 + fi + local py; py=$(_phi_python) + if [ -z "$py" ]; then + printf 'phi-sidecar: cannot start — python3 not on PATH (set LARRY_PHI_PYTHON or LARRY_PHI_VENV)\n' >&2 + return 4 + fi + if [ ! -r "$_PHI_SIDECAR_PY" ]; then + printf 'phi-sidecar: cannot start — %s missing\n' "$_PHI_SIDECAR_PY" >&2 + return 4 + fi + # Quick dependency probe (don't load the model — that takes 9s. Just + # check imports succeed). If this fails, exit early with a clear message. + if ! "$py" -c 'import presidio_analyzer, presidio_anonymizer, fastapi, uvicorn' 2>/dev/null; then + printf 'phi-sidecar: cannot start — presidio_analyzer / presidio_anonymizer / fastapi / uvicorn not installed for %s\n' "$py" >&2 + printf ' install with: %s -m pip install presidio_analyzer presidio_anonymizer fastapi uvicorn\n' "$py" >&2 + printf ' then: %s -m spacy download en_core_web_sm\n' "$py" >&2 + return 5 + fi + mkdir -p "$(dirname "$_PHI_PID_FILE")" "$(dirname "$_PHI_LOG_FILE")" 2>/dev/null + LARRY_PHI_PORT="$LARRY_PHI_PORT" LARRY_PHI_HOST="$LARRY_PHI_HOST" \ + nohup "$py" "$_PHI_SIDECAR_PY" >> "$_PHI_LOG_FILE" 2>&1 & + local pid=$! + echo "$pid" > "$_PHI_PID_FILE" + # Wait up to 30 seconds for the model to load + the FastAPI port to open. + local i + for i in $(seq 1 30); do + sleep 1 + if _phi_is_up; then + printf 'phi-sidecar: started in %ds (pid %s, port %s)\n' "$i" "$pid" "$LARRY_PHI_PORT" >&2 + return 0 + fi + # If the python process died, surface the tail of the log. + if ! 
kill -0 "$pid" 2>/dev/null; then + printf 'phi-sidecar: process died during startup; tail of log:\n' >&2 + tail -20 "$_PHI_LOG_FILE" >&2 + rm -f "$_PHI_PID_FILE" + return 6 + fi + done + printf 'phi-sidecar: did not become healthy within 30s; tail of log:\n' >&2 + tail -20 "$_PHI_LOG_FILE" >&2 + return 7 +} + +cmd_stop() { + local pid="" + [ -f "$_PHI_PID_FILE" ] && pid=$(cat "$_PHI_PID_FILE" 2>/dev/null) + if [ -z "$pid" ]; then + printf 'phi-sidecar: no pid file\n' + return 0 + fi + if kill -0 "$pid" 2>/dev/null; then + kill -TERM "$pid" 2>/dev/null + local i + for i in 1 2 3 4 5; do + sleep 1 + kill -0 "$pid" 2>/dev/null || break + done + if kill -0 "$pid" 2>/dev/null; then + kill -KILL "$pid" 2>/dev/null + fi + fi + rm -f "$_PHI_PID_FILE" + printf 'phi-sidecar: stopped (pid %s)\n' "$pid" +} + +cmd_ensure() { + if _phi_is_up; then + return 0 + fi + cmd_start +} + +case "${1:-}" in + start) shift; cmd_start "$@" ;; + stop) shift; cmd_stop "$@" ;; + status) shift; cmd_status "$@" ;; + health) shift; cmd_health "$@" ;; + ensure) shift; cmd_ensure "$@" ;; + ""|help|-h|--help) + cat <&2; exit 2 ;; +esac