cloverleaf-larry/lib/phi-client.sh
Bryan Johnson 60b8f0e1c8 v0.8.2: Presidio sidecar for free-text NER (tier-5) — closes V1
The only path that closes V1 (free-text PHI gap — the dominant real-world
failure mode per Vera). Opt-in install; larry runs in v0.8.1 mode on hosts
without Presidio (MobaXterm/Cygwin per Bryan's accepted tradeoff).

New files:
- lib/phi-presidio-sidecar.py — FastAPI service on 127.0.0.1:$LARRY_PHI_PORT
  (default 41189). Presidio AnalyzerEngine + AnonymizerEngine over spaCy
  en_core_web_sm + 3 HL7-specific custom recognizers (HL7_MRN, HL7_CARET_NAME,
  HL7_PHONE_BARE). POST /redact and GET /health.
- lib/phi-sidecar.sh — lifecycle (start/stop/status/health/ensure). ensure
  is idempotent; called backgrounded from main_loop so it never blocks the
  first prompt. Honors LARRY_PHI_VENV.
- lib/phi-client.sh — bash client (phi_client_available / phi_redact_text /
  phi_redact_entities). CR-safe; 5s timeout bounds tier-5 stall.
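
A caller might consume the entity stream from phi_redact_entities roughly as
sketched below. This is an illustration, not the larry.sh code: the helper
name filter_entities and the MIN_SCORE default of 0.5 are assumptions (the
actual threshold is not stated here). The tab-separated field order and the
acronym list are taken from this commit.

```shell
#!/usr/bin/env bash
# Sketch only: filter_entities and MIN_SCORE are hypothetical names.
MIN_SCORE="${MIN_SCORE:-0.5}"   # assumed threshold; the real value is not stated

# Reads TYPE<TAB>START<TAB>END<TAB>SCORE<TAB>VALUE lines on stdin and prints
# "presidio_<TYPE><TAB>VALUE" for each entity worth tokenizing.
filter_entities() {
    local type start end score value
    while IFS=$'\t' read -r type start end score value; do
        [ -n "$type" ] || continue
        # Bash has no float comparison, so delegate the score check to awk.
        awk -v s="$score" -v m="$MIN_SCORE" \
            'BEGIN { exit !(s + 0 > m + 0) }' </dev/null || continue
        # Acronym guard: drop jargon that gets over-tagged as ORGANIZATION.
        if [ "$type" = "ORGANIZATION" ]; then
            case "$value" in
                SSN|MRN|DOB|ADT) continue ;;
            esac
        fi
        printf 'presidio_%s\t%s\n' "$type" "$value"
    done
}
```

With the sidecar running, usage would look like
`phi_redact_entities "$text" | filter_entities`.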

larry.sh:
- auto_detect_phi gains tier-5: after tiers 1-4, before status summary,
  source phi-client.sh, run Presidio on a token-masked copy of the input,
  tokenize each entity through hl7-sanitize.sh tokenize-value (category
  presidio_<TYPE>) so token IDs stay stable. Honors confirm + strict modes.
  Removed the v0.7.3 early-return that skipped past tier-5 when tiers 1-4
  found nothing — pure prose now always reaches tier-5.
- Token-safe substitution: existing [[...]] tokens are pulled to sentinels,
  tier-5 value is replaced, sentinels restored — prevents the token-within-
  token corruption that naive literal-replace caused on already-tokenized
  text. Acronym guard drops HL7/clinical jargon (SSN/MRN/DOB/ADT) Presidio
  over-tags as ORGANIZATION.
- Graceful degradation: sidecar unreachable → tier-5 no-ops with a one-time
  stderr warning. /phi-sidecar slash command + completion table.
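
The token-safe substitution described above can be sketched as follows. This
is a minimal illustration under stated assumptions, not the code in larry.sh:
the function name token_safe_replace is hypothetical, the sentinel byte
\x01 is an arbitrary choice, and the input is assumed never to contain
that byte.

```shell
#!/usr/bin/env bash
# Sketch only: token_safe_replace is a hypothetical name.
# Usage: token_safe_replace TEXT VALUE REPLACEMENT
token_safe_replace() {
    local text=$1 value=$2 replacement=$3
    local sentinel=$'\x01'          # assumed absent from the input text
    local re='\[\[[^]]*\]\]'        # matches an existing [[...]] token
    local tok
    local -a saved=()

    [ -n "$value" ] || { printf '%s' "$text"; return 0; }

    # 1. Pull existing tokens out, left to right, leaving one sentinel each.
    while [[ $text =~ $re ]]; do
        tok=${BASH_REMATCH[0]}
        saved+=("$tok")
        text=${text/"$tok"/$sentinel}
    done

    # 2. Literal replace on text that no longer contains any token.
    text=${text//"$value"/"$replacement"}

    # 3. Restore tokens in the same left-to-right order.
    for tok in "${saved[@]}"; do
        text=${text/"$sentinel"/"$tok"}
    done
    printf '%s' "$text"
}
```

A naive `${text//Smith/...}` would also rewrite the "Smith" inside an
existing token such as [[PERSON_Smith_1]]; pulling tokens out first avoids
that.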

install-larry.sh:
- Probes python3 3.9+; offers to create $LARRY_HOME/phi-venv and install
  presidio + fastapi + uvicorn + en_core_web_sm. Skips silently (with a
  v0.8.1-mode note) on Cygwin/MobaXterm without python3, and on
  non-interactive pipe installs. Sets LARRY_PHI_VENV in the larry shim.
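
A version probe like the one described above has to compare numerically,
since a string compare would rank "3.14" below "3.9". The sketch below shows
one way to do that; python_version_ok is a hypothetical name and this is not
necessarily how install-larry.sh implements the check.

```shell
#!/usr/bin/env bash
# Sketch only: python_version_ok is a hypothetical helper.
# Takes the output of `python3 --version` (e.g. "Python 3.14.0") and
# succeeds when the version is 3.9 or newer.
python_version_ok() {
    local ver=${1#Python } major minor
    case "$ver" in *.*) ;; *) return 1 ;; esac   # need at least MAJOR.MINOR
    major=${ver%%.*}
    minor=${ver#*.}
    minor=${minor%%.*}
    minor=${minor%%[!0-9]*}                      # strip suffixes such as "rc1"
    case "$major" in ''|*[!0-9]*) return 1 ;; esac
    case "$minor" in ''|*[!0-9]*) return 1 ;; esac
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 9 ]; }
}
```

Usage would be along the lines of
`python_version_ok "$(python3 --version 2>&1)" || echo "staying in v0.8.1 mode"`.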

MANIFEST: three new lib files added for auto-sync.

Prototype validation (Bryan's Mac, Apple Silicon, Python 3.14):
  cold start (en_core_web_sm): ~9s   (vs ~82s if Presidio auto-grabs _lg;
                                       we pin _sm for the REPL budget)
  warm analyzer latency:       P50 20.6ms / P95 22.7ms
  end-to-end HTTP round-trip:  ~57ms warm; ~150ms first-post-startup
All comfortably under the 200ms-per-turn budget.

MobaXterm verdict: v0.8.2 is Mac/Linux-only. MobaXterm stays on v0.8.1 +
nudges, per Bryan's explicit acceptance. install-larry.sh enforces this
by platform detection; larry.sh tier-5 silently no-ops when the sidecar
is absent (which IS the MobaXterm path; nothing in larry.sh itself is
platform-gated).
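
The no-op-with-one-warning behavior might look roughly like the sketch below.
The names tier5_apply, TIER5_RESULT and _tier5_warned are hypothetical; only
phi_client_available and phi_redact_text come from phi-client.sh. The result
is returned in a variable rather than on stdout so that the warned flag is
not lost in a command-substitution subshell. Strict mode (fail-closed) is not
covered here.

```shell
#!/usr/bin/env bash
# Sketch only: tier5_apply, TIER5_RESULT and _tier5_warned are hypothetical.
_tier5_warned=0
TIER5_RESULT=""

tier5_apply() {
    local text=$1 out
    TIER5_RESULT=$text      # fail-open default: input passes through unchanged
    if declare -F phi_client_available >/dev/null 2>&1 \
        && phi_client_available \
        && out=$(phi_redact_text "$text"); then
        TIER5_RESULT=$out
        return 0
    fi
    # Sidecar missing or failing: warn once per session, then stay quiet.
    if [ "$_tier5_warned" -eq 0 ]; then
        echo "tier-5: Presidio sidecar unreachable; skipping free-text NER" >&2
        _tier5_warned=1
    fi
    return 0
}
```

On a host without the sidecar, every call leaves the text untouched and only
the first call prints the warning.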

Verification: bash -n clean on larry.sh + all 3 new lib scripts; python3
ast.parse clean on the sidecar; end-to-end tier-5 tested live against the
sidecar (pure prose, rule-pack+tier-5 combined with no token corruption,
!nophi bypass); strict-mode fail-closed abort tested; CR-taint, path-block,
and base64 round-trip batteries re-run green.

Co-Authored-By: Clover (Claude Opus 4.7) <noreply@anthropic.com>
2026-05-27 20:00:23 -07:00

#!/usr/bin/env bash
# ─────────────────────────────────────────────────────────────────────────────
# larry-anywhere v0.8.2: PHI Presidio client
#
# Bash wrapper around the Presidio sidecar's /redact endpoint. Sourced from
# larry.sh's auto-PHI pipeline as the tier-5 free-text NER pass.
#
# Functions (sourced):
#   phi_client_available      0 if sidecar reachable; 1 otherwise
#   phi_redact_text TEXT      echo redacted form on stdout; non-zero on failure
#                             (in which case caller leaves TEXT unchanged;
#                             "fail-open" is the right call for tier-5 alone)
#   phi_redact_entities TEXT  echo one entity per line as tab-separated
#                             TYPE, START, END, SCORE, VALUE
# Standalone:
#   ./phi-client.sh check                       health probe
#   ./phi-client.sh redact "the patient ..."    one-shot redact
#   ./phi-client.sh entities "the patient ..."  one-shot entity listing
#
# Wire-up in larry.sh:auto_detect_phi:
#   - After tier-1/2/3/4 produce hits and tokenize, BEFORE add_user_text,
#     call phi_redact_text on the (already-partially-tokenized) input.
#   - For each entity returned with score > threshold, tokenize via
#     hl7-sanitize.sh's tokenize-value (category = presidio_<entity_type>)
#     to maintain stable token IDs across surfaces.
# ─────────────────────────────────────────────────────────────────────────────
LARRY_PHI_PORT="${LARRY_PHI_PORT:-41189}"
LARRY_PHI_HOST="${LARRY_PHI_HOST:-127.0.0.1}"
LARRY_PHI_TIMEOUT="${LARRY_PHI_TIMEOUT:-5}" # seconds — bounds tier-5 stall
# Defense against CR-tainted env (Cygwin v0.7.5 lesson).
_phi_client_dir="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" 2>/dev/null && pwd)"
if [ -f "$_phi_client_dir/cygwin-safe.sh" ]; then
    # shellcheck source=cygwin-safe.sh
    . "$_phi_client_dir/cygwin-safe.sh" 2>/dev/null || true
fi
if declare -F coerce_int >/dev/null 2>&1; then
    LARRY_PHI_PORT=$(coerce_int "$LARRY_PHI_PORT" 41189)
    LARRY_PHI_TIMEOUT=$(coerce_int "$LARRY_PHI_TIMEOUT" 5)
fi
phi_client_available() {
    curl -fsS -m 1 "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/health" >/dev/null 2>&1
}
# phi_redact_text TEXT → emits redacted TEXT on stdout, non-zero on any failure.
# JSON-quoting handled via jq so payload is safe for any control chars.
phi_redact_text() {
    local text="$1"
    [ -z "$text" ] && { printf ''; return 0; }
    # Build JSON payload via jq -n --arg (handles all escaping correctly).
    local payload
    payload=$(jq -nc --arg t "$text" '{text:$t}') || return 2
    local resp
    resp=$(curl -fsS -m "$LARRY_PHI_TIMEOUT" \
        -X POST -H 'Content-Type: application/json' \
        --data-binary "$payload" \
        "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/redact" 2>/dev/null) || return 3
    # Parse out the redacted text. Empty → upstream error.
    local redacted
    redacted=$(printf '%s' "$resp" | jq -r '.redacted // empty' 2>/dev/null) || return 4
    [ -z "$redacted" ] && return 5
    printf '%s' "$redacted"
    return 0
}
# Emit the entities (one per line: TYPE\tSTART\tEND\tSCORE\tVALUE) so the
# caller can re-tokenize with hl7-sanitize.sh's tokenize-value pipeline
# (categories: presidio_PERSON, presidio_LOCATION, etc.) for stable IDs.
phi_redact_entities() {
    local text="$1"
    [ -z "$text" ] && return 0
    local payload resp
    payload=$(jq -nc --arg t "$text" '{text:$t}') || return 2
    resp=$(curl -fsS -m "$LARRY_PHI_TIMEOUT" \
        -X POST -H 'Content-Type: application/json' \
        --data-binary "$payload" \
        "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/redact" 2>/dev/null) || return 3
    # VALUE is sliced out of the original text using the returned offsets.
    printf '%s' "$resp" | jq -r --arg input "$text" '
        .entities[]? |
        [.type, (.start|tostring), (.end|tostring), (.score|tostring), ($input[(.start|tonumber):(.end|tonumber)])] |
        @tsv
    ' 2>/dev/null
}
# Standalone CLI mode (when run, not sourced).
if [ "${BASH_SOURCE[0]:-$0}" = "$0" ]; then
    case "${1:-}" in
        check)
            if phi_client_available; then
                echo "phi-client: sidecar reachable"
                exit 0
            else
                echo "phi-client: sidecar unreachable on $LARRY_PHI_HOST:$LARRY_PHI_PORT" >&2
                exit 1
            fi
            ;;
        redact)
            shift
            [ -z "${1:-}" ] && { echo "usage: phi-client.sh redact <text>" >&2; exit 2; }
            # Propagate the failure code instead of masking it with echo's 0.
            phi_redact_text "$1" || exit $?
            echo
            ;;
        entities)
            shift
            [ -z "${1:-}" ] && { echo "usage: phi-client.sh entities <text>" >&2; exit 2; }
            phi_redact_entities "$1"
            ;;
        *)
            cat <<USAGE
phi-client.sh — larry-anywhere v0.8.2 Presidio client
  check              health probe
  redact <text>      emit redacted text
  entities <text>    emit entities (tab-separated TYPE START END SCORE VALUE)

Functions (when sourced):
  phi_client_available
  phi_redact_text <text>
  phi_redact_entities <text>
Env: LARRY_PHI_PORT (41189), LARRY_PHI_HOST (127.0.0.1), LARRY_PHI_TIMEOUT (5).
USAGE
            ;;
    esac
fi