This is the only path that closes V1, the free-text PHI gap (the dominant
real-world failure mode per Vera). Install is opt-in; larry runs in v0.8.1
mode on hosts without Presidio (MobaXterm/Cygwin, per Bryan's accepted
tradeoff).
New files:
- lib/phi-presidio-sidecar.py — FastAPI service on 127.0.0.1:$LARRY_PHI_PORT
(default 41189). Presidio AnalyzerEngine + AnonymizerEngine over spaCy
en_core_web_sm + 3 HL7-specific custom recognizers (HL7_MRN, HL7_CARET_NAME,
HL7_PHONE_BARE). POST /redact and GET /health.
- lib/phi-sidecar.sh — lifecycle (start/stop/status/health/ensure). ensure
is idempotent; called backgrounded from main_loop so it never blocks the
first prompt. Honors LARRY_PHI_VENV.
- lib/phi-client.sh — bash client (phi_client_available / phi_redact_text /
phi_redact_entities). CR-safe; 5s timeout bounds tier-5 stall.
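For context on the CR-safe claim: the client passes LARRY_PHI_PORT and LARRY_PHI_TIMEOUT through coerce_int when cygwin-safe.sh provides it, but the body of that helper is not shown here. A minimal sketch of what such a helper might do, assuming it strips a trailing carriage return and falls back to a default on non-numeric input (coerce_int_sketch is a hypothetical name, not the shipped function):

```bash
# Hypothetical stand-in for coerce_int; the real helper lives in
# cygwin-safe.sh and may behave differently.
# Usage: coerce_int_sketch VALUE DEFAULT
coerce_int_sketch() {
  local raw="$1" fallback="$2"
  raw="${raw%$'\r'}"                            # drop one trailing CR
  case "$raw" in
    ''|*[!0-9]*) printf '%s\n' "$fallback" ;;   # empty or non-digit: default
    *)           printf '%s\n' "$raw" ;;        # clean integer: keep it
  esac
}
```

With this shape, a port exported from a CRLF-edited profile as `41189<CR>` would come back as `41189`, and a garbage value would fall back to the default rather than reaching curl.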
larry.sh:
- auto_detect_phi gains tier-5: after tiers 1-4, before status summary,
source phi-client.sh, run Presidio on a token-masked copy of the input,
tokenize each entity through hl7-sanitize.sh tokenize-value (category
presidio_<TYPE>) so token IDs stay stable. Honors confirm + strict modes.
Removed the v0.7.3 early-return that skipped past tier-5 when tiers 1-4
found nothing — pure prose now always reaches tier-5.
- Token-safe substitution: existing [[...]] tokens are pulled to sentinels,
tier-5 value is replaced, sentinels restored — prevents the token-within-
token corruption that naive literal-replace caused on already-tokenized
text. Acronym guard drops HL7/clinical jargon (SSN/MRN/DOB/ADT) Presidio
over-tags as ORGANIZATION.
- Graceful degradation: sidecar unreachable → tier-5 no-ops with a one-time
stderr warning. Adds the /phi-sidecar slash command and its completion-table
entry.
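The token-safe substitution described above can be sketched as follows. This is an illustration only, not the larry.sh code: the function name, the choice of control characters as sentinels, and the assumption that input text never contains those control characters are all hypothetical.

```bash
# Hypothetical helper illustrating token-safe substitution; the names and the
# control-character sentinels are assumptions, not larry.sh's actual code.
# Usage: token_safe_replace TEXT VALUE TOKEN
token_safe_replace() {
  local text="$1" value="$2" token="$3"
  local re='\[\[[^]]*\]\]'                 # matches an existing [[...]] token
  local open=$'\x1d' mark=$'\x1f' close=$'\x1e'
  local -a saved=()
  local masked="" marks="" hit s i

  [ -z "$value" ] && { printf '%s' "$text"; return 0; }

  # 1. Pull each existing token out to a unique sentinel.
  while [[ "$text" =~ $re ]]; do
    hit="${BASH_REMATCH[0]}"
    saved+=("$hit")
    marks+="$mark"
    masked+="${text%%"$hit"*}${open}${marks}${close}"
    text="${text#*"$hit"}"
  done
  masked+="$text"

  # 2. Literal replace; quoting keeps glob characters and & inert.
  masked=${masked//"$value"/"$token"}

  # 3. Put the original tokens back in place of their sentinels.
  marks=""
  for i in "${!saved[@]}"; do
    marks+="$mark"
    s="${open}${marks}${close}"
    masked=${masked/"$s"/"${saved[$i]}"}
  done
  printf '%s' "$masked"
}
```

For example, `token_safe_replace 'Pt [[NAME_1]] saw Dr NAME today' 'NAME' '[[PRESIDIO_PERSON_2]]'` leaves `[[NAME_1]]` intact and replaces only the bare `NAME`, whereas a naive literal replace would also rewrite the `NAME` inside the existing token.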
install-larry.sh:
- Probes for python3 3.9+; offers to create $LARRY_HOME/phi-venv and install
presidio + fastapi + uvicorn + en_core_web_sm. Skips the offer, printing only
a v0.8.1-mode note, on Cygwin/MobaXterm without python3 and on
non-interactive pipe installs. Sets LARRY_PHI_VENV in the larry shim.
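A version probe like the one described above is easy to get wrong with string comparison, since "3.14" sorts before "3.9". The sketch below shows one way to do it numerically; both function names are hypothetical and the installer's real probe is not shown in this commit text.

```bash
# Hypothetical helpers; install-larry.sh's real probe is not shown here.
# version_at_least VERSION MAJOR MINOR -> 0 if VERSION >= MAJOR.MINOR,
# 1 if older, 2 if VERSION is not of the form N.N[.N...].
version_at_least() {
  local have="${1%$'\r'}" want_major="$2" want_minor="$3"
  local major minor
  case "$have" in *.*) ;; *) return 2 ;; esac
  major="${have%%.*}"
  minor="${have#*.}"; minor="${minor%%.*}"
  case "$major" in ''|*[!0-9]*) return 2 ;; esac
  case "$minor" in ''|*[!0-9]*) return 2 ;; esac
  # Numeric, not string, comparison: as strings "3.14" sorts before "3.9".
  (( 10#$major > want_major || (10#$major == want_major && 10#$minor >= want_minor) ))
}

# Returns 0 only if python3 is on PATH and reports 3.9 or newer.
python3_is_supported() {
  local ver
  command -v python3 >/dev/null 2>&1 || return 1
  ver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null) || return 1
  version_at_least "$ver" 3 9
}
```

An installer could call `python3_is_supported` before offering the venv and fall through to the v0.8.1-mode note otherwise.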
MANIFEST: three new lib files added for auto-sync.
Prototype validation (Bryan's Mac, Apple Silicon, Python 3.14):
  cold start (en_core_web_sm):    ~9s (vs ~82s if Presidio auto-grabs
                                  en_core_web_lg; we pin _sm for the REPL budget)
  warm analyzer latency:          P50 20.6ms / P95 22.7ms
  end-to-end HTTP round-trip:     ~57ms warm; ~150ms first-post-startup
All per-turn figures are comfortably under the 200ms-per-turn budget; the
cold start is a one-time cost that ensure pays in the background.
MobaXterm verdict: v0.8.2 is Mac/Linux-only. MobaXterm stays on v0.8.1 +
nudges, per Bryan's explicit acceptance. install-larry.sh enforces this
by platform detection at install time; at runtime, larry.sh tier-5 no-ops
(after the one-time stderr warning) when the sidecar is absent, which IS
the MobaXterm path, so no larry.sh code is platform-gated.
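The runtime behavior described above (no platform check, just a no-op when the sidecar is missing) might look roughly like this. The wrapper name, the globals, and the warning text are assumptions made for illustration; only phi_client_available and phi_redact_text come from phi-client.sh.

```bash
# Hypothetical wrapper; the function, variable names, and warning text are
# assumptions. It expects phi-client.sh to have been sourced when available.
_TIER5_WARNED=0
TIER5_RESULT=""

# tier5_pass TEXT -> sets TIER5_RESULT. A global is used instead of stdout
# because a $(...) capture runs in a subshell and would lose the warned flag.
tier5_pass() {
  local text="$1" out
  TIER5_RESULT="$text"                     # default: pass through unchanged
  if ! declare -F phi_client_available >/dev/null 2>&1 || ! phi_client_available; then
    if [ "$_TIER5_WARNED" -eq 0 ]; then
      echo "larry: PHI sidecar unreachable; tier-5 skipped" >&2
      _TIER5_WARNED=1
    fi
    return 0
  fi
  # Sidecar is up: use its output, but fail open if the call itself fails.
  if out=$(phi_redact_text "$text"); then
    TIER5_RESULT="$out"
  fi
  return 0
}
```

This sketch covers only the fail-open path. Per the notes above, strict mode aborts instead of passing text through, and that branch is not modeled here.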
Verification: bash -n clean on larry.sh + all 3 new lib scripts; python3
ast.parse clean on the sidecar; end-to-end tier-5 tested live against the
sidecar (pure prose, rule-pack+tier-5 combined with no token corruption,
!nophi bypass); strict-mode fail-closed abort tested; CR-taint, path-block,
and base64 round-trip batteries re-run green.
Co-Authored-By: Clover (Claude Opus 4.7) <noreply@anthropic.com>
118 lines
4.9 KiB
Bash
Executable File
#!/usr/bin/env bash
# ─────────────────────────────────────────────────────────────────────────────
# larry-anywhere v0.8.2: PHI Presidio client
#
# Bash wrapper around the Presidio sidecar's /redact endpoint. Sourced from
# larry.sh's auto-PHI pipeline as the tier-5 free-text NER pass.
#
# Functions (sourced):
#   phi_client_available   - 0 if sidecar reachable; 1 otherwise
#   phi_redact_text TEXT   - echo redacted form on stdout; non-zero on failure
#                            (in which case caller leaves TEXT unchanged;
#                            "fail-open" is the right call for tier-5 alone)
# Standalone:
#   ./phi-client.sh check                      - health probe
#   ./phi-client.sh redact "the patient ..."   - one-shot redact
#
# Wire-up in larry.sh:auto_detect_phi:
#   - After tier-1/2/3/4 produce hits and tokenize, BEFORE add_user_text,
#     call phi_redact_text on the (already-partially-tokenized) input.
#   - For each entity returned with score > threshold, tokenize via
#     hl7-sanitize.sh's tokenize-value (category = presidio_<entity_type>)
#     to maintain stable token IDs across surfaces.
# ─────────────────────────────────────────────────────────────────────────────
LARRY_PHI_PORT="${LARRY_PHI_PORT:-41189}"
LARRY_PHI_HOST="${LARRY_PHI_HOST:-127.0.0.1}"
LARRY_PHI_TIMEOUT="${LARRY_PHI_TIMEOUT:-5}"   # seconds; bounds tier-5 stall

# Defense against CR-tainted env (Cygwin v0.7.5 lesson).
_phi_client_dir="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" 2>/dev/null && pwd)"
if [ -f "$_phi_client_dir/cygwin-safe.sh" ]; then
  # shellcheck source=cygwin-safe.sh
  . "$_phi_client_dir/cygwin-safe.sh" 2>/dev/null || true
fi
if declare -F coerce_int >/dev/null 2>&1; then
  LARRY_PHI_PORT=$(coerce_int "$LARRY_PHI_PORT" 41189)
  LARRY_PHI_TIMEOUT=$(coerce_int "$LARRY_PHI_TIMEOUT" 5)
fi

phi_client_available() {
  curl -fsS -m 1 "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/health" >/dev/null 2>&1
}

# phi_redact_text TEXT → emits redacted TEXT on stdout, non-zero on any failure.
# JSON-quoting handled via jq so payload is safe for any control chars.
phi_redact_text() {
  local text="$1"
  [ -z "$text" ] && { printf ''; return 0; }
  # Build JSON payload via jq -n --arg (handles all escaping correctly).
  local payload
  payload=$(jq -nc --arg t "$text" '{text:$t}') || return 2
  local resp
  resp=$(curl -fsS -m "$LARRY_PHI_TIMEOUT" \
    -X POST -H 'Content-Type: application/json' \
    --data-binary "$payload" \
    "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/redact" 2>/dev/null) || return 3
  # Parse out the redacted text. Empty → upstream error.
  local redacted
  redacted=$(printf '%s' "$resp" | jq -r '.redacted // empty' 2>/dev/null) || return 4
  [ -z "$redacted" ] && return 5
  printf '%s' "$redacted"
  return 0
}

# Emit the entities (one per line: TYPE\tSTART\tEND\tSCORE\tVALUE) so the
# caller can re-tokenize with hl7-sanitize.sh's tokenize-value pipeline
# (categories: presidio_PERSON, presidio_LOCATION, etc.) for stable IDs.
phi_redact_entities() {
  local text="$1"
  [ -z "$text" ] && return 0
  local payload resp
  payload=$(jq -nc --arg t "$text" '{text:$t}') || return 2
  resp=$(curl -fsS -m "$LARRY_PHI_TIMEOUT" \
    -X POST -H 'Content-Type: application/json' \
    --data-binary "$payload" \
    "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/redact" 2>/dev/null) || return 3
  printf '%s' "$resp" | jq -r --arg input "$text" '
    .entities[]? |
    [.type, (.start|tostring), (.end|tostring), (.score|tostring), ($input[(.start|tonumber):(.end|tonumber)])] |
    @tsv
  ' 2>/dev/null
}

# Standalone CLI mode (when run, not sourced).
if [ "${BASH_SOURCE[0]:-$0}" = "$0" ]; then
  case "${1:-}" in
    check)
      if phi_client_available; then echo "phi-client: sidecar reachable"; exit 0
      else echo "phi-client: sidecar unreachable on $LARRY_PHI_HOST:$LARRY_PHI_PORT" >&2; exit 1; fi
      ;;
    redact)
      shift
      [ -z "${1:-}" ] && { echo "usage: phi-client.sh redact <text>" >&2; exit 2; }
      phi_redact_text "$1" || exit $?; echo   # propagate failure instead of masking it with echo's 0
      ;;
    entities)
      shift
      [ -z "${1:-}" ] && { echo "usage: phi-client.sh entities <text>" >&2; exit 2; }
      phi_redact_entities "$1"
      ;;
    *)
      cat <<USAGE
phi-client.sh: larry-anywhere v0.8.2 Presidio client

  check              health probe
  redact <text>      emit redacted text
  entities <text>    emit entities, tab-separated (TYPE START END SCORE VALUE)

Functions (when sourced):
  phi_client_available
  phi_redact_text <text>
  phi_redact_entities <text>

Env: LARRY_PHI_PORT (41189), LARRY_PHI_HOST (127.0.0.1), LARRY_PHI_TIMEOUT (5).
USAGE
      ;;
  esac
fi