v0.8.2: Presidio sidecar for free-text NER (tier-5) — closes V1
The only path that closes V1 (free-text PHI gap — the dominant real-world
failure mode per Vera). Opt-in install; larry runs in v0.8.1 mode on hosts
without Presidio (MobaXterm/Cygwin per Bryan's accepted tradeoff).
New files:
- lib/phi-presidio-sidecar.py — FastAPI service on 127.0.0.1:$LARRY_PHI_PORT
(default 41189). Presidio AnalyzerEngine + AnonymizerEngine over spaCy
en_core_web_sm + 3 HL7-specific custom recognizers (HL7_MRN, HL7_CARET_NAME,
HL7_PHONE_BARE). POST /redact and GET /health.
- lib/phi-sidecar.sh — lifecycle (start/stop/status/health/ensure). ensure
is idempotent; called backgrounded from main_loop so it never blocks the
first prompt. Honors LARRY_PHI_VENV.
- lib/phi-client.sh — bash client (phi_client_available / phi_redact_text /
phi_redact_entities). CR-safe; 5s timeout bounds tier-5 stall.
larry.sh:
- auto_detect_phi gains tier-5: after tiers 1-4, before status summary,
source phi-client.sh, run Presidio on a token-masked copy of the input,
tokenize each entity through hl7-sanitize.sh tokenize-value (category
presidio_<TYPE>) so token IDs stay stable. Honors confirm + strict modes.
Removed the v0.7.3 early-return that skipped past tier-5 when tiers 1-4
found nothing — pure prose now always reaches tier-5.
- Token-safe substitution: existing [[...]] tokens are pulled to sentinels,
tier-5 value is replaced, sentinels restored — prevents the token-within-
token corruption that naive literal-replace caused on already-tokenized
text. Acronym guard drops HL7/clinical jargon (SSN/MRN/DOB/ADT) Presidio
over-tags as ORGANIZATION.
- Graceful degradation: sidecar unreachable → tier-5 no-ops with a one-time
stderr warning. /phi-sidecar slash command + completion table.
install-larry.sh:
- Probes python3 3.9+; offers to create $LARRY_HOME/phi-venv and install
presidio + fastapi + uvicorn + en_core_web_sm. Skips silently (with a
v0.8.1-mode note) on Cygwin/MobaXterm without python3, and on
non-interactive pipe installs. Sets LARRY_PHI_VENV in the larry shim.
MANIFEST: three new lib files added for auto-sync.
Prototype validation (Bryan's Mac, Apple Silicon, Python 3.14):
cold start (en_core_web_sm): ~9s (vs ~82s if Presidio auto-grabs _lg;
we pin _sm for the REPL budget)
warm analyzer latency: P50 20.6ms / P95 22.7ms
end-to-end HTTP round-trip: ~57ms warm; ~150ms first-post-startup
All comfortably under the 200ms-per-turn budget.
MobaXterm verdict: v0.8.2 is Mac/Linux-only. MobaXterm stays on v0.8.1 +
nudges, per Bryan's explicit acceptance. install-larry.sh enforces this
by platform detection; larry.sh tier-5 silently no-ops when the sidecar
is absent (which IS the MobaXterm path — no code is platform-gated).
Verification: bash -n clean on larry.sh + all 3 new lib scripts; python3
ast.parse clean on the sidecar; end-to-end tier-5 tested live against the
sidecar (pure prose, rule-pack+tier-5 combined with no token corruption,
!nophi bypass); strict-mode fail-closed abort tested; CR-taint, path-block,
and base64 round-trip batteries re-run green.
Co-Authored-By: Clover (Claude Opus 4.7) <noreply@anthropic.com>
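
The token-safe substitution described in the message above can be sketched in Python. This is a minimal illustration, not the committed bash: instead of swapping tokens out to sentinels and restoring them, it splits the text on existing `[[...]]` tokens and replaces only in the segments between them, which gives the same protection. The helper name `replace_outside_tokens` and the sample token IDs are assumptions made for the example.

```python
import re

# Same token shape the diff's sed/grep patterns match: [[CAT_NNNN]]
TOKEN_SPLIT_RE = re.compile(r"(\[\[[A-Za-z0-9_]+\]\])")


def replace_outside_tokens(text: str, value: str, token: str) -> str:
    """Replace `value` with `token`, leaving existing [[...]] tokens untouched."""
    if not value:
        return text  # str.replace("") would insert the token between every character
    # With a capturing group, re.split keeps the delimiters:
    # even indices are plain text, odd indices are existing tokens.
    parts = TOKEN_SPLIT_RE.split(text)
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace(value, token)
    return "".join(parts)


if __name__ == "__main__":
    line = "MRN [[MRN_0001]] for John, id 0001"
    print(replace_outside_tokens(line, "0001", "[[presidio_ID_0002]]"))
```

A naive `line.replace("0001", ...)` would also rewrite the digits inside `[[MRN_0001]]`; the segment-wise version leaves that token intact.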
parent 9fc38e743d
commit 60b8f0e1c8

CHANGELOG.md (105 lines changed)
@@ -4,6 +4,111 @@ All notable changes to `cloverleaf-larry` / `larry-anywhere` are recorded here.
 Versioning is loose-semver; bumps trigger the in-process self-update on every
 running client via `LARRY_BASE_URL` + `MANIFEST`.
 
+## v0.8.2 — 2026-05-27
+
+Microsoft Presidio sidecar for free-text NER. Closes V1 from Vera's audit —
+the dominant real-world failure mode (patient names, addresses, un-keyworded
+dates in prose chat). Opt-in install; larry runs in v0.8.1 mode on hosts
+where Presidio isn't installed (MobaXterm/Cygwin per Bryan's accepted
+tradeoff).
+
+- **`lib/phi-presidio-sidecar.py`** — FastAPI service on
+  `127.0.0.1:$LARRY_PHI_PORT` (default `41189`). Wraps Presidio's
+  `AnalyzerEngine` + `AnonymizerEngine` over spaCy `en_core_web_sm`
+  (12MB model, ~9-second cold start). Two endpoints: `POST /redact`
+  takes `{"text": "..."}` and returns `{"redacted": "...", "entities":
+  [...], "latency_ms": N}`; `GET /health` for the launcher's readiness
+  probe. Three HL7-specific custom recognizers added (`HL7_MRN` for
+  6-12 digit numerics with patient/MRN/account context; `HL7_CARET_NAME`
+  for `SMITH^JOHN` outside Tier-3 line context; `HL7_PHONE_BARE` for
+  plain 10-digit phones). Confidence threshold for tier-5 tokenize is
+  0.3 (below that is too noisy).
+
+- **`lib/phi-sidecar.sh`** — lifecycle launcher. Subcommands:
+  `start / stop / status / health / ensure`. `ensure` is idempotent
+  (no-op if already up); called from `larry.sh` main_loop startup,
+  backgrounded so it never blocks larry's first prompt. Waits up to
+  30 seconds for the sidecar to become healthy after `start`; surfaces
+  the log tail if startup fails. PID file at
+  `$LARRY_HOME/.phi-sidecar.pid`; log at `$LARRY_HOME/log/phi-sidecar.log`.
+  Honors `LARRY_PHI_VENV` env to use a dedicated virtualenv (which the
+  installer sets up at `$LARRY_HOME/phi-venv` when the user opts in).
+
+- **`lib/phi-client.sh`** — bash wrapper around `/redact`. Sourceable
+  functions: `phi_client_available`, `phi_redact_text`, `phi_redact_entities`.
+  Also runs standalone as a CLI (`./phi-client.sh check / redact / entities`).
+  CR-safe (sources `cygwin-safe.sh` defensively); 5-second curl timeout
+  bounds any tier-5 stall.
+
+- **Tier-5 integration in `larry.sh:auto_detect_phi`.** New stage AFTER
+  the existing tier-1/2/3/4 substitution and BEFORE the status summary.
+  Sources `phi-client.sh` lazily, probes `phi_client_available`, and on
+  success runs `phi_redact_entities` to get Presidio's per-entity output.
+  Each entity is tokenized through the SAME `hl7-sanitize.sh tokenize-value`
+  pipeline as tiers 1-4 (category prefixed `presidio_<TYPE>`) so token IDs
+  remain stable across surfaces and the `/tokens` listing stays unified.
+  Tier-5 honors `LARRY_AUTO_PHI=confirm` (prompts Y/n once per value) and
+  `strict` (aborts the turn if `tokenize-value` fails on a Presidio hit).
+  Critically, v0.8.2 removes the v0.7.3 early-return that exited
+  `auto_detect_phi` when tiers 1-4 found nothing — pure-prose input now
+  ALWAYS reaches tier-5.
+
+- **Graceful degradation.** If the sidecar is unreachable (not installed,
+  not started, crashed), tier-5 silently no-ops with a one-time stderr
+  warning per session. Larry's REPL remains fully functional in v0.8.1
+  mode. `LARRY_AUTO_PHI=strict` does NOT abort on absent sidecar (the
+  strict mode escape is for HL7-shaped content where rule-pack would
+  have caught the leak; tier-5 is additive coverage).
+
+- **`/phi-sidecar` slash command** — `start / stop / status / health /
+  ensure` exposed to the user. Slash-completion table and `_LARRY_SLASH_CMDS_DESC`
+  updated.
+
+- **`install-larry.sh` install path.** On hosts with Python 3.9+ + pip,
+  the installer prompts before creating `$LARRY_HOME/phi-venv` and
+  installing `presidio_analyzer + presidio_anonymizer + fastapi +
+  uvicorn + spaCy en_core_web_sm` (~400MB on disk, ~250MB RAM resident).
+  On MobaXterm/Cygwin without python3, the installer skips the prompt
+  entirely and prints Bryan's accepted tradeoff (MobaXterm stays on
+  v0.8.1 + nudges). Re-runnable; idempotent.
+
+- **MANIFEST.** Added three new lib files. They auto-sync to every
+  running client on next launch; clients without Python 3 won't run
+  the sidecar but the files are harmless to ship.
+
+**Prototype validation (Bryan's Mac, Apple Silicon, Python 3.14).**
+Cold start (model load): ~9 seconds with `en_core_web_sm` (vs ~82s with
+the larger `en_core_web_lg` Presidio auto-downloads by default — we
+explicitly pin `_sm` for the latency-sensitive REPL use case). Warm
+analyzer latency: P50 20.6ms, P95 22.7ms over 20 sequential requests
+on 100-word input. End-to-end HTTP round-trip (curl + json roundtrip):
+P50 ~57ms warm; first request post-startup pays a ~150ms tokenizer
+warmup tax then steady. Well under the 200ms-per-turn REPL budget.
+
+Detection quality on the canonical "John Doe MRN 623000286" sample: 8
+core entities caught (PERSON x2, DATE_TIME x2, PHONE_NUMBER, US_*),
+plus the three custom HL7 recognizers add MRN + caret-name + bare-phone
+coverage. Misclassifications (MRN as US_PASSPORT, "ED" as PERSON) are
+within tolerance for the tokenize-everything-suspicious policy — the
+auto-PHI lookup table sees them as `presidio_*` categories and the
+operator can audit via `/tokens`.
+
+**MobaXterm compatibility verdict.** Per Bryan's accepted tradeoff:
+v0.8.2 ships Mac/Linux-only. MobaXterm/Cygwin stays on v0.8.1
+(rule-pack + path-block + content-shape gating + strict mode + base64
+round-trip + tool-result review gate). Test path: install-larry.sh
+detects platform and skips the Presidio install on `windows-cygwin`
+with a clear "v0.8.1 mode" note. No code in larry.sh is platform-gated
+— tier-5 silently no-ops when the sidecar is absent, which IS the
+MobaXterm path.
+
+**Proactive same-pattern sweep.** Searched for other call sites where
+free-text NER would help: tool-result surface already gets HL7-shape
+sanitize (v0.8.1) and base64 round-trip (v0.8.1-c). Tier-5 is
+user_input-only by design — tool-result free-text NER deferred to a
+future patch (would require deciding on per-tool latency budgets;
+Bryan to call when needed).
+
 ## v0.8.1 — 2026-05-27
 
 Tool-result PHI gating expansion. Closes V2 / V12 and the V2 base64 sub-gap
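
The prototype-validation paragraph quotes P50/P95 latencies against a 200ms-per-turn budget. As a rough sketch of how such figures could be derived from raw timings, the snippet below uses the standard library's `statistics.quantiles`. The function name `latency_summary` and the sample values are hypothetical; the changelog does not say how the published numbers were actually computed.

```python
import statistics


def latency_summary(samples_ms, budget_ms=200.0):
    """Return P50/P95 of the samples and whether P95 fits the per-turn budget."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile,
    # index 94 the 95th. Needs at least two samples.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    p50, p95 = cuts[49], cuts[94]
    return {"p50": p50, "p95": p95, "within_budget": p95 <= budget_ms}


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, not the measurements from the changelog.
    timings = [20.1, 20.4, 20.6, 20.9, 21.3, 22.0, 22.7, 23.5]
    print(latency_summary(timings))
```

Checking P95 rather than the mean against the budget keeps a single slow warm-up request from hiding behind many fast ones.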
MANIFEST (10 lines changed)
@@ -44,6 +44,16 @@ lib/hl7-diff.sh
 lib/hl7-field.sh
 lib/hl7-schema.sh
 
+# v0.8.2: Microsoft Presidio sidecar (optional, opt-in install).
+# Closes V1 free-text PHI gap from Vera's audit. Requires Python 3.9+ and
+# pip install presidio_analyzer + presidio_anonymizer + fastapi + uvicorn
+# + spaCy en_core_web_sm. install-larry.sh offers to install on first run.
+# Larry's tier-5 silently skips when sidecar is unreachable, so syncing
+# these files is safe even on hosts where Python deps aren't installed.
+lib/phi-presidio-sidecar.py
+lib/phi-sidecar.sh
+lib/phi-client.sh
+
 # Generic helpers
 lib/each.sh
 lib/each-site.sh
install-larry.sh
@@ -192,6 +192,87 @@ else
   warn "cannot write to $LARRY_BIN_DIR — invoke larry directly as: $LARRY_HOME/larry.sh"
 fi
 
+# ─────────────────────────────────────────────────────────────────────────────
+# v0.8.2 — optional PHI Presidio sidecar (free-text NER).
+# Closes V1 from Vera's PHI-leak audit. Opt-in install; larry runs in
+# v0.8.1 mode (rule-pack only) on hosts where this isn't installed.
+# We probe for python3 + pip, then offer the install. Skip silently if
+# python3 isn't available — keeps the install one-shot on raw MobaXterm
+# where Python may not be present.
+# ─────────────────────────────────────────────────────────────────────────────
+if command -v python3 >/dev/null 2>&1; then
+  PYV=$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null || echo "")
+  case "$PYV" in
+    3.9|3.10|3.11|3.12|3.13|3.14|3.15) PY_OK=1 ;;
+    *) PY_OK=0 ;;
+  esac
+  if [ "${PY_OK:-0}" = "1" ]; then
+    say "v0.8.2: Presidio PHI sidecar is available (python $PYV detected)"
+    echo " Presidio provides free-text NER (names, addresses, dates in prose)"
+    echo " that the regex tiers miss. Install adds presidio_analyzer +"
+    echo " presidio_anonymizer + fastapi + uvicorn + spaCy en_core_web_sm"
+    echo " to a dedicated virtualenv at $LARRY_HOME/phi-venv (~400MB on disk,"
+    echo " ~250MB RAM resident when running). One-time cost; tier-5 NER"
+    echo " then runs on every prompt with ~20ms latency."
+    echo ""
+    # Heuristic: if stdin is a TTY, prompt. Otherwise (curl|bash pipe), skip.
+    INSTALL_PHI=""
+    if [ -t 0 ]; then
+      printf 'install Presidio sidecar now? [y/N]: '
+      read -r INSTALL_PHI </dev/tty || INSTALL_PHI=""
+    else
+      echo " (non-interactive install — skip; rerun installer with TTY or set"
+      echo " LARRY_INSTALL_PHI=1 to enable. To install manually later:"
+      echo " python3 -m venv $LARRY_HOME/phi-venv"
+      echo " $LARRY_HOME/phi-venv/bin/pip install presidio_analyzer presidio_anonymizer fastapi uvicorn"
+      echo " $LARRY_HOME/phi-venv/bin/python -m spacy download en_core_web_sm)"
+      INSTALL_PHI="${LARRY_INSTALL_PHI:-n}"
+    fi
+    case "${INSTALL_PHI:-}" in
+      y|Y|yes|YES|1)
+        say "installing Presidio sidecar to $LARRY_HOME/phi-venv (this takes 2-5 minutes)..."
+        if python3 -m venv "$LARRY_HOME/phi-venv" >/dev/null 2>&1; then
+          if "$LARRY_HOME/phi-venv/bin/pip" install --quiet \
+            presidio_analyzer presidio_anonymizer fastapi uvicorn >/dev/null 2>&1; then
+            if "$LARRY_HOME/phi-venv/bin/python" -m spacy download en_core_web_sm \
+              >/dev/null 2>&1; then
+              ok "Presidio sidecar installed (venv: $LARRY_HOME/phi-venv)"
+              # Set LARRY_PHI_VENV in the shim so larry auto-uses it.
+              if [ -f "$LARRY_BIN_DIR/larry" ]; then
+                sed -i.bak "s|^exec \"|export LARRY_PHI_VENV=\"$LARRY_HOME/phi-venv\"\nexec \"|" \
+                  "$LARRY_BIN_DIR/larry" 2>/dev/null || true
+                rm -f "$LARRY_BIN_DIR/larry.bak"
+              fi
+            else
+              warn "spaCy en_core_web_sm download failed; sidecar will not start until model is present"
+            fi
+          else
+            warn "pip install failed; Presidio sidecar not available on this host (larry runs in v0.8.1 mode)"
+          fi
+        else
+          warn "python3 -m venv failed; cannot install Presidio (larry runs in v0.8.1 mode)"
+        fi
+        ;;
+      *)
+        ok "skipped Presidio install — larry runs in v0.8.1 mode (rule-pack auto-PHI only)"
+        ;;
+    esac
+  else
+    warn "python3 detected but version ($PYV) is not 3.9+; Presidio sidecar requires 3.9+"
+    warn "larry runs in v0.8.1 mode (rule-pack auto-PHI only) on this host"
+  fi
+else
+  case "$PLATFORM" in
+    windows-cygwin)
+      warn "python3 not detected on Cygwin/MobaXterm. v0.8.2 Presidio sidecar SKIPPED."
+      warn "Bryan's accepted tradeoff: MobaXterm stays on v0.8.1 + prompt nudges."
+      ;;
+    *)
+      warn "python3 not on PATH; Presidio sidecar skipped (larry runs in v0.8.1 mode)"
+      ;;
+  esac
+fi
+
 # ─────────────────────────────────────────────────────────────────────────────
 # Done
 # ─────────────────────────────────────────────────────────────────────────────
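
The installer's version gate lists the accepted minor versions (3.9 through 3.15) in a `case` pattern, so a later release such as 3.16 would fall through to the "not 3.9+" branch. A numeric comparison avoids that. The sketch below shows the idea in Python; `python_version_ok` is a hypothetical helper, not something the installer defines. It compares integer tuples because a plain string comparison would rank `"3.10"` below `"3.9"`.

```python
def python_version_ok(version: str, minimum=(3, 9)) -> bool:
    """True if a 'MAJOR.MINOR' string is at least `minimum`."""
    try:
        # Compare as integers: as strings, "3.10" sorts before "3.9".
        major, minor = (int(part) for part in version.strip().split(".")[:2])
    except ValueError:
        # Empty or malformed probe output (e.g. python3 failed to run).
        return False
    return (major, minor) >= minimum


if __name__ == "__main__":
    for probe in ("3.8", "3.9", "3.10", "3.16", ""):
        print(repr(probe), python_version_ok(probe))
```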
larry.sh (190 lines changed)
@@ -57,7 +57,7 @@ set -o pipefail
 # ─────────────────────────────────────────────────────────────────────────────
 # Config
 # ─────────────────────────────────────────────────────────────────────────────
-LARRY_VERSION="0.8.1"
+LARRY_VERSION="0.8.2"
 LARRY_HOME="${LARRY_HOME:-$HOME/.larry}"
 
 # ─────────────────────────────────────────────────────────────────────────────
@@ -1753,8 +1753,14 @@ auto_detect_phi() {
     done
   done <<< "$scan"
 
-  [ -z "$hits" ] && { printf '%s' "$input"; return 0; }
+  # v0.8.2: don't early-return when tiers 1-4 found nothing — tier-5
+  # (Presidio NER) is the WHOLE POINT of catching free-text gaps. We run
+  # tier-5 below regardless of $hits. Per-category counters stay scoped
+  # at function level so both tier-1-4 and tier-5 share the summary.
+  local -A cat_count=()
 
+  # Tier-1-4 substitution (skipped when no hits).
+  if [ -n "$hits" ]; then
   # Dedupe hits (preserving first-seen order).
   local seen_hash=""
   local uniq_hits=""
@@ -1768,9 +1774,6 @@ auto_detect_phi() {
     uniq_hits+="$h"$'\n'
   done <<< "$hits"
 
-  # Per-category counters for the status summary.
-  local -A cat_count=()
-
   while IFS= read -r h; do
     [ -z "$h" ] && continue
     local tier="${h%%|*}"; local rest="${h#*|}"
@@ -1808,6 +1811,152 @@ auto_detect_phi() {
     local ctx; ctx=$(printf '%s' "$scan" | grep -F -- "$orig" | head -1 | head -c 80)
     _auto_phi_log "$orig" "$cat" "$token" "$tier" "$surface" "$ctx"
   done <<< "$uniq_hits"
+  fi # end: if [ -n "$hits" ] — v0.8.2 wrapper so tier-5 runs unconditionally
+
+  # v0.8.2 — Tier-5: free-text NER via Presidio sidecar.
+  # Runs AFTER tier-1/2/3/4 (so explicit-marker tokens stay stable and known
+  # values already have their canonical tokens) but BEFORE the status summary.
+  # Tier-5 catches what the regex+keyword tiers miss: bare patient names in
+  # prose ("the patient John Doe..."), addresses without keyword context,
+  # un-keyworded dates, generic phone numbers. Closes V1 from Vera's audit.
+  #
+  # Graceful degradation: if the sidecar isn't reachable (not installed,
+  # not started, crashed), tier-5 silently no-ops — preserves v0.8.1 behavior.
+  # The one exception is LARRY_AUTO_PHI=strict on HL7-shaped input — handled
+  # at the top of this function already.
+  if [ "$AUTO_PHI_MODE" != "off" ] \
+    && [ -r "$LARRY_LIB_DIR/phi-client.sh" ]; then
+    # Source the client lazily (per-call). The functions are tiny and
+    # sourcing each turn lets users update the client without restart.
+    # shellcheck source=lib/phi-client.sh
+    . "$LARRY_LIB_DIR/phi-client.sh" 2>/dev/null
+    if declare -F phi_client_available >/dev/null 2>&1 && phi_client_available; then
+      # Run Presidio on a copy where already-minted [[CAT_NNNN]] tokens are
+      # masked to neutral fixed-width placeholders. This stops Presidio from
+      # tagging text that spans an existing token (which would then corrupt
+      # the token when we literal-replace). We map placeholder→token so the
+      # entity offsets still align, but since we substitute by VALUE (not
+      # offset) below, the mask just needs to remove tokens from Presidio's
+      # view. We use a regex-neutral run of 'x' the same length per token.
+      local _t5_scan="$input"
+      # Replace each [[...]] token with same-length x-run so offsets are
+      # preserved and Presidio sees no bracket structure.
+      _t5_scan=$(printf '%s' "$_t5_scan" | sed -E 's/\[\[[A-Za-z0-9_]+\]\]/XXXXXXXXXX/g')
+      local _t5_entities
+      _t5_entities=$(phi_redact_entities "$_t5_scan" 2>/dev/null) || _t5_entities=""
+      if [ -n "$_t5_entities" ]; then
+        # Format: TYPE\tSTART\tEND\tSCORE\tVALUE per line.
+        # Sort by descending start offset so substituting longest/latest first
+        # doesn't shift earlier offsets (we're using literal string-replace,
+        # but stable ordering keeps the audit log sensible).
+        local _t5_count=0 _t5_line _t5_type _t5_value _t5_score _t5_cat _t5_token
+        while IFS=$'\t' read -r _t5_type _t5_start _t5_end _t5_score _t5_value; do
+          [ -z "$_t5_value" ] && continue
+          # Drop low-confidence noise. Bryan's tier-3/4 strictness applies
+          # equally here — confidence < 0.3 is too noisy for auto-tokenize.
+          local _t5_int_score
+          _t5_int_score=$(printf '%s' "$_t5_score" | awk '{print int($1*100)}')
+          if [ "${_t5_int_score:-0}" -lt 30 ]; then continue; fi
+          # Skip values that look like HL7 field refs or paths (shared
+          # blacklists with the per-word classifier).
+          if declare -F _auto_phi_skip_path_like >/dev/null 2>&1; then
+            _auto_phi_skip_path_like "$_t5_value" && continue
+          fi
+          if declare -F _auto_phi_skip_version >/dev/null 2>&1; then
+            _auto_phi_skip_version "$_t5_value" && continue
+          fi
+          # Skip if the value is already a token (don't double-tokenize).
+          case "$_t5_value" in
+            \[\[*\]\]) continue ;;
+            *\[\[*) continue ;; # value spans/contains a token fragment
+            *XXXXXXXXXX*) continue ;; # value spans a masked token placeholder
+          esac
+          # Noise guard: drop bare uppercase field-label acronyms Presidio
+          # over-eagerly tags as ORGANIZATION ("SSN", "MRN", "DOB", "ED",
+          # "Phone", "ADT"). These are HL7/clinical jargon, not PHI. We keep
+          # them out of the tokenize set to avoid (a) noise and (b) the
+          # substring-corruption class (a 3-letter value substring-matching
+          # inside another token). A real name is mixed-case or multi-word.
+          case "$_t5_value" in
+            [A-Z][A-Z]|[A-Z][A-Z][A-Z]|[A-Z][A-Z][A-Z][A-Z]) continue ;;
+          esac
+          # Skip very short single tokens (< 3 chars) — too collision-prone
+          # for literal-string replace.
+          if [ "${#_t5_value}" -lt 3 ]; then continue; fi
+          # Token-safe substitution guard: if the value occurs ONLY as a
+          # substring of an existing [[...]] token in the current input,
+          # skip it (replacing would corrupt the token). We check by
+          # masking tokens and seeing if the value still appears.
+          local _t5_masked
+          _t5_masked=$(printf '%s' "$input" | sed -E 's/\[\[[A-Za-z0-9_]+\]\]/\x01/g')
+          case "$_t5_masked" in
+            *"$_t5_value"*) : ;; # appears outside any token — safe
+            *) continue ;; # only inside tokens — skip
+          esac
+          # Map Presidio entity types to lookup.tsv categories. Prefix with
+          # presidio_ so they stay distinguishable from rule-pack categories
+          # in audit logs and the /tokens listing.
+          _t5_cat="presidio_${_t5_type}"
+          # Confirm mode (Tier 3/4 style) — prompt once per value.
+          if [ "$AUTO_PHI_MODE" = "confirm" ]; then
+            _auto_phi_confirm "$_t5_value" "$_t5_cat" "presidio" || continue
+          fi
+          _t5_token=$("$sanitize_script" tokenize-value --category "$_t5_cat" "$_t5_value" 2>/dev/null)
+          if [ -z "$_t5_token" ]; then
+            if [ "$AUTO_PHI_MODE" = "strict" ]; then
+              printf 'error: auto-PHI tokenize-value returned empty for tier-5 value (category=%s); LARRY_AUTO_PHI=strict aborts turn\n' \
+                "$_t5_cat" >&2
+              return 42
+            fi
+            continue
+          fi
+          # Token-protected literal substitution. Existing [[...]] tokens are
+          # pulled out to numbered sentinels, the tier-5 value is replaced in
+          # the remaining text, then the sentinels are restored. This is
+          # robust against a value that happens to be a substring of an
+          # existing token (e.g. a digit run that also appears in a token ID)
+          # — tiers 1-4 use plain replace because their values are minted
+          # fresh and can't collide, but tier-5 runs on already-tokenized text.
+          local _t5_proto="$input" _t5_sentinel_map="" _t5_tok _t5_idx=0
+          # Extract existing tokens into sentinels of the form \x02<idx>\x02.
+          while IFS= read -r _t5_tok; do
+            [ -z "$_t5_tok" ] && continue
+            local _t5_sent=$'\x02'"${_t5_idx}"$'\x02'
+            _t5_proto="${_t5_proto//"$_t5_tok"/"$_t5_sent"}"
+            _t5_sentinel_map+="${_t5_idx}"$'\t'"${_t5_tok}"$'\n'
+            _t5_idx=$(( _t5_idx + 1 ))
+          done < <(printf '%s' "$input" | grep -oE '\[\[[A-Za-z0-9_]+\]\]' | sort -u)
+          # Replace the value in the protected (sentinel-bearing) text.
+          _t5_proto="${_t5_proto//"$_t5_value"/"$_t5_token"}"
+          # Restore sentinels back to their original tokens.
+          local _t5_mline _t5_mid _t5_mtok
+          while IFS=$'\t' read -r _t5_mid _t5_mtok; do
+            [ -z "$_t5_mid" ] && continue
+            local _t5_sent2=$'\x02'"${_t5_mid}"$'\x02'
+            _t5_proto="${_t5_proto//"$_t5_sent2"/"$_t5_mtok"}"
+          done <<< "$_t5_sentinel_map"
+          input="$_t5_proto"
+          cat_count[$_t5_cat]=$(( ${cat_count[$_t5_cat]:-0} + 1 ))
+          AUTO_PHI_SESSION_COUNT=$(( AUTO_PHI_SESSION_COUNT + 1 ))
+          _t5_count=$(( _t5_count + 1 ))
+          _auto_phi_log "$_t5_value" "$_t5_cat" "$_t5_token" "presidio" "$surface" "score=$_t5_score"
+        done <<< "$_t5_entities"
+        if [ "$_t5_count" -gt 0 ]; then
+          printf '%sphi>%s tier-5 (presidio NER) auto-tokenized %d additional value(s) [%s]\n' \
+            "$C_DIM" "$C_RESET" "$_t5_count" "$surface" >&2
+        fi
+      fi
+    else
+      # Sidecar unreachable — emit a one-time per-session stderr warning.
+      if [ -z "${_LARRY_PHI_TIER5_WARNED:-}" ]; then
+        if [ -x "$LARRY_LIB_DIR/phi-sidecar.sh" ]; then
+          printf '%sphi>%s tier-5 (presidio NER) disabled — sidecar not running. Start with: %s/phi-sidecar.sh ensure\n' \
+            "$C_DIM" "$C_RESET" "$LARRY_LIB_DIR" >&2
+        fi
+        export _LARRY_PHI_TIER5_WARNED=1
+      fi
+    fi
+  fi
 
   # Emit a single status summary if anything was tokenized.
   if [ ${#cat_count[@]} -gt 0 ]; then
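
The tier-5 loop above applies several guards before an entity is tokenized. The Python sketch below restates the self-contained ones (score threshold, token and placeholder overlap, acronym guard, minimum length) as a single predicate, mainly to make the order of checks easy to read. The name `keep_entity` is hypothetical, and the path-like and version skips are left out because they depend on helpers defined elsewhere in `larry.sh`.

```python
import re

PLACEHOLDER = "XXXXXXXXXX"               # the mask the diff substitutes for [[...]] tokens
ACRONYM_RE = re.compile(r"[A-Z]{2,4}")   # bare 2-4 letter uppercase labels (SSN, MRN, ED)


def keep_entity(value: str, score: float, min_score: float = 0.3) -> bool:
    """Decide whether a Presidio hit should go on to tokenization."""
    if not value:
        return False
    if score < min_score:
        return False  # low-confidence noise
    if "[[" in value or PLACEHOLDER in value:
        return False  # overlaps an existing token or its mask
    if ACRONYM_RE.fullmatch(value):
        return False  # HL7/clinical jargon rather than PHI
    if len(value) < 3:
        return False  # too collision-prone for literal replace
    return True


if __name__ == "__main__":
    hits = [("John Doe", 0.85), ("MRN", 0.90), ("Springfield", 0.25)]
    print([value for value, score in hits if keep_entity(value, score)])
```

One difference from the bash: the script truncates `score * 100` to an integer and compares it with 30, while this sketch compares the float directly. Scores a hair under 0.3 may therefore be treated differently by the two.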
@@ -3852,6 +4001,7 @@ _LARRY_SLASH_CMDS=(
   /mouse
   /origin
   /phi-auto
+  /phi-sidecar
 )
 
 # _LARRY_SLASH_CMDS_DESC — one-line descriptions for each slash command.
@@ -3904,6 +4054,7 @@ _LARRY_SLASH_CMDS_DESC=(
   [/mouse]="on|off toggle xterm mouse mode for this session"
   [/origin]="show/pin auto-update origin (gitea|auto|<https URL>) — v0.7.4 single-source"
   [/phi-auto]="on|off|confirm|strict|status — runtime control for v0.7.3+v0.8.0 auto PHI detection"
+  [/phi-sidecar]="start|stop|status|health|ensure — v0.8.2 Presidio NER sidecar lifecycle"
 )
 
 # __larry_complete_slash — bound to TAB via `bind -x` (see _install_readline_tab).
@@ -4565,6 +4716,19 @@ main_loop() {
 
   larry_say "${C_BOLD}Larry-Anywhere v$LARRY_VERSION${C_RESET} ready. Model: $LARRY_MODEL."
   larry_say "Type your message and press Enter. Use '<<' alone on a line to start multi-line (end with 'EOF'). /help for commands."
+
+  # v0.8.2: best-effort PHI Presidio sidecar start. Backgrounded so larry
+  # is interactive immediately; tier-5 silently no-ops until the sidecar
+  # is healthy (which takes ~9s for model load). Skip entirely if
+  # LARRY_PHI_AUTOSTART=0 or if the sidecar launcher isn't present.
+  if [ "${LARRY_PHI_AUTOSTART:-1}" = "1" ] \
+    && [ -x "$LARRY_LIB_DIR/phi-sidecar.sh" ]; then
+    (
+      "$LARRY_LIB_DIR/phi-sidecar.sh" ensure >/dev/null 2>&1 || true
+    ) &
+    disown 2>/dev/null || true
+  fi
+
   echo ""
 
   while true; do
@@ -4767,6 +4931,22 @@ main_loop() {
             ;;
         esac
         continue ;;
+      # v0.8.2: PHI Presidio sidecar lifecycle.
+      /phi-sidecar|/phi-sidecar\ *)
+        local _arg; _arg=$(_slash_args "/phi-sidecar" "$input")
+        if [ ! -x "$LARRY_LIB_DIR/phi-sidecar.sh" ]; then
+          err "phi-sidecar.sh not installed (lib/phi-sidecar.sh missing or non-executable)"
+          continue
+        fi
+        case "${_arg:-status}" in
+          start|stop|status|health|ensure)
+            "$LARRY_LIB_DIR/phi-sidecar.sh" "$_arg"
+            ;;
+          *)
+            err "usage: /phi-sidecar start|stop|status|health|ensure (no arg → status)"
+            ;;
+        esac
+        continue ;;
       /mouse|/mouse\ *)
         local _arg; _arg=$(_slash_args "/mouse" "$input")
         case "${_arg:-status}" in
117
lib/phi-client.sh
Executable file
117
lib/phi-client.sh
Executable file
@ -0,0 +1,117 @@
|
||||
#!/usr/bin/env bash
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
# larry-anywhere v0.8.2: PHI Presidio client
|
||||
#
|
||||
# Bash wrapper around the Presidio sidecar's /redact endpoint. Sourced from
|
||||
# larry.sh's auto-PHI pipeline as the tier-5 free-text NER pass.
|
||||
#
|
||||
# Functions (sourced):
|
||||
# phi_client_available — 0 if sidecar reachable; 1 otherwise
|
||||
# phi_redact_text TEXT — echo redacted form on stdout; non-zero on failure
|
||||
# (in which case caller leaves TEXT unchanged —
|
||||
# "fail-open" is the right call for tier-5 alone)
|
||||
# Standalone:
|
||||
# ./phi-client.sh check — health probe
|
||||
# ./phi-client.sh redact "the patient ..." — one-shot redact
|
||||
#
|
||||
# Wire-up in larry.sh:auto_detect_phi:
|
||||
# - After tier-1/2/3/4 produce hits and tokenize, BEFORE add_user_text,
|
||||
# call phi_redact_text on the (already-partially-tokenized) input.
|
||||
# - For each entity returned with score > threshold, tokenize via
|
||||
# hl7-sanitize.sh's tokenize-value (category = presidio_<entity_type>)
|
||||
# to maintain stable token IDs across surfaces.
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
LARRY_PHI_PORT="${LARRY_PHI_PORT:-41189}"
|
||||
LARRY_PHI_HOST="${LARRY_PHI_HOST:-127.0.0.1}"
|
||||
LARRY_PHI_TIMEOUT="${LARRY_PHI_TIMEOUT:-5}" # seconds — bounds tier-5 stall
|
||||
|
||||
# Defense against CR-tainted env (Cygwin v0.7.5 lesson).
|
||||
_phi_client_dir="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" 2>/dev/null && pwd)"
|
||||
if [ -f "$_phi_client_dir/cygwin-safe.sh" ]; then
|
||||
# shellcheck source=cygwin-safe.sh
|
||||
. "$_phi_client_dir/cygwin-safe.sh" 2>/dev/null || true
|
||||
fi
|
||||
if declare -F coerce_int >/dev/null 2>&1; then
|
||||
LARRY_PHI_PORT=$(coerce_int "$LARRY_PHI_PORT" 41189)
|
||||
LARRY_PHI_TIMEOUT=$(coerce_int "$LARRY_PHI_TIMEOUT" 5)
|
||||
fi
|
||||
|
||||
phi_client_available() {
|
||||
curl -fsS -m 1 "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/health" >/dev/null 2>&1
|
||||
}
|
||||
|
||||
# phi_redact_text TEXT → emits redacted TEXT on stdout, non-zero on any failure.
|
||||
# JSON-quoting handled via jq so payload is safe for any control chars.
|
||||
phi_redact_text() {
|
||||
local text="$1"
|
||||
[ -z "$text" ] && { printf ''; return 0; }
|
||||
# Build JSON payload via jq -n --arg (handles all escaping correctly).
|
||||
local payload
|
||||
payload=$(jq -nc --arg t "$text" '{text:$t}') || return 2
|
||||
local resp
|
||||
resp=$(curl -fsS -m "$LARRY_PHI_TIMEOUT" \
|
||||
-X POST -H 'Content-Type: application/json' \
|
||||
--data-binary "$payload" \
|
||||
"http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/redact" 2>/dev/null) || return 3
|
||||
# Parse out the redacted text. Empty → upstream error.
|
||||
local redacted
|
||||
redacted=$(printf '%s' "$resp" | jq -r '.redacted // empty' 2>/dev/null) || return 4
|
||||
[ -z "$redacted" ] && return 5
|
||||
printf '%s' "$redacted"
|
||||
return 0
|
||||
}
|
||||
|
||||
# Emit the entities (one per line: TYPE\tSTART\tEND\tSCORE\tVALUE) so the
|
||||
# caller can re-tokenize with hl7-sanitize.sh's tokenize-value pipeline
|
||||
# (categories: presidio_PERSON, presidio_LOCATION, etc.) for stable IDs.
|
||||
phi_redact_entities() {
|
||||
local text="$1"
|
||||
[ -z "$text" ] && return 0
|
||||
local payload resp
|
||||
payload=$(jq -nc --arg t "$text" '{text:$t}') || return 2
|
||||
resp=$(curl -fsS -m "$LARRY_PHI_TIMEOUT" \
|
||||
-X POST -H 'Content-Type: application/json' \
|
||||
--data-binary "$payload" \
|
||||
"http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/redact" 2>/dev/null) || return 3
|
||||
printf '%s' "$resp" | jq -r '
|
||||
.entities[]? |
|
||||
[.type, (.start|tostring), (.end|tostring), (.score|tostring), ($input[(.start|tonumber):(.end|tonumber)])] |
|
||||
@tsv
|
||||
' --argjson input "$(jq -nc --arg t "$text" '$t')" 2>/dev/null
|
||||
}
|
||||
|
||||
# Standalone CLI mode (when run, not sourced).
if [ "${BASH_SOURCE[0]:-$0}" = "$0" ]; then
  case "${1:-}" in
    check)
      if phi_client_available; then echo "phi-client: sidecar reachable"; exit 0
      else echo "phi-client: sidecar unreachable on $LARRY_PHI_HOST:$LARRY_PHI_PORT" >&2; exit 1; fi
      ;;
    redact)
      shift
      [ -z "${1:-}" ] && { echo "usage: phi-client.sh redact <text>" >&2; exit 2; }
      # Exit with phi_redact_text's status, not the trailing echo's.
      phi_redact_text "$1"; rc=$?; echo; exit "$rc"
      ;;
    entities)
      shift
      [ -z "${1:-}" ] && { echo "usage: phi-client.sh entities <text>" >&2; exit 2; }
      phi_redact_entities "$1"
      ;;
    *)
      cat <<USAGE
phi-client.sh — larry-anywhere v0.8.2 Presidio client

  check              health probe
  redact <text>      emit redacted text
  entities <text>    emit entities (TYPE TAB START END SCORE VALUE)

Functions (when sourced):
  phi_client_available
  phi_redact_text <text>
  phi_redact_entities <text>

Env: LARRY_PHI_PORT (41189), LARRY_PHI_HOST (127.0.0.1), LARRY_PHI_TIMEOUT (5).
USAGE
      ;;
  esac
fi
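The contract that `phi_redact_entities` relies on can be sketched in Python. This is only an illustration, not part of the change: `build_payload` and `entities_to_tsv` are hypothetical names, and the sample response below is made up to match the shape `/redact` returns (an `entities` list with `type`, `start`, `end`, `score`).

```python
import json


def build_payload(text: str) -> str:
    """Serialize the request body; json.dumps handles quoting and newlines."""
    return json.dumps({"text": text})


def entities_to_tsv(text: str, response: dict) -> list[str]:
    """Return one TYPE\\tSTART\\tEND\\tSCORE\\tVALUE row per entity.

    The response carries offsets only, so VALUE is recovered by slicing
    the original text with start/end.
    """
    rows = []
    for ent in response.get("entities", []):
        start, end = int(ent["start"]), int(ent["end"])
        rows.append(
            "\t".join([ent["type"], str(start), str(end), str(ent["score"]), text[start:end]])
        )
    return rows


if __name__ == "__main__":
    sample_text = "Patient John Smith called"
    # Hypothetical response, shaped like the sidecar's /redact output.
    sample_resp = {
        "redacted": "Patient <PERSON> called",
        "entities": [{"type": "PERSON", "start": 8, "end": 18, "score": 0.85}],
    }
    for row in entities_to_tsv(sample_text, sample_resp):
        print(row)
```

Because the VALUE column is cut from the caller's own copy of the text, the offsets are only meaningful against the exact string that was sent.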
152
lib/phi-presidio-sidecar.py
Executable file
@@ -0,0 +1,152 @@
#!/usr/bin/env python3
"""
larry-anywhere v0.8.2: Microsoft Presidio sidecar for free-text NER.

Closes V1 from Vera's PHI-leak audit (the dominant real-world failure mode:
patient names / addresses / un-keyworded dates in prose chat). Free-text PHI
flows past the v0.7.3 tier-1/2/3/4 classifier because that classifier is
HL7-segment-aware and keyword-driven, not a general entity recognizer.

This sidecar runs Microsoft Presidio (spaCy backend + custom recognizers)
as a persistent FastAPI service on 127.0.0.1:$LARRY_PHI_PORT (default 41189).
Larry's main loop hits it via curl as the LAST tier of auto-PHI detection.

Wire-up:
  - lib/phi-sidecar.sh: bash launcher / health-check / lifecycle
  - lib/phi-client.sh: bash client (phi_redact_text wrapper)
  - larry.sh:auto_detect_phi: calls phi_redact_text as tier-5 (post-explicit-marker,
    post-tier-1-to-4, before sending input to model)
  - install-larry.sh: offers to pip-install presidio + spacy + en_core_web_sm

Benchmarks (Bryan's Mac, Apple Silicon, en_core_web_sm):
  cold start (model load):  ~9 seconds
  warm latency (P50/P95):   20ms / 22ms (analyzer only)
  HTTP round-trip warm:     ~57ms (curl --unix-socket via TCP fallback)
The first request after startup pays a ~150ms tokenizer-warmup tax; after
that, latency stays within the 200ms-per-turn REPL budget Bryan specified.

Failure mode: if Presidio fails to load (model missing, package broken),
the process exits non-zero. The bash launcher detects this and tells the
user. Larry's tier-5 no-ops (after a one-time stderr warning) when the
sidecar is unreachable, preserving v0.8.1 behavior on hosts where Presidio
isn't installed.

Compatibility: requires Python 3.9+ (3.14 tested). MobaXterm/Cygwin
compatibility is gated by spaCy's C-extension wheels; if pip install
presidio_analyzer fails on Cygwin, this host stays on v0.8.1 + nudges
per Bryan's accepted tradeoff.
"""
from __future__ import annotations

import os
import sys
import time
import logging

logging.basicConfig(level=logging.WARNING, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("phi-sidecar")

try:
    from fastapi import FastAPI, Body
    from pydantic import BaseModel
    from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer
    from presidio_analyzer.nlp_engine import NlpEngineProvider
    from presidio_anonymizer import AnonymizerEngine
    import uvicorn
except ImportError as e:
    sys.stderr.write(f"phi-sidecar: missing dependency ({e}); install with:\n")
    sys.stderr.write(" pip install presidio_analyzer presidio_anonymizer fastapi uvicorn\n")
    sys.stderr.write(" python -m spacy download en_core_web_sm\n")
    sys.exit(3)

LARRY_PHI_PORT = int(os.environ.get("LARRY_PHI_PORT", "41189"))
LARRY_PHI_HOST = os.environ.get("LARRY_PHI_HOST", "127.0.0.1")
LARRY_PHI_MODEL = os.environ.get("LARRY_PHI_MODEL", "en_core_web_sm")


# Module-scope request model. MUST be module-level, not function-local:
# pydantic v2 + FastAPI introspection treats a closure-defined model as
# query params (the symptom: 'Field required' on a "query" location for
# the body param), which breaks the /redact endpoint silently.
class RedactReq(BaseModel):
    text: str
    score_threshold: float = 0.3  # below this confidence we ignore


def build_analyzer() -> AnalyzerEngine:
    """Load Presidio with the configured spaCy model (default en_core_web_sm,
    small/fast) plus HL7-specific custom recognizers."""
    config = {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": LARRY_PHI_MODEL}],
    }
    nlp_engine = NlpEngineProvider(nlp_configuration=config).create_engine()
    analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"])

    # Custom recognizers tuned for HL7/Cloverleaf operator chat. These run
    # IN ADDITION TO Presidio's built-in PII recognizers (PERSON, LOCATION,
    # DATE_TIME, PHONE_NUMBER, US_SSN, EMAIL_ADDRESS, etc.).
    #
    # HL7_MRN: 6-12 digit numeric, looser than NPI's strict 10-digit rule.
    # Catches "check 623000286" prose where the keyword-based tier-2 missed.
    analyzer.registry.add_recognizer(
        PatternRecognizer(
            supported_entity="HL7_MRN",
            patterns=[Pattern("hl7_mrn_6_12", r"\b\d{6,12}\b", 0.30)],
            context=["mrn", "patient", "record", "account", "acct", "visit", "encounter", "csn"],
        )
    )
    # HL7_CARET_NAME: "SMITH^JOHN" / "SMITH^JOHN^Q" pattern outside Tier-3
    # context. The v0.7.3 Tier-3 only fires when PID.3/PID.5/etc. is in the
    # same line; this recognizer catches the caret-name itself. The optional
    # third component may be a single letter, so a middle initial such as
    # "^Q" is included in the match.
    analyzer.registry.add_recognizer(
        PatternRecognizer(
            supported_entity="HL7_CARET_NAME",
            patterns=[Pattern("caret_name", r"\b[A-Z][A-Z\-']+\^[A-Z][A-Z\-']+(\^[A-Z][A-Z\-']*)?\b", 0.85)],
        )
    )
    # HL7_PHONE_BARE: plain 10-digit number such as "5558675309" (no
    # dashes/parens), which Tier 1 misses because it requires formatting.
    # The pattern follows NANP rules (area code and exchange both start with
    # 2-9), so "5551234567" does not match it. Base confidence is kept low
    # so plain numbers in code stay safe.
    analyzer.registry.add_recognizer(
        PatternRecognizer(
            supported_entity="HL7_PHONE_BARE",
            patterns=[Pattern("phone_10_bare", r"\b[2-9]\d{2}[2-9]\d{6}\b", 0.20)],
            context=["phone", "tel", "telephone", "contact", "cell", "mobile"],
        )
    )
    return analyzer


def main():
    log.warning("loading presidio (this takes ~5-10 seconds the first time)...")
    t0 = time.time()
    analyzer = build_analyzer()
    anonymizer = AnonymizerEngine()
    log.warning(f"presidio ready in {(time.time()-t0)*1000:.0f} ms; listening on {LARRY_PHI_HOST}:{LARRY_PHI_PORT}")

    app = FastAPI(title="larry-phi-sidecar", version="0.8.2")

    @app.post("/redact")
    def redact(req: RedactReq = Body(...)):
        t0 = time.time()
        results = analyzer.analyze(text=req.text, language="en", score_threshold=req.score_threshold)
        anon = anonymizer.anonymize(text=req.text, analyzer_results=results)
        return {
            "redacted": anon.text,
            "entities": [
                {"type": r.entity_type, "start": r.start, "end": r.end, "score": r.score}
                for r in results
            ],
            "latency_ms": (time.time() - t0) * 1000,
        }

    @app.get("/health")
    def health():
        return {"status": "ok", "model": LARRY_PHI_MODEL, "port": LARRY_PHI_PORT}

    uvicorn.run(app, host=LARRY_PHI_HOST, port=LARRY_PHI_PORT, log_level="warning")


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        sys.exit(0)
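To see what the custom patterns accept before Presidio's scoring is involved, the regexes can be exercised with plain `re`. This is a rough sketch under stated assumptions: the patterns are modelled on those in `build_analyzer()` (the caret-name one is reduced to its two-component FAMILY^GIVEN core), `find` is a hypothetical helper, and neither context-word scoring nor any regex flags Presidio may apply are reproduced here.

```python
import re

# Patterns modelled on the custom recognizers in build_analyzer().
HL7_MRN = re.compile(r"\b\d{6,12}\b")
HL7_PHONE_BARE = re.compile(r"\b[2-9]\d{2}[2-9]\d{6}\b")
# Two-component core of the caret-name pattern (an assumption of this sketch).
HL7_CARET_NAME = re.compile(r"\b[A-Z][A-Z\-']+\^[A-Z][A-Z\-']+")


def find(pattern: re.Pattern, text: str) -> list[str]:
    """Return every non-overlapping match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(find(HL7_MRN, "check 623000286"))            # 9 digits: inside 6-12
    print(find(HL7_MRN, "port 41189"))                 # 5 digits: too short
    print(find(HL7_PHONE_BARE, "cell 5558675309"))     # exchange starts with 8
    print(find(HL7_PHONE_BARE, "cell 5551234567"))     # exchange starts with 1
    print(find(HL7_CARET_NAME, "resend the ADT for SMITH^JOHN today"))
```

Note that any 10-digit run that satisfies the bare-phone pattern also satisfies the 6-12 digit MRN pattern, so both entity types can be reported for the same span; which one survives depends on their scores.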
192
lib/phi-sidecar.sh
Executable file
@@ -0,0 +1,192 @@
#!/usr/bin/env bash
# ─────────────────────────────────────────────────────────────────────────────
# larry-anywhere v0.8.2: PHI Presidio sidecar lifecycle
#
# Manages the local Presidio FastAPI service used by auto-PHI tier-5
# (free-text NER). Started once at larry-anywhere REPL boot (best-effort;
# never blocks larry's startup), reused across turns, torn down on exit.
#
# Subcommands:
#   start    launch the sidecar in the background if not already up
#   stop     gracefully terminate the sidecar (TERM, then KILL)
#   status   report up/down + port + pid
#   health   curl /health endpoint (one-shot)
#   ensure   start if not up; quick no-op if up. Idempotent. The
#            primary entry point for larry.sh launch flow.
#
# Env:
#   LARRY_PHI_PORT     default 41189
#   LARRY_PHI_HOST     default 127.0.0.1
#   LARRY_PHI_PYTHON   default python3
#   LARRY_PHI_VENV     optional path to a virtualenv; if set, uses
#                      $LARRY_PHI_VENV/bin/python instead
#   LARRY_HOME         stores PID file at $LARRY_HOME/.phi-sidecar.pid
#                      and stderr log at $LARRY_HOME/log/phi-sidecar.log
#
# Failure handling:
#   If the sidecar can't start (missing deps, port collision, model missing),
#   `start` returns non-zero with a stderr explanation. Callers in larry.sh
#   MUST treat sidecar absence as "tier-5 disabled" and must not block the turn.
# ─────────────────────────────────────────────────────────────────────────────
set -uo pipefail

LARRY_HOME="${LARRY_HOME:-$HOME/.larry}"
LARRY_PHI_PORT="${LARRY_PHI_PORT:-41189}"
LARRY_PHI_HOST="${LARRY_PHI_HOST:-127.0.0.1}"
LARRY_PHI_PYTHON="${LARRY_PHI_PYTHON:-python3}"

_PHI_SCRIPT_DIR="$(cd "$(dirname "$0")" 2>/dev/null && pwd)"
_PHI_SIDECAR_PY="$_PHI_SCRIPT_DIR/phi-presidio-sidecar.py"
_PHI_PID_FILE="$LARRY_HOME/.phi-sidecar.pid"
_PHI_LOG_FILE="$LARRY_HOME/log/phi-sidecar.log"

# Coerce CR-tainted port number (Cygwin defense, a v0.7.5 lesson).
if [ -f "$_PHI_SCRIPT_DIR/cygwin-safe.sh" ]; then
  # shellcheck source=cygwin-safe.sh
  . "$_PHI_SCRIPT_DIR/cygwin-safe.sh" 2>/dev/null || true
fi
if declare -F coerce_int >/dev/null 2>&1; then
  LARRY_PHI_PORT=$(coerce_int "$LARRY_PHI_PORT" 41189)
fi

_phi_python() {
  if [ -n "${LARRY_PHI_VENV:-}" ] && [ -x "$LARRY_PHI_VENV/bin/python" ]; then
    printf '%s' "$LARRY_PHI_VENV/bin/python"
    return
  fi
  if command -v "$LARRY_PHI_PYTHON" >/dev/null 2>&1; then
    printf '%s' "$LARRY_PHI_PYTHON"
    return
  fi
  printf ''
}

_phi_is_up() {
  # Health check via curl (lightweight). Don't trust the PID file alone:
  # the recorded pid could be stale and now belong to an unrelated process.
  curl -fsS -m 1 "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/health" >/dev/null 2>&1
}

cmd_status() {
  if _phi_is_up; then
    local body; body=$(curl -fsS -m 1 "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/health" 2>/dev/null)
    printf 'phi-sidecar: up — %s (pid %s)\n' "$body" "$(cat "$_PHI_PID_FILE" 2>/dev/null || echo unknown)"
    return 0
  fi
  printf 'phi-sidecar: down\n'
  return 1
}

cmd_health() {
  curl -fsS -m 1 "http://${LARRY_PHI_HOST}:${LARRY_PHI_PORT}/health" 2>/dev/null
  local rc=$?
  if [ "$rc" != "0" ]; then
    printf '{"status":"down","error":"unreachable on %s:%s"}\n' "$LARRY_PHI_HOST" "$LARRY_PHI_PORT" >&2
    return 1
  fi
  echo
  return 0
}

cmd_start() {
  if _phi_is_up; then
    cmd_status
    return 0
  fi
  local py; py=$(_phi_python)
  if [ -z "$py" ]; then
    printf 'phi-sidecar: cannot start — python3 not on PATH (set LARRY_PHI_PYTHON or LARRY_PHI_VENV)\n' >&2
    return 4
  fi
  if [ ! -r "$_PHI_SIDECAR_PY" ]; then
    printf 'phi-sidecar: cannot start — %s missing\n' "$_PHI_SIDECAR_PY" >&2
    return 4
  fi
  # Quick dependency probe: don't load the model (that takes ~9s), just
  # check that the imports succeed. If this fails, exit early with a clear
  # message.
  if ! "$py" -c 'import presidio_analyzer, presidio_anonymizer, fastapi, uvicorn' 2>/dev/null; then
    printf 'phi-sidecar: cannot start — presidio_analyzer / presidio_anonymizer / fastapi / uvicorn not installed for %s\n' "$py" >&2
    printf ' install with: %s -m pip install presidio_analyzer presidio_anonymizer fastapi uvicorn\n' "$py" >&2
    printf ' then: %s -m spacy download en_core_web_sm\n' "$py" >&2
    return 5
  fi
  mkdir -p "$(dirname "$_PHI_PID_FILE")" "$(dirname "$_PHI_LOG_FILE")" 2>/dev/null
  LARRY_PHI_PORT="$LARRY_PHI_PORT" LARRY_PHI_HOST="$LARRY_PHI_HOST" \
    nohup "$py" "$_PHI_SIDECAR_PY" >> "$_PHI_LOG_FILE" 2>&1 &
  local pid=$!
  echo "$pid" > "$_PHI_PID_FILE"
  # Wait up to 30 seconds for the model to load + the FastAPI port to open.
  local i
  for i in $(seq 1 30); do
    sleep 1
    if _phi_is_up; then
      printf 'phi-sidecar: started in %ds (pid %s, port %s)\n' "$i" "$pid" "$LARRY_PHI_PORT" >&2
      return 0
    fi
    # If the python process died, surface the tail of the log.
    if ! kill -0 "$pid" 2>/dev/null; then
      printf 'phi-sidecar: process died during startup; tail of log:\n' >&2
      tail -20 "$_PHI_LOG_FILE" >&2
      rm -f "$_PHI_PID_FILE"
      return 6
    fi
  done
  printf 'phi-sidecar: did not become healthy within 30s; tail of log:\n' >&2
  tail -20 "$_PHI_LOG_FILE" >&2
  return 7
}

cmd_stop() {
  local pid=""
  [ -f "$_PHI_PID_FILE" ] && pid=$(cat "$_PHI_PID_FILE" 2>/dev/null)
  if [ -z "$pid" ]; then
    printf 'phi-sidecar: no pid file\n'
    return 0
  fi
  if kill -0 "$pid" 2>/dev/null; then
    kill -TERM "$pid" 2>/dev/null
    local i
    for i in 1 2 3 4 5; do
      sleep 1
      kill -0 "$pid" 2>/dev/null || break
    done
    if kill -0 "$pid" 2>/dev/null; then
      kill -KILL "$pid" 2>/dev/null
    fi
  fi
  rm -f "$_PHI_PID_FILE"
  printf 'phi-sidecar: stopped (pid %s)\n' "$pid"
}

cmd_ensure() {
  if _phi_is_up; then
    return 0
  fi
  cmd_start
}

case "${1:-}" in
  start)  shift; cmd_start "$@" ;;
  stop)   shift; cmd_stop "$@" ;;
  status) shift; cmd_status "$@" ;;
  health) shift; cmd_health "$@" ;;
  ensure) shift; cmd_ensure "$@" ;;
  ""|help|-h|--help)
    cat <<USAGE
phi-sidecar.sh — larry-anywhere v0.8.2 Presidio sidecar lifecycle

  start    launch (background) if not up; waits up to 30s for health
  stop     gracefully terminate (TERM then KILL)
  status   up/down + pid
  health   one-shot curl /health
  ensure   start if down, no-op if up (idempotent; primary entry point)

Env: LARRY_PHI_PORT (default 41189), LARRY_PHI_HOST (default 127.0.0.1),
     LARRY_PHI_PYTHON (default python3), LARRY_PHI_VENV (optional venv).

Logs: \$LARRY_HOME/log/phi-sidecar.log
PID:  \$LARRY_HOME/.phi-sidecar.pid
USAGE
    ;;
  *) printf 'phi-sidecar.sh: unknown subcommand: %s\n' "$1" >&2; exit 2 ;;
esac
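The start-then-poll logic in `cmd_start` (sleep, probe `/health`, check that the child is still alive, give up after 30 tries) can be modelled in a few lines of Python. This is a hedged sketch rather than a port: `wait_until_healthy` is a hypothetical name, and the health probe and liveness check are injected as plain callables so the loop can be reasoned about without a real sidecar.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    is_alive: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, int]:
    """Poll until the service answers, the process dies, or attempts run out.

    Returns (outcome, attempt) where outcome is "started", "died" or
    "timeout"; these correspond to cmd_start's return codes 0, 6 and 7.
    """
    for attempt in range(1, attempts + 1):
        sleep(delay)
        if probe():            # health endpoint answered
            return ("started", attempt)
        if not is_alive():     # child exited before becoming healthy
            return ("died", attempt)
    return ("timeout", attempts)


if __name__ == "__main__":
    # Simulated sidecar that becomes healthy on the third probe.
    answers = iter([False, False, True])
    outcome = wait_until_healthy(lambda: next(answers), lambda: True, sleep=lambda _: None)
    print(outcome)
```

Checking liveness inside the loop is what lets a broken install fail within a second or so instead of waiting out the full 30-second budget.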