cloverleaf-larry/lib/phi-presidio-sidecar.py
Bryan Johnson 60b8f0e1c8 v0.8.2: Presidio sidecar for free-text NER (tier-5) — closes V1
The only path that closes V1 (free-text PHI gap — the dominant real-world
failure mode per Vera). Opt-in install; larry runs in v0.8.1 mode on hosts
without Presidio (MobaXterm/Cygwin per Bryan's accepted tradeoff).

New files:
- lib/phi-presidio-sidecar.py — FastAPI service on 127.0.0.1:$LARRY_PHI_PORT
  (default 41189). Presidio AnalyzerEngine + AnonymizerEngine over spaCy
  en_core_web_sm + 3 HL7-specific custom recognizers (HL7_MRN, HL7_CARET_NAME,
  HL7_PHONE_BARE). POST /redact and GET /health.
- lib/phi-sidecar.sh — lifecycle (start/stop/status/health/ensure). ensure
  is idempotent; called backgrounded from main_loop so it never blocks the
  first prompt. Honors LARRY_PHI_VENV.
- lib/phi-client.sh — bash client (phi_client_available / phi_redact_text /
  phi_redact_entities). CR-safe; 5s timeout bounds tier-5 stall.
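
For illustration only, the client contract described above (POST /redact with a bounded timeout, and a quiet no-op when the sidecar is unreachable) can be sketched in Python. The shipped client is the bash script lib/phi-client.sh; the function names `build_redact_payload` and `redact_text` are hypothetical, and the 5-second timeout and default port simply mirror the values quoted in this commit message.

```python
import http.client
import json
import urllib.request


def build_redact_payload(text: str, score_threshold: float = 0.3) -> bytes:
    """Encode the JSON body for POST /redact (fields: text, score_threshold)."""
    return json.dumps({"text": text, "score_threshold": score_threshold}).encode("utf-8")


def redact_text(text: str, host: str = "127.0.0.1", port: int = 41189,
                timeout: float = 5.0):
    """Return the sidecar's parsed JSON reply, or None if the call fails.

    Returning None (rather than raising) models the "tier-5 no-ops when the
    sidecar is unreachable" behaviour; the caller decides what to do next.
    """
    request = urllib.request.Request(
        f"http://{host}:{port}/redact",
        data=build_redact_payload(text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return None
```

A caller would treat `None` as "stay in v0.8.1 mode" and otherwise read the `redacted` and `entities` fields of the reply.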

larry.sh:
- auto_detect_phi gains tier-5: after tiers 1-4, before status summary,
  source phi-client.sh, run Presidio on a token-masked copy of the input,
  tokenize each entity through hl7-sanitize.sh tokenize-value (category
  presidio_<TYPE>) so token IDs stay stable. Honors confirm + strict modes.
  Removed the v0.7.3 early-return that skipped past tier-5 when tiers 1-4
  found nothing — pure prose now always reaches tier-5.
- Token-safe substitution: existing [[...]] tokens are pulled to sentinels,
  tier-5 value is replaced, sentinels restored — prevents the token-within-
  token corruption that naive literal-replace caused on already-tokenized
  text. Acronym guard drops HL7/clinical jargon (SSN/MRN/DOB/ADT) Presidio
  over-tags as ORGANIZATION.
- Graceful degradation: sidecar unreachable → tier-5 no-ops with a one-time
  stderr warning. /phi-sidecar slash command + completion table.
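
The token-safe substitution and the acronym guard described above can be sketched as follows. This is a minimal Python illustration, not the larry.sh implementation (which is bash). The names `token_safe_replace` and `drop_jargon_orgs`, the use of Unicode Private Use Area characters as sentinels, and the example token spellings are all assumptions made for the sketch; it also assumes fewer than 6400 existing tokens and input that contains no Private Use Area characters.

```python
import re

TOKEN_RE = re.compile(r"\[\[.*?\]\]")   # an existing [[...]] token
PUA_BASE = 0xE000                        # first Private Use Area code point
JARGON = frozenset({"SSN", "MRN", "DOB", "ADT"})


def token_safe_replace(text: str, value: str, replacement: str) -> str:
    """Replace `value` everywhere except inside existing [[...]] tokens."""
    held = []

    def stash(match):
        # Swap each token for a one-character sentinel the value cannot contain.
        held.append(match.group(0))
        return chr(PUA_BASE + len(held) - 1)

    masked = TOKEN_RE.sub(stash, text)
    masked = masked.replace(value, replacement)
    for index, token in enumerate(held):
        # Put the original tokens back, untouched.
        masked = masked.replace(chr(PUA_BASE + index), token)
    return masked


def drop_jargon_orgs(text: str, entities: list) -> list:
    """Discard ORGANIZATION hits whose surface text is HL7/clinical jargon."""
    return [
        entity for entity in entities
        if not (entity["type"] == "ORGANIZATION"
                and text[entity["start"]:entity["end"]].upper() in JARGON)
    ]
```

With the input `[[NAME_0001]] asked about account 0001`, a plain `str.replace` of `0001` would also rewrite the digits inside the existing token; the sentinel pass leaves that token intact and replaces only the free-standing value.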

install-larry.sh:
- Probes python3 3.9+; offers to create $LARRY_HOME/phi-venv and install
  presidio + fastapi + uvicorn + en_core_web_sm. Skips silently (with a
  v0.8.1-mode note) on Cygwin/MobaXterm without python3, and on
  non-interactive pipe installs. Sets LARRY_PHI_VENV in the larry shim.

MANIFEST: three new lib files added for auto-sync.

Prototype validation (Bryan's Mac, Apple Silicon, Python 3.14):
  cold start (en_core_web_sm): ~9s   (vs ~82s if Presidio auto-grabs _lg;
                                       we pin _sm for the REPL budget)
  warm analyzer latency:       P50 20.6ms / P95 22.7ms
  end-to-end HTTP round-trip:  ~57ms warm; ~150ms first-post-startup
All comfortably under the 200ms-per-turn budget.

MobaXterm verdict: v0.8.2 is Mac/Linux-only. MobaXterm stays on v0.8.1 +
nudges, per Bryan's explicit acceptance. install-larry.sh enforces this
by platform detection; larry.sh tier-5 silently no-ops when the sidecar
is absent (which IS the MobaXterm path — no code is platform-gated).

Verification: bash -n clean on larry.sh + all 3 new lib scripts; python3
ast.parse clean on the sidecar; end-to-end tier-5 tested live against the
sidecar (pure prose, rule-pack+tier-5 combined with no token corruption,
!nophi bypass); strict-mode fail-closed abort tested; CR-taint, path-block,
and base64 round-trip batteries re-run green.

Co-Authored-By: Clover (Claude Opus 4.7) <noreply@anthropic.com>
2026-05-27 20:00:23 -07:00

#!/usr/bin/env python3
"""
larry-anywhere v0.8.2: Microsoft Presidio sidecar for free-text NER.

Closes V1 from Vera's PHI-leak audit (the dominant real-world failure mode:
patient names / addresses / un-keyworded dates in prose chat). Free-text PHI
flows past the v0.7.3 tier-1/2/3/4 classifier because that classifier is
HL7-segment-aware and keyword-driven, not a general entity recognizer.

This sidecar runs Microsoft Presidio (spaCy backend + custom recognizers)
as a persistent FastAPI service on 127.0.0.1:$LARRY_PHI_PORT (default 41189).
Larry's main loop hits it via curl as the LAST tier of auto-PHI detection.

Wire-up:
  - lib/phi-sidecar.sh: bash launcher / health-check / lifecycle
  - lib/phi-client.sh: bash client (phi_redact_text wrapper)
  - larry.sh:auto_detect_phi: calls phi_redact_text as tier-5
    (post-explicit-marker, post-tier-1-to-4, before sending input to model)
  - install-larry.sh: offers to pip-install presidio + spacy + en_core_web_sm

Benchmarks (Bryan's Mac, Apple Silicon, en_core_web_sm):
  cold start (model load):  ~9 seconds
  warm latency (P50/P95):   20ms / 22ms (analyzer only)
  HTTP round-trip warm:     ~57ms (curl --unix-socket via TCP fallback)

The first request after startup pays a ~150ms tokenizer-warmup tax; later
requests stay within the 200ms-per-turn REPL budget Bryan specified.

Failure mode: if Presidio fails to load (model missing, package broken),
the process exits non-zero. The bash launcher detects this and tells the
user. Larry's tier-5 silently no-ops when the sidecar is unreachable,
preserving v0.8.1 behavior on hosts where Presidio isn't installed.

Compatibility: requires Python 3.9+ (3.14 tested). MobaXterm/Cygwin
compatibility is gated by spaCy's C-extension wheels; if pip install
presidio_analyzer fails on Cygwin, this host stays on v0.8.1 + nudges
per Bryan's accepted tradeoff.
"""
from __future__ import annotations
import os
import sys
import time
import logging
logging.basicConfig(level=logging.WARNING, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("phi-sidecar")
try:
    from fastapi import FastAPI, Body
    from pydantic import BaseModel
    from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer
    from presidio_analyzer.nlp_engine import NlpEngineProvider
    from presidio_anonymizer import AnonymizerEngine
    import uvicorn
except ImportError as e:
    sys.stderr.write(f"phi-sidecar: missing dependency ({e}); install with:\n")
    sys.stderr.write(" pip install presidio_analyzer presidio_anonymizer fastapi uvicorn\n")
    sys.stderr.write(" python -m spacy download en_core_web_sm\n")
    sys.exit(3)
LARRY_PHI_PORT = int(os.environ.get("LARRY_PHI_PORT", "41189"))
LARRY_PHI_HOST = os.environ.get("LARRY_PHI_HOST", "127.0.0.1")
LARRY_PHI_MODEL = os.environ.get("LARRY_PHI_MODEL", "en_core_web_sm")
# Module-scope request model. MUST be module-level, not function-local —
# pydantic v2 + FastAPI introspection treats a closure-defined model as
# query params (the symptom: 'Field required' on a "query" location for
# the body param), which breaks the /redact endpoint silently.
class RedactReq(BaseModel):
    text: str
    score_threshold: float = 0.3  # below this confidence we ignore
def build_analyzer() -> AnalyzerEngine:
    """Load Presidio with en_core_web_sm (small/fast) + HL7-specific custom recognizers."""
    config = {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": LARRY_PHI_MODEL}],
    }
    nlp_engine = NlpEngineProvider(nlp_configuration=config).create_engine()
    analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"])
    # Custom recognizers tuned for HL7/Cloverleaf operator chat. These run
    # IN ADDITION TO Presidio's built-in PII recognizers (PERSON, LOCATION,
    # DATE_TIME, PHONE_NUMBER, US_SSN, EMAIL_ADDRESS, etc.).
    #
    # HL7_MRN: 6-12 digit numeric, looser than NPI's strict 10-digit rule.
    # Catches "check 623000286" prose where the keyword-based tier-2 missed.
    analyzer.registry.add_recognizer(
        PatternRecognizer(
            supported_entity="HL7_MRN",
            patterns=[Pattern("hl7_mrn_6_12", r"\b\d{6,12}\b", 0.30)],
            context=["mrn", "patient", "record", "account", "acct", "visit", "encounter", "csn"],
        )
    )
    # HL7_CARET_NAME: "SMITH^JOHN" / "SMITH^JOHN^Q" pattern outside Tier-3
    # context. The v0.7.3 Tier-3 only fires when PID.3/PID.5/etc. is in the
    # same line; this recognizer catches the caret-name itself. The optional
    # third component may be a single initial, so "^Q" is captured as well.
    analyzer.registry.add_recognizer(
        PatternRecognizer(
            supported_entity="HL7_CARET_NAME",
            patterns=[Pattern("caret_name", r"\b[A-Z][A-Z\-']+\^[A-Z][A-Z\-']+(\^[A-Z][A-Z\-']*)?\b", 0.85)],
        )
    )
    # HL7_PHONE_BARE: plain 10-digit number such as "5552345678" (no
    # dashes/parens); Tier 1 requires formatting. The area code and exchange
    # must each start with 2-9. The 0.20 base score sits below the default
    # score_threshold (0.3), so plain numbers in code stay safe unless a
    # context word raises the score.
    analyzer.registry.add_recognizer(
        PatternRecognizer(
            supported_entity="HL7_PHONE_BARE",
            patterns=[Pattern("phone_10_bare", r"\b[2-9]\d{2}[2-9]\d{6}\b", 0.20)],
            context=["phone", "tel", "telephone", "contact", "cell", "mobile"],
        )
    )
    return analyzer
def main():
    log.warning("loading presidio (this takes ~5-10 seconds the first time)...")
    t0 = time.time()
    analyzer = build_analyzer()
    anonymizer = AnonymizerEngine()
    log.warning(f"presidio ready in {(time.time()-t0)*1000:.0f} ms; listening on {LARRY_PHI_HOST}:{LARRY_PHI_PORT}")
    app = FastAPI(title="larry-phi-sidecar", version="0.8.2")

    @app.post("/redact")
    def redact(req: RedactReq = Body(...)):
        t0 = time.time()
        results = analyzer.analyze(text=req.text, language="en", score_threshold=req.score_threshold)
        anon = anonymizer.anonymize(text=req.text, analyzer_results=results)
        return {
            "redacted": anon.text,
            "entities": [
                {"type": r.entity_type, "start": r.start, "end": r.end, "score": r.score}
                for r in results
            ],
            "latency_ms": (time.time() - t0) * 1000,
        }

    @app.get("/health")
    def health():
        return {"status": "ok", "model": LARRY_PHI_MODEL, "port": LARRY_PHI_PORT}

    uvicorn.run(app, host=LARRY_PHI_HOST, port=LARRY_PHI_PORT, log_level="warning")


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        sys.exit(0)
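
To see what the three custom patterns match without installing Presidio, the regular expressions can be exercised with the standard `re` module alone. This is a rough sketch, not part of the sidecar: it ignores Presidio's scoring, thresholds, and context-word handling, and the helper `find_spans` is a hypothetical name. The caret-name pattern used here lets the third name component be a single initial, which is this sketch's own choice.

```python
import re

# Patterns modelled on the sidecar's custom recognizers.
MRN_RE = re.compile(r"\b\d{6,12}\b")
CARET_NAME_RE = re.compile(r"\b[A-Z][A-Z\-']+\^[A-Z][A-Z\-']+(\^[A-Z][A-Z\-']*)?\b")
PHONE_BARE_RE = re.compile(r"\b[2-9]\d{2}[2-9]\d{6}\b")


def find_spans(pattern: re.Pattern, text: str) -> list:
    """Return every substring of `text` that `pattern` matches, in order."""
    return [match.group(0) for match in pattern.finditer(text)]


if __name__ == "__main__":
    print(find_spans(MRN_RE, "check 623000286"))
    print(find_spans(CARET_NAME_RE, "pt SMITH^JOHN^Q admitted"))
    print(find_spans(PHONE_BARE_RE, "call 5552345678"))
    # The exchange digit must be 2-9, so this bare number is not phone-shaped.
    print(find_spans(PHONE_BARE_RE, "call 5551234567"))
```

Note that any 10-digit number that passes the phone pattern also passes the 6-12 digit MRN pattern, so in the real service the entity scores and context words decide which label applies.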