Origin moved to Gitea 2026-05-27. GitHub is push-mirror fallback for auto-update reachability.
Bryan Johnson 58e6bf4e03 v0.7.3: automatic PHI detection (tiered detection + blacklist contexts)
Adds automatic PHI tokenization on two surfaces: user input and HL7-shaped
tool results. Supersedes Bryan's reverted af2ffe8 prototype with a tiered
confidence model, explicit blacklist contexts, structured audit log, and
tool-result coverage.

Bryan's directive: "Err on the side of caution and tokenize anything you
think you may need to as long as it doesn't break the tools." Priority
order: (1) don't break tools (constraint), (2) catch all PHI (goal),
(3) minimize false positives (secondary).

Detection — four-tier model (first match wins per token):

  Tier 1 DEFINITE   SSN (with dashes), email, formatted phone, NPI with
                    explicit "NPI:" prefix. Always tokenize.
  Tier 2 CONTEXTUAL Numeric value preceded by MRN/Patient/DOB/Account/
                    Visit/Acct/Record/Birth within 20 chars. Always.
  Tier 3 HL7-CTX    Plausibly-PHI-shaped values when line mentions
                    PID.3/5/7/11/13/18, NK1.*, GT1.*, IN1.16-20.
                    Aggressive — prompts in confirm mode.
  Tier 4 KNOWN      Value already exists in $LARRY_HOME/sanitize/lookup.tsv.
                    Tier-4 scans the full set of categories actually present
                    in the table (not a hardcoded shortlist), so any
                    category Bryan has used before is checked.
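
A minimal sketch of how the first two tiers could be expressed in bash, assuming whitespace-delimited tokens and a caller that supplies the text to the left of each token. The function name `classify_token`, the regexes, and the output format are illustrative assumptions, not the shipped detector; NPI, Tier 3, and Tier 4 are left out.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of Tier 1 / Tier 2 classification. Names and regexes
# are assumptions made for illustration only.

# classify_token TOKEN LEFT_CONTEXT
# Prints "definite:CATEGORY", "contextual:KEYWORD", or "none".
# Checks run top to bottom, so the first match wins.
classify_token() {
  local tok=$1 left=$2
  local re_ssn='^[0-9]{3}-[0-9]{2}-[0-9]{4}$'
  local re_email='^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
  local re_phone='^[0-9]{3}[-.][0-9]{3}[-.][0-9]{4}$'
  local re_num='^[0-9]+$'
  local re_ctx='(MRN|Patient|DOB|Account|Visit|Acct|Record|Birth)'

  # Tier 1: formats that identify themselves.
  if [[ $tok =~ $re_ssn ]];   then echo "definite:SSN";   return; fi
  if [[ $tok =~ $re_email ]]; then echo "definite:EMAIL"; return; fi
  if [[ $tok =~ $re_phone ]]; then echo "definite:PHONE"; return; fi

  # Tier 2: a bare number counts only when a keyword appears in the
  # 20 characters to its left. The length guard keeps this bash 3.2 safe.
  local len=${#left} window=$left
  if (( len > 20 )); then window=${left:$((len-20))}; fi
  if [[ $tok =~ $re_num ]] && [[ $window =~ $re_ctx ]]; then
    echo "contextual:${BASH_REMATCH[1]}"
    return
  fi
  echo "none"
}
```

Called as `classify_token 5720501458 "messages for MRN "`, the sketch reports a contextual hit; the same digits with no keyword to the left fall through to `none`, matching verification scenario 4 (bare digits skipped).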

Blacklist contexts (NEVER tokenize, even on tier match):
  * Path-like (/, ./, ../, ~/, contains /)
  * HL7 field references like PID.18 — the digit after the dot is a
    field index, not an MRN (spec verification scenario #5)
  * Version strings (vN.N.N, semver) and ISO dates (overridden by
    explicit DOB/Birth context so "DOB 1980-01-15" still tokenizes)
  * Port keywords (:NNNN, port NNNN, tcp/udp NNNN, LISTEN/PORT=)
  * Error/status codes (error NNN, code NNN, HTTP NNN, rc=N)
  * JSON key position (value followed by ": or :)
  * Fenced code blocks (``` ... ``` skipped via awk redactor)
  * Timestamps (epoch ms 13+ digits, epoch s 10 digits starting 1)
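
The token-shape checks in this list can be sketched as a single predicate. This is an illustration under assumptions: `is_blacklisted` and its patterns are hypothetical, and the contexts that depend on neighbouring text (port keywords, error codes, JSON key position, fenced blocks, the DOB override for ISO dates) are not modelled here.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: is_blacklisted TOKEN returns 0 when the token has one
# of the never-tokenize shapes. Patterns are illustrative assumptions.
is_blacklisted() {
  local tok=$1
  local re_hl7ref='^[A-Z][A-Z0-9]{2}\.[0-9]+$'     # PID.18, NK1.2
  local re_version='^v?[0-9]+\.[0-9]+\.[0-9]+$'    # v0.7.3, 1.2.3
  local re_isodate='^[0-9]{4}-[0-9]{2}-[0-9]{2}$'  # 1980-01-15
  local re_epoch_ms='^[0-9]{13,}$'                 # epoch milliseconds
  local re_epoch_s='^1[0-9]{9}$'                   # 10-digit epoch seconds

  # Path-like: anything containing a slash, or starting with ~
  case $tok in
    */*|'~'*) return 0 ;;
  esac

  [[ $tok =~ $re_hl7ref ]]   && return 0
  [[ $tok =~ $re_version ]]  && return 0
  [[ $tok =~ $re_isodate ]]  && return 0
  [[ $tok =~ $re_epoch_ms ]] && return 0
  [[ $tok =~ $re_epoch_s ]]  && return 0
  return 1
}
```

A detector would run this check before acting on a tier match, so `PID.18` or `v0.7.3` is left alone even if a tier pattern would otherwise fire.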

Tool-result surface — routed through hl7-sanitize.sh:
  * Eligible tools: read_file (.hl7/.HL7/.txt/.TXT only), nc_msgs,
    hl7_field, hl7_diff
  * Eligibility further gated by _auto_phi_looks_like_hl7 shape check
    (segment headers MSH/PID/EVN/PV1 with | delimiter)
  * Generic outputs (list_dir, grep_files, bash_exec, glob_files, ssh_exec,
    web search) NEVER scanned — spec is explicit about this
  * For HL7-shaped content we use the canonical field-aware pipeline
    rather than the prose detector, since segments are pipe-delimited
    and would otherwise be a single whitespace token. Both pipelines
    share lookup.tsv so tokens are stable across surfaces.
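
The two-stage gate described above (tool eligibility, then a shape check) might look like the following. This is a hedged sketch: `looks_like_hl7` is a stand-in for `_auto_phi_looks_like_hl7`, whose real body is not shown here, and `eligible_for_sanitize` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the shape check: succeeds when any line of stdin
# starts with a known segment header followed by the | delimiter.
looks_like_hl7() {
  grep -Eq '^(MSH|PID|EVN|PV1)\|'
}

# eligible_for_sanitize TOOL PATH CONTENT
# Returns 0 only when the tool is on the eligible list, read_file paths have
# an allowed extension, and the content passes the shape check.
eligible_for_sanitize() {
  local tool=$1 path=$2 content=$3
  case $tool in
    read_file)
      case $path in
        *.hl7|*.HL7|*.txt|*.TXT) ;;
        *) return 1 ;;
      esac
      ;;
    nc_msgs|hl7_field|hl7_diff) ;;
    *) return 1 ;;   # list_dir, grep_files, bash_exec, ... are never scanned
  esac
  printf '%s\n' "$content" | looks_like_hl7
}
```

The sketch assumes newline-separated segments; content that uses carriage returns between segments would need normalising first.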

Behavior controls:
  * env LARRY_AUTO_PHI: 1/on (default), 0/off, confirm
  * /phi-auto on|off|confirm|status slash command
  * "!nophi " per-turn prefix override
  * Manual @@VALUE / {{phi:VALUE}} markers always win — preprocessed
    FIRST; auto-PHI fills gaps in things Bryan didn't manually mark.
  * After each pass, dim status line summarises:
      phi> auto-tokenized 3 value(s) [user_input]: MRN×1 EMAIL×1 SSN×1

Audit — JSONL log at $LARRY_HOME/log/auto-phi.log:
  { "ts": "...", "value": "...", "category": "...", "token": "...",
    "tier": "definite|contextual|hl7|known|hl7_pipeline",
    "surface": "user_input|tool_result", "context": "..." }
  Mode 0600, parent dir 0700. Best-effort write; never fails the host call.
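
A best-effort writer with these properties could be sketched as below. The function names are hypothetical; only the log path, field names, and file modes come from the description above. The escaper covers backslash, double quote, newline, and tab, and assumes no other control characters appear in the values.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a best-effort JSONL audit writer.

# Escape the characters that would break a JSON string literal.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}       # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# audit_phi VALUE CATEGORY TOKEN TIER SURFACE CONTEXT
audit_phi() {
  local dir="${LARRY_HOME:-$HOME/.larry}/log"
  local file="$dir/auto-phi.log"
  local ts
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  (
    umask 077           # anything created here starts owner-only
    mkdir -p "$dir" && chmod 700 "$dir" &&
    printf '{"ts":"%s","value":"%s","category":"%s","token":"%s","tier":"%s","surface":"%s","context":"%s"}\n' \
      "$ts" "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")" \
      "$(json_escape "$4")" "$(json_escape "$5")" "$(json_escape "$6")" >> "$file" &&
    chmod 600 "$file"
  ) 2>/dev/null
  return 0              # best-effort: a logging failure never fails the caller
}
```

Running the write in a subshell keeps the `umask` change local, and the unconditional `return 0` is what makes the write best-effort.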

Library changes (lib/hl7-sanitize.sh):
  * normalize_value: re-add EMAIL + PHONE arms + new NPI arm. EMAIL and
    PHONE arms were originally in af2ffe8 (reverted with v0.7.1) — cited
    in the source comments.
  * normalize-value subcommand: exposes canonical normalization so auto-PHI
    can build per-session memory keys. Originally af2ffe8.
  * lookup-original subcommand: probes the table for an exact match without
    creating new tokens. Used by Tier-4 "already-known" detection.

Implementation notes:
  * macOS bash 3.2 compatibility: ${pos: -20} returns empty when len < 20;
    use explicit ${pos:$((len-20))} guarded by length check.
  * Per-session decision cache (accept/decline) uses bash 4 associative
    arrays with a 3.2 fallback to pipe-delimited string membership.
  * Confirm-mode prompts only Tier 3-4 — Tier 1-2 hits are high-confidence
    and always tokenize even in confirm mode (Bryan: err on caution).
  * Detection loop iterates line-by-line so fenced-code redaction works
    and so left/right context is meaningful per token.
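
The first two notes can be illustrated with a short sketch. `last_n`, `seen_add`, and `seen_has` are hypothetical names; the pipe-delimited fallback assumes stored values never contain a `|` character.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the two bash 3.2 workarounds.

# last_n STRING N: prints the last N characters, or the whole string when it
# is shorter than N. A negative offset such as ${s: -20} expands to nothing
# when the string has fewer than 20 characters, hence the explicit guard.
last_n() {
  local s=$1 n=$2 len=${#1}
  if (( len > n )); then
    printf '%s' "${s:$((len-n))}"
  else
    printf '%s' "$s"
  fi
}

# Pipe-delimited membership: a stand-in for an associative array on shells
# that lack them. Values must not contain "|".
SEEN="|"
seen_add() { SEEN="${SEEN}$1|"; }
seen_has() {
  case $SEEN in
    *"|$1|"*) return 0 ;;
  esac
  return 1
}
```

Wrapping each stored value in delimiters on both sides is what prevents a partial match, so `fo` is not reported as present after `foo` has been added.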

Verification matrix (18/18 pass):
  1 SSN tokenized, 2 Email tokenized, 3 MRN contextual,
  4 bare digits skipped, 5 PID.18 skipped, 6 path skipped,
  7 version skipped, 8 port skipped, 9 Tier-4 known catches custom
  category (EMP), 10 !nophi skips, 11 existing token left alone,
  12 read_file .hl7 sanitizes all PHI fields, 13 .py not HL7-shaped,
  14 list_dir not HL7-shaped, 15 mode=off skips, 16a /phi-auto off
  skips, 16b /phi-auto on tokenizes, 17 audit JSONL parseable.

No regressions to v0.7.2 origin switching, v0.7.1 status-line position,
v0.7.0 HL7 completion + mouse mode, v0.6.9 status state, v0.6.7 streaming,
or any earlier OAuth/SSH/lessons work. MANIFEST unchanged.

Divergence from af2ffe8 (cited in source comments):
  * Tiered classifier (vs. flat regex set) — enables reasoning about WHY
    a value tokenized; gates confirm-mode behavior.
  * Explicit blacklist contexts — addresses spec false-positive cases
    that af2ffe8 missed (HL7 field refs, ports, error codes, JSON keys).
  * Tool-result surface — af2ffe8 only ran on user input.
  * Structured JSONL audit log — af2ffe8 had no per-tokenization log.
  * /phi-auto semantics: on|off|confirm|status (spec) vs. af2ffe8's
    /auto-phi on|off|aggressive|confirm.
  * Dropped the loose "Title Case Title Case" pair detector and its
    name-allowlist — too high FP rate against narrative prose
    ("Larry Anywhere", "Mac Studio") and Bryan's name-allowlist couldn't
    keep up with the long tail. Name detection now Tier-3 (HL7-context
    only) and Tier-4 (already-known) only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 17:37:26 -07:00
agents             v0.3.3: PHI sanitize/desanitize + {{phi:...}} prompt preprocessing  2026-05-26 10:29:20 -07:00
lib                v0.7.3: automatic PHI detection (tiered detection + blacklist contexts)  2026-05-27 17:37:26 -07:00
.gitignore         v0.3.0: initial release of Larry-Anywhere  2026-05-26 09:46:20 -07:00
install-larry.sh   v0.7.2: Gitea becomes primary auto-update origin; GitHub demoted to fallback  2026-05-27 17:25:00 -07:00
larry-auth.sh      v0.3.1: OAuth subscription auth + offline manual cheat sheet  2026-05-26 09:57:44 -07:00
larry-rollback.sh  v0.3.0: initial release of Larry-Anywhere  2026-05-26 09:46:20 -07:00
larry-tunnel.sh    v0.3.0: initial release of Larry-Anywhere  2026-05-26 09:46:20 -07:00
larry.sh           v0.7.3: automatic PHI detection (tiered detection + blacklist contexts)  2026-05-27 17:37:26 -07:00
MANIFEST           v0.7.0: HL7-aware tab completion + REPL mouse mode  2026-05-27 16:15:11 -07:00
MANUAL.md          v0.4.3: cross-env bundle for regression — no direct peer protocol needed  2026-05-26 11:25:02 -07:00
README.md          v0.3.0: initial release of Larry-Anywhere  2026-05-26 09:46:20 -07:00
VERSION            v0.7.3: automatic PHI detection (tiered detection + blacklist contexts)  2026-05-27 17:37:26 -07:00

Larry-Anywhere

Portable AI agent for Cloverleaf integration work. Single bash script, no installs, no root, no package manager. Runs on Linux and inside MobaXterm on Windows. 26 native v3 tools for NetConfig analysis, message search, system documentation, regression testing, and safe NetConfig modification — all implemented directly in bash with no dependency on v1 wrapper scripts or v2 cloverleaf-tools.pyz.

When Cloverleaf is installed, Larry uses the shipped product binaries (tclsh, hcienginerun, etc.) directly. Otherwise it falls back to bash one-liners it composes itself. Never relies on the v1/v2 wrapper layers.

Install

On any client box with curl and bash (essentially any Linux + MobaXterm shell):

curl -fsSL https://raw.githubusercontent.com/bojj27/cloverleaf-larry/main/install-larry.sh | bash

The installer:

  • Detects platform (Linux / Darwin / MobaXterm-cygwin) and arch
  • Creates ~/.larry/ (or wherever $LARRY_HOME points)
  • Pulls every script + agent file from bojj27/cloverleaf-larry raw URLs
  • Downloads a static jq binary into ~/.larry/bin/ if jq isn't on PATH
  • Drops a larry shim into ~/bin/
  • Makes no system changes, requires no root

First run:

larry                              # prompts for ANTHROPIC_API_KEY once
                                   # saved to ~/.larry/.env mode 0600

Auto-update

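Every time you run larry, it self-updates from its configured origin (LARRY_UPDATE_URL; as of v0.7.2, Gitea is the primary origin and GitHub is the fallback mirror). To suppress the update for one launch: larry --no-update. To disable it permanently: export LARRY_NO_UPDATE=1.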

Offline / scp install (when the client box can't reach github.com)

# from a machine that CAN reach github
git clone https://github.com/bojj27/cloverleaf-larry
scp -r cloverleaf-larry/ user@client-box:~/cloverleaf-larry/
ssh user@client-box
cd ~/cloverleaf-larry && ./install-larry.sh

The installer detects local files and uses them when LARRY_BASE_URL isn't reachable.

Use

Set the Cloverleaf runtime context, then point Larry at your site:

export HCIROOT=/opt/cloverleaf/cis2025/integrator
export HCISITE=adt
larry "$HCIROOT/$HCISITE"

you> list every protocol in this site
you> find threads with codametrix in the name
you> show messages from to_3m in the last 3 days for MRN 5720501458
you> generate jump threads for every TCP-listener inbound, target host=newlinux01.test, jump port = orig+10000
you> diff the ADTto_3m interface + connected threads between test and prod
you> document the codametrix system into ~/.larry/knowledge/codametrix.md
you> /quit

What Larry can do natively (v3 tools)

domain                        tools
File system                   read_file, list_dir, grep_files, glob_files, write_file, bash_exec
NetConfig (read)              nc_list_protocols, nc_list_processes, nc_protocol_block, nc_protocol_field, nc_protocol_nested, nc_protocol_summary, nc_destinations, nc_sources, nc_xlate_refs, nc_tclproc_refs
NetConfig (write, journaled)  nc_insert_protocol, nc_add_route
Workflows                     nc_find_inbound, nc_make_jump, nc_document, nc_find, nc_diff_interface
Messages (smat is SQLite!)    hl7_field, nc_msgs, hl7_diff
Safety                        larry_rollback_list + larry-rollback.sh CLI

Every write goes through a journal (~/.larry/journal/<session>/) — original snapshotted, diff saved, atomic replacement. Roll back any subset with larry-rollback.sh --list, --target /path/to/file, --session <id>, or --entry <id>.
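
The snapshot, diff, and atomic-replace sequence can be sketched as follows. This is an illustration under assumptions, not Larry's writer: `journaled_write`, the `LARRY_SESSION` variable, and the entry layout (a directory holding `original` and `change.diff`) are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a journaled write.
# journaled_write TARGET NEW_CONTENT
journaled_write() {
  local target=$1 new_content=$2
  local jdir="${LARRY_HOME:-$HOME/.larry}/journal/${LARRY_SESSION:-demo-session}"
  local n entry tmp
  mkdir -p "$jdir" || return 1
  n=$(ls "$jdir" | wc -l | tr -d ' ')
  entry=$(printf '%s/%03d_%s' "$jdir" "$((n + 1))" "$(basename "$target")")
  mkdir "$entry" || return 1

  # 1. Snapshot the original (an empty snapshot if the target is new).
  if [ -f "$target" ]; then
    cp -p "$target" "$entry/original"
  else
    : > "$entry/original"
  fi

  # 2. Stage the new content beside the target so the final rename stays on
  #    one filesystem; copying first carries over the file mode.
  tmp=$(mktemp "$target.XXXXXX") || return 1
  [ -f "$target" ] && cp -p "$target" "$tmp"
  printf '%s\n' "$new_content" > "$tmp"

  # 3. Save the diff. diff exits 1 when the files differ, which is expected.
  diff -u "$entry/original" "$tmp" > "$entry/change.diff" || true

  # 4. Atomic replacement: a same-filesystem mv is a single rename.
  mv -f "$tmp" "$target"
}
```

Because the staged file is renamed over the target, a reader sees either the old file or the new one, never a half-written mix.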

Slash commands in the REPL

command        what
/env           show detected HCIROOT/HCISITE + tool layer presence
/sites         list site dirs under HCIROOT
/site <name>   switch HCISITE mid-session
/cd <path>     change working directory
/model <name>  switch Claude model
/reset         clear conversation history
/load <file>   load a file as your next message
/help          full slash-command help

Working examples (battle-tested against a 22-site Cloverleaf install)

  1. Migration jump-threads: "find every TCP-listener inbound, generate the 3-thread jump pair (linux_out / windows_in / windows_out) for each." Inserts via journaled write. Roll back instantly.
  2. MRN search: "messages from to_3m in last 3 days for patient MRN X." Reads smat via sqlite3 -ascii, parses HL7 natively, filters by PID field — no Cloverleaf binary involved.
  3. System documentation: "find all threads matching , document them." Cross-site walk, threads + ports + processes + xlates + tclprocs, adjacent-thread map, placeholder POC/status/escalation sections.
  4. Interface diff: "diff ADTto_3m + connected (depth 1) between test and prod." Connected-graph BFS, protocol-block diff + xlate-file diff + tclproc-file diff.
  5. Regression diff (Phase 6): hl7_diff compares any two HL7 message files, with --ignore MSH.7 applied by default and configurable field-level exceptions. The orchestrator that drives Cloverleaf's route_test end-to-end is the only Example 6 piece still pending, because it needs an engine to invoke against.
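
To make the regression-diff idea in item 5 concrete, here is a minimal sketch of comparing two HL7 files while ignoring MSH.7 (the message timestamp). It is not the shipped hl7_diff tool; `mask_msh7` and `hl7_diff_sketch` are hypothetical names, and only the single MSH.7 exception is modelled.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not the shipped hl7_diff tool.

# mask_msh7 FILE: prints FILE one segment per line with MSH.7 blanked.
# In the MSH segment the field separator itself is MSH.1, so the 7th
# pipe-delimited element of the line is MSH.7.
mask_msh7() {
  # HL7 segments are separated by carriage returns; normalise to newlines.
  tr '\r' '\n' < "$1" |
    awk -F'|' 'BEGIN { OFS = "|" } /^MSH\|/ { $7 = "" } NF { print }'
}

# hl7_diff_sketch FILE_A FILE_B: exits 0 when the files match apart from
# MSH.7, non-zero otherwise (diff output shows what changed).
hl7_diff_sketch() {
  diff <(mask_msh7 "$1") <(mask_msh7 "$2")
}
```

Two captures of the same message taken at different times would then compare as equal, while a change in a PID field still shows up in the diff.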

Architecture in one diagram

  Agent layer        Larry-Anywhere (this repo)
                     ├── bash REPL → Anthropic API
                     ├── personas: Larry + Clover + Regress + Cheatsheet
                     ├── 26 native tools (no v1/v2 deps)
                     └── journal-backed writes with rollback
                                       │
                                       ↓ acts on
  Cloverleaf install  $HCIROOT / $HCISITE
                      NetConfig, Xlate/, tables/, tclprocs/, formats/
                      .smatdb files (SQLite!) under exec/processes/
                      shipped binaries (tclsh, hcienginerun, ...) — invoked
                      directly via bash_exec when needed for engine ops

No layer between Larry and Cloverleaf except plain bash. The v1 wrapper scripts (tbn, hlq, mr, mp, mg, awkcut, ...) and the v2 cloverleaf-tools.pyz are intentionally absent.

Environment cheat-sheet

var                default                                           purpose
LARRY_HOME         ~/.larry                                          where state lives (sessions, journal, .env, agent overrides)
LARRY_MODEL        claude-sonnet-4-6                                 Claude model (try claude-opus-4-7 for deeper work)
LARRY_MAX_TOKENS   8192                                              per-turn output cap
LARRY_NO_UPDATE    0                                                 set to 1 to disable self-update
LARRY_UPDATE_URL   github.com/bojj27/cloverleaf-larry/main/larry.sh  self-update source
LARRY_AGENTS_URL   github.com/bojj27/cloverleaf-larry/main/agents    persona refresh source
ANTHROPIC_API_KEY  (prompted on first run)                           API key, saved to $LARRY_HOME/.env
HCIROOT / HCISITE  (unset)                                           auto-detected and surfaced in system prompt

Roll back any change Larry made

larry-rollback.sh --list                                # see every write Larry made, newest first
larry-rollback.sh --target /opt/cloverleaf/.../NetConfig  # undo every change to this file
larry-rollback.sh --session 2026-05-26-090724-12345     # undo a whole Larry session
larry-rollback.sh --last 1                              # undo the most recent write
larry-rollback.sh --entry <session>/<NNN_filename>      # undo one specific write

Pre-rollback copies are left at <target>.larry-prerollback.<unix-ts> so you can re-do if needed.

Hard limits (V3)

  • No subagent dispatch — Larry + Clover + Regress live in one head. No Pax / Iris / Vera / etc. in portable mode.
  • No memory layer — Honcho / Hindsight / mem0 aren't reachable from a remote client box yet. Session history is the markdown logs in $LARRY_HOME/sessions/.
  • read_file capped at 250 KB, grep_files/glob_files 300 results, bash_exec 500 lines of output. Use targeted queries.
  • Subscription OAuth not yet wired — API key path only. Claude.ai Max subscription quota uses a different auth flow (OAuth device-code); landing in a future release.

Reverse SSH tunnel back home (optional)

If you also want your home Larry to dial into the client shell:

~/.larry/larry-tunnel.sh --serveo                          # zero-config (serveo.net, third-party)
~/.larry/larry-tunnel.sh --hop=user@bjnoela.com:22         # your controlled hop

Auto-reconnect built in. PID and public URL written to ~/.larry/tunnel.{pid,url}.

License

GPL? MIT? TBD. Bryan decides before this repo gets shared widely.

Issues / PRs

github.com/bojj27/cloverleaf-larry