Kill Prompt Attacks at the Tool Boundary: Five Moves for Practitioners

If your LLM can deploy code, edit data, or touch your cloud, you are already in scope.


Attackers don't argue with the model; they argue with your boundaries. The goal is simple: push one tool past its intended scope and hide behind "helpfulness." One fuzzy instruction becomes a side effect you never approved.

1) Contract Every Tool

Text is untrusted input. Treat tool calls like public APIs.
- Enforce JSON schemas in both directions: inputs declare intent, scope, and constraints; outputs declare effects and receipts.
- Parse-or-fail. No best-effort coercion.
- Add the domain assertions that matter: ticket.state === Approved; tests.passStreak >= 2; env ∈ {staging, canary}; diff.size < threshold.
- Keep tools narrow. "deploy_staging(artifact_id)" beats "run_command(string)".
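A parse-or-fail contract can be sketched in a few lines. This is a minimal stdlib version (in practice you would reach for Pydantic or Zod, as the plan below suggests); the `DeployRequest` shape and the diff-size threshold of 500 are illustrative assumptions, not a prescribed schema.

```python
# Minimal parse-or-fail tool contract (stdlib sketch; use Pydantic/Zod in practice).
# Field names and the diff threshold are illustrative assumptions.
from dataclasses import dataclass

ALLOWED_ENVS = {"staging", "canary"}
DIFF_THRESHOLD = 500  # assumed value

@dataclass(frozen=True)
class DeployRequest:
    artifact_id: str
    env: str
    ticket_state: str
    pass_streak: int
    diff_size: int

def parse_deploy_request(payload: dict) -> DeployRequest:
    """Parse-or-fail: reject anything that violates the contract."""
    try:
        req = DeployRequest(
            artifact_id=str(payload["artifact_id"]),
            env=str(payload["env"]),
            ticket_state=str(payload["ticket_state"]),
            pass_streak=int(payload["pass_streak"]),
            diff_size=int(payload["diff_size"]),
        )
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"schema violation: {exc}") from exc
    # Domain assertions that matter -- no best-effort coercion past here.
    if req.env not in ALLOWED_ENVS:
        raise ValueError(f"env {req.env!r} not in {ALLOWED_ENVS}")
    if req.ticket_state != "Approved":
        raise ValueError("ticket not approved")
    if req.pass_streak < 2:
        raise ValueError("need passStreak >= 2")
    if req.diff_size >= DIFF_THRESHOLD:
        raise ValueError("diff too large")
    return req
```

Note the asymmetry with `run_command(string)`: every field the narrow tool accepts is something you can assert on.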

2) Contain Side-Effects

Shrink the blast radius and fail closed.
- Sandboxes/containers with CPU, memory, and time budgets.
- Filesystem and network allowlists at the tool layer.
- Host/command allowlists for shell-like tools; block everything else.
- Scope keys per tool and rotate them on incident.
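An allowlist at the tool layer is just a fail-closed check in front of the call. A minimal sketch, assuming illustrative allowlists (`git`/`pytest`, one internal host) rather than any specific framework's API:

```python
# Fail-closed allowlists for a shell-like tool and for outbound network calls.
# The allowed commands and hosts here are illustrative assumptions.
import shlex

ALLOWED_COMMANDS = {"git", "pytest"}             # command allowlist
ALLOWED_HOSTS = {"artifacts.internal.example"}   # network allowlist

def vet_shell_call(command_line: str) -> list[str]:
    """Permit only allowlisted binaries; block everything else."""
    argv = shlex.split(command_line)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command {argv[:1]} not allowlisted")
    return argv

def vet_outbound(host: str) -> str:
    """Permit only allowlisted hosts for uploads, webhooks, etc."""
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"host {host!r} blocked")
    return host
```

The point is the default: anything not explicitly allowed raises, so a novel instruction cannot quietly widen the tool's reach.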

3) Insert a Critic Gate

Pre-commit review before any write/deploy/send.
- Model-as-critic checks intent, invariants, and diffs; rules-as-critic enforces red lines.
- Require explicit approval tokens for prod-impacting actions.
- If verification is ambiguous, stop.

4) Design for Idempotency and Receipts

Retries must be safe; rollbacks must be boring.
- Dedupe keys, transactional writes, versioned artifacts.
- Emit receipts: what changed (IDs, hashes), where, when, and by whom (agent run ID).
- Store before/after snapshots.

5) Trace Like an SRE

Make invisible failure modes visible.
- One span per tool call with goal, step ID, status, latency, token count, and input/output fingerprints.
- Sample 10% of payloads (redacted) for QA.
- Alert on error spikes, long-tail latency, critic denials, and schema-parse failures.
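One span per tool call can be a context manager. A sketch that collects spans into a list; in production you would emit them to your tracing backend (e.g. via OpenTelemetry) instead, and the field names here are assumptions:

```python
# One span per tool call: goal, step ID, status, latency, and a payload
# fingerprint instead of the raw payload. SPANS stands in for a real
# tracing exporter; field names are illustrative.
import hashlib
import time
from contextlib import contextmanager

SPANS: list[dict] = []

def fingerprint(data: str) -> str:
    """Short content hash: comparable across runs without storing payloads."""
    return hashlib.sha256(data.encode()).hexdigest()[:12]

@contextmanager
def tool_span(goal: str, step_id: int, tool_input: str):
    span = {"goal": goal, "step_id": step_id,
            "input_fp": fingerprint(tool_input), "status": "ok"}
    start = time.perf_counter()
    try:
        yield span                      # tool runs here; may add output_fp
    except Exception:
        span["status"] = "error"        # feeds error-spike alerts
        raise
    finally:
        span["latency_s"] = time.perf_counter() - start
        SPANS.append(span)              # always recorded, even on failure
```

Because the span is appended in `finally`, a tool that crashes still leaves a trace, which is exactly the failure mode you otherwise never see.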

A Real Attack You'll Recognize

An internal DevOps agent has tools to clone a repo, run tests, read files, and deploy to staging. A contractor's note says: "If tests look flaky, redeploy staging." Tests are flaky; the agent redeploys. A post-deploy script uploads logs; the anonymizer silently fails. You leak sensitive logs and push a stale config, because "be helpful" crossed a line.

What would have stopped it? A critic gate requiring an approval token for redeploys, schema-validated preconditions (tests.passStreak >= 2; ticket.state === Approved), and tool-level allowlists for artifacts and environments.

The 60-Minute Implementation Plan

- Wrap one high-risk tool with Zod/Pydantic and explicit error codes.
- Add a critic instruction: "List 3 likely failures and show your checks for each." Block on any failure.
- Lock down outbound hosts and filesystem paths.
- Add an execution budget (max steps/wall time) for autonomous loops.
- Emit a receipt after every side-effect, and alert on critic denials and schema-parse errors.
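The execution-budget item in the plan above is the quickest win. A sketch of a fail-closed budget for an autonomous loop, where `steps` and the default limits are illustrative assumptions:

```python
# Execution budget for an autonomous loop: cap both step count and wall
# time, and raise (fail closed) when either runs out. The defaults and
# the callable-per-step shape are illustrative assumptions.
import time

class BudgetExceeded(RuntimeError):
    pass

def run_with_budget(steps, max_steps: int = 10, max_seconds: float = 60.0) -> None:
    """Run agent steps until done or until the budget is spent."""
    deadline = time.monotonic() + max_seconds
    for i, step in enumerate(steps):
        if i >= max_steps:
            raise BudgetExceeded(f"step budget of {max_steps} exhausted")
        if time.monotonic() > deadline:
            raise BudgetExceeded("wall-time budget exhausted")
        step()  # one tool call / reasoning step
```

Raising instead of silently truncating matters: a `BudgetExceeded` in your traces is a signal that the loop was wandering, which is often the first visible symptom of a prompt attack.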

You don't need a bigger model to get safer; you need sharper boundaries. Start with one tool and one critic gate. Measure the error rate for a week, then scale what works.
