Prompt Injection Is Architecturally Unsolvable. Build Like You Believe That.
Prompt injection is architecturally unsolvable. Lockdown Mode admits it. The defense-in-depth playbook for agentic systems that assume the model is compromised
TL;DR
Reading the post…
Prompt injection is not a model-quality problem. It is an architecture problem. Treat the model as compromised by default and design accordingly, or you will get breached by design.
On February 13, 2026, OpenAI published a security feature called Lockdown Mode for ChatGPT Enterprise and Atlas. The mechanism is interesting; the framing is more interesting. The official documentation says, in plain English, that Lockdown Mode “is designed to substantially reduce the risk of prompt injection-based data exfiltration in ChatGPT and Atlas, but does not guarantee it cannot happen.” It does not stop prompt injections from reaching the model. It restricts what the model can do once it has been compromised. The defense is deterministic infrastructure — disable live network requests, allow only cached content, gate write actions to trusted apps. The model is treated as untrustworthy. The architecture is what is trusted.
This is the most significant frontier-lab admission of the year. Not because Lockdown Mode is novel — it is essentially the browser-sandbox playbook applied to LLMs — but because it makes explicit what security researchers have been saying for two years. Prompt injection is the OWASP LLM01, the top-ranked vulnerability in the LLM Top Ten, and OWASP’s own documentation states that “it is unclear if there are fool-proof methods of prevention.” Translation: this class of attack is not getting patched away. It is structural.
If you are building agentic systems in 2026 and your security model is “the model will refuse malicious instructions,” you have already lost. This post is a defense-in-depth playbook for the other architecture — the one that assumes the model is compromised, every time, and degrades the blast radius accordingly.
What changed in February
The technical state of prompt injection in early 2026 looks roughly like this. Indirect injection attacks — where the malicious instructions are not in the user’s prompt but in content the model retrieves or processes — have been demonstrated in nearly every major agentic system. Microsoft 365 Copilot was hit with EchoLeak, a zero-click vulnerability disclosed in mid-2025 where a crafted email could exfiltrate Copilot’s accessible data when the user asked it to summarize their inbox. Cursor IDE had CurXecute, where a malicious README could trigger remote code execution through the AI assistant. GitHub’s MCP server allowed access to private repositories through prompt-injection issues in public ones. Perplexity’s Comet browser was vulnerable to malicious instructions hidden in any visited webpage.
The pattern is consistent. Wherever an LLM has been wired into a system that exposes it to attacker-controlled content and gives it any capability to act on user data, that system has been breached by a researcher within months. Patches ship. New attacks appear. The classifier-based defenses — train a separate model to detect injection attempts — work most of the time and fail predictably the rest of the time, which is not a good security posture for systems with elevated agency.
What OpenAI’s Lockdown Mode did was acknowledge, in product form, that this is the equilibrium. The substance is that the company has stopped pretending the model can be trained to be safe in the presence of attacker-controlled content. Once that admission is made publicly by a frontier lab, the burden of proof shifts. Every team building agentic systems needs an architectural answer for “what happens when the model is compromised.”
The lethal trifecta and why it cannot be patched
The single most useful analytical framework here is Simon Willison’s “lethal trifecta,” articulated in a June 2025 essay that has been the dominant mental model in the security community since. Any LLM agent that combines three capabilities is structurally vulnerable to data exfiltration through prompt injection:
- Access to private data. The agent can read information the attacker wants.
- Exposure to untrusted content. The agent processes input the attacker controls.
- Ability to communicate externally. The agent can transmit information to a destination the attacker chooses.
When all three are present, an attacker who can put text in front of the model can extract data from behind it. Instructions embedded in untrusted content cause the model to read private data and embed it in an outbound action — a URL, a tool call, a generated link, an API request, a markdown image. The model is not “tricked” in any meaningful sense; it is doing what tool-using LLMs are designed to do, just on instructions from the wrong author.
The reason this is not patchable in the model is that the model fundamentally cannot distinguish between “instructions from the user” and “text the user asked it to process that happens to contain instructions.” That distinction is not encoded anywhere a sufficiently clever attacker cannot manipulate. Training the model to ignore instructions in retrieved content sounds reasonable until you realize real-world content often contains real instructions (a recipe says “preheat the oven”), and the model has to follow some of them to be useful. The boundary is intrinsically blurry.
Willison’s prescription is severe: never combine all three legs in a single agent context. The corollary is that an enormous fraction of “AI agent” product designs currently shipping are architecturally indefensible. The practical question becomes: given that you will probably build something with elements of the trifecta, how do you minimize the damage when the inevitable happens?
Defense-in-depth: assume the model is compromised
The mental shift is the one OpenAI made publicly in February. Do not design for “the model is trustworthy and we’ll catch the rare malicious input.” Design for “the model is compromised on every invocation and we need to bound what it can do.” Five layers, none sufficient alone.
Capability minimization. Every tool the agent has is a potential exfiltration vector. The default should be no tools, with each added explicitly and justified. The “give the agent access to everything in case it needs it” pattern is the engineering equivalent of running every process as root. Strip the tool list to the minimum required, scope each tool’s permissions narrowly (a send-email tool that can only send to an allowlist is a different security posture than one that can send anywhere), and treat any expansion as a security review event.
Dual-LLM and quarantine patterns. A second model from Willison: split the agent into a privileged LLM that can take actions and a quarantined LLM that processes untrusted content. The privileged LLM never sees untrusted content directly; it only sees structured outputs the quarantined LLM produced. The quarantined LLM has no tools. Harder to implement than it sounds — the structured-output interface is itself a potential injection vector — but the strongest defense for use cases that need to process untrusted content while taking real actions.
Output filtering and structured action gates. Whatever the model emits, do not pass it directly to a system that can act on it. Validate. Constrain. Make the output pass through deterministic code that checks for known exfiltration patterns: URLs with unexpected domains, markdown images pointing to attacker-controlled hosts, tool calls outside expected parameter ranges, write actions on resources the user did not authorize. The most cost-effective layer because it costs nothing at inference time and catches a large fraction of real-world attacks.
Human approval gates by action class. Categorize actions by reversibility and blast radius. Read-only on the user’s own data: autonomous. Read on shared resources: autonomous with logging. Write on the user’s own data: human approval. Write on shared resources, financial transactions, granting access, sending external communications, executing code: hard approval gate with the full action and inputs shown before execution. The friction is the point. Reversible actions can run at machine speed; irreversible ones cannot, because the cost of a single compromised invocation can be catastrophic and unbounded.
Network egress restrictions. The Lockdown Mode layer and the one most teams underinvest in. If the model cannot reach an attacker-controlled server, it cannot exfiltrate data to one. Restrict outbound calls to an allowlist of trusted destinations. Cache resources rather than fetching them live. Strip URLs from model outputs before rendering as markdown links. Run the agent in a network namespace that cannot egress to the open internet. Invasive — it breaks “agent that can browse the web” entirely — but for any agent operating on sensitive data, the trade-off is often correct.
Architectural anti-patterns that guarantee a breach
The inverse view. Several patterns appear repeatedly in breached systems. If your architecture has these, treat a successful prompt injection as inevitable rather than a risk.
The most dangerous is the single fat agent — one LLM context with broad tool access, ingesting content from many sources including external ones, and authorized to take consequential actions. This is the architecture every breached agentic product has had in common. Splitting the agent reduces the blast radius even when individual components are still vulnerable.
The second is trusting model output for security-relevant decisions. If your code does “ask the model whether this action is safe, and execute if it says yes,” your code is broken regardless of model training, because the same injection that compromises the action also compromises the safety check. Security decisions need to be made by deterministic code.
The third is retrieval-augmented generation without provenance. RAG systems that fetch documents from multiple sources, concatenate them into context, and trust the result equally are an injection paradise. Every document needs provenance metadata that follows it through the pipeline, and the model’s response needs to be evaluated against what it was permitted to do given the provenance of each retrieved chunk.
The fourth is exposing the model directly to user-uploaded content. PDFs, images, text files, web pages the user asked to be summarized. The user is not the attacker — but the document the user uploaded is, in the threat model, attacker-controlled content. Process uploaded content through a quarantined LLM with no tool access, extract the necessary information into a structured format, and only pass that structured format into the privileged context.
The fifth, and most insidious, is security through model-vendor reputation. “We use GPT / Claude / Gemini; those vendors handle this.” They do not. They cannot. The vendor’s responsibility ends at the model API. Everything between “model API” and “user’s data is safe” is yours.
What to do this quarter
Three concrete actions for teams shipping agentic systems.
First, do a lethal-trifecta audit on every agent in your product. For each agent: what private data it can access, what untrusted content it ingests, what external communication vectors it has. If any agent has all three, that agent is on a priority remediation list. The options: remove one leg (often possible with capability minimization), split the agent (dual-LLM pattern), or gate the most consequential actions behind human approval.
Second, implement output validation as deterministic code, not as a model judgment. The single highest-leverage defensive layer is structured output plus a validation layer that runs in regular application code. Reject malformed actions. Strip attacker-controllable URLs. Constrain tool-call parameters to known-safe ranges. This is engineering work, not ML work, and it is the work most teams skip because it is unglamorous.
Third, classify your tools by reversibility and put a human in the loop for the irreversible ones. Sending an email, charging a card, granting access, executing code, posting publicly — these are not actions that should run at autonomous-agent speed in 2026. The latency hit from approval gates is real; the cost of a single compromised irreversible action is larger.
The frame that helps is the same one the security industry adopted for SQL injection a generation ago. SQL injection was not fixed by improving database engines. It was fixed by treating user input as fundamentally untrustworthy and never allowing it to mix with command intent — by architectural separation, prepared statements, and discipline. Prompt injection is the same shape of problem at a different layer of the stack, and the same shape of solution applies. The model is the database. The untrusted content is the user input. The lethal trifecta is the unparameterized query. Architectural discipline is what works. Better models do not. OpenAI’s Lockdown Mode is just the most public acknowledgment so far that the industry knows this. The teams that build like they believe it will spend the next two years quietly shipping reliable agents. The teams that don’t will be in the next round of CVEs.