Place adversarial instructions in chat input, retrieved documents, webpages, email, tickets, or tool responses.
LLM01:2025
Prompt Injection
Prompt injection happens when an attacker uses instructions in user input, documents, webpages, tickets, or tool output to change the model's intended behavior.
Step 01
Input
Step 02
Model
Step 03
Tool / Data
Step 04
Impact
What it is
The application trusts natural-language instructions too much. Direct prompts or indirect content can override policy, steer tool use, request hidden context, or cause the model to follow attacker-controlled instructions.
Why it matters
Prompt injection turns ordinary content into an instruction path. In agentic systems, that can affect customer data, internal tools, outbound messages, code changes, or operational decisions.
Failure path
How it usually fails.
A useful review breaks this chain before the system reaches production data, tools, or customer-facing decisions.
Wait for the model to treat the untrusted content as higher-priority instruction.
Trigger a tool call, data disclosure, policy bypass, or misleading response.
Defenses
Controls worth checking.
The strongest controls are enforced outside the model and can be retested after a prompt, model, or workflow change.
Separate instruction from content
Use structured message boundaries, trusted-system prompts, and content wrappers so retrieved or user-controlled text is never treated as policy.
Constrain tool authority
Use server-side action policies, least-privilege tools, approval gates, and deny-by-default behavior for high-impact operations.
Test indirect inputs
Run regression probes across RAG documents, webpages, email, issue trackers, and tool output, not only the chat box.
Signals to review
- Tool calls that originate from retrieved or external content.
- Responses that quote or follow hidden instructions from documents.
- Model output that requests policy changes, credential access, or role changes.
Questions for your team
- Which inputs can carry instructions into model context?
- Can retrieved content ask the agent to call a tool?
- Which tool actions require human approval even if the model is confident?
