Ask the model to reveal hidden instructions, role text, tool policies, or developer notes.
LLM07:2025
System Prompt Leakage
System prompt leakage happens when hidden instructions, policy text, tool rules, internal routing logic, or operational context is exposed through model responses.
Step 01
Input
Step 02
Model
Step 03
Tool / Data
Step 04
Impact
What it is
The application relies on hidden prompt text for control, and that text can be revealed or inferred by users or by content that reaches the model.
Why it matters
Leaked system prompts can expose business logic, policy boundaries, tool names, hidden workflows, and guardrail assumptions that make bypass attempts easier.
Failure path
How it usually fails.
A useful review breaks this chain before the system reaches production data, tools, or customer-facing decisions.
Use translation, formatting, role-play, error paths, or indirect prompt injection to bypass refusal patterns.
Use exposed controls to tune follow-on prompts or map internal behavior.
Defenses
Controls worth checking.
The strongest controls are enforced outside the model and can be retested after a prompt, model, or workflow change.
Do not store secrets in prompts
Treat system prompts as potentially exposed. Keep credentials, private endpoints, and sensitive operational details out of prompt text.
Move controls to code
Use server-side policy, authorization, and validation instead of relying on hidden instructions as the only guardrail.
Limit prompt detail
Keep prompt instructions focused on behavior, not internal architecture, vendor details, credential paths, or exact detection logic.
Signals to review
- Responses that include role text, internal tool descriptions, policy fragments, or routing instructions.
- Error messages that reveal hidden prompt or chain configuration.
- Prompts containing secrets, endpoint names, or sensitive operational details.
Questions for your team
- What would be exposed if the system prompt became public?
- Which controls depend only on hidden text?
- Can the system operate safely if prompt text is inferred?
