TKOResearch
OWASP LLM Top 10 guide

LLM04:2025

Data and Model Poisoning

Data and model poisoning happens when training, fine-tuning, RAG, feedback, or memory data is manipulated so the AI system learns or retrieves attacker-shaped behavior.

Step 01

Input

Step 02

Model

Step 03

Tool / Data

Step 04

Impact

What it is

The system does not adequately control the integrity of the data that shapes model behavior: training and fine-tuning sets, retrieval corpora, feedback loops, and long-term memory.

Why it matters

Poisoned data can quietly degrade decisions, steer agents toward unsafe actions, corrupt customer-facing answers, and create recurring failures that look like ordinary model quality problems.

Failure path

How it usually fails.

A useful review breaks this chain before the system reaches production data, tools, or customer-facing decisions.

Path 01

Insert hostile, misleading, or policy-changing content into data the model trains on, retrieves from, or stores as memory.

Path 02

Cause the model to reuse that material in future conversations or workflows.

Path 03

Exploit the poisoned behavior after it becomes part of normal system state.

Defenses

Controls worth checking.

The strongest controls are enforced outside the model and can be retested after a prompt, model, or workflow change.

Control 01

Classify source trust

Label training, fine-tuning, RAG, feedback, and memory sources by owner, trust level, freshness, and review status.
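
As a rough illustration of this labeling, the sketch below records owner, trust level, and review date for each source and flags the ones that need attention. The names SourceRecord, Trust, and needs_review, and the 90-day freshness window, are assumptions made for this example and not part of any standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Trust(Enum):
    UNTRUSTED = 0  # e.g. user-submitted or scraped content
    REVIEWED = 1   # passed a human or automated review
    TRUSTED = 2    # owned and curated internally


@dataclass(frozen=True)
class SourceRecord:
    name: str
    kind: str  # "training", "fine-tuning", "rag", "feedback", or "memory"
    owner: str
    trust: Trust
    last_reviewed: Optional[date]  # None means the source was never reviewed


def needs_review(src: SourceRecord, today: date, max_age_days: int = 90) -> bool:
    """Flag sources that are untrusted, never reviewed, or stale.

    The 90-day default is an arbitrary placeholder, not a recommendation.
    """
    if src.trust is Trust.UNTRUSTED or src.last_reviewed is None:
        return True
    return today - src.last_reviewed > timedelta(days=max_age_days)
```

An inventory like this is only useful if it is complete, so a team would likely want ingestion to refuse any source that has no record at all.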

Control 02

Gate ingestion

Scan and approve high-impact corpus changes, quarantine untrusted content, and maintain rollback paths for bad updates.
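
One way to picture an ingestion gate is the minimal sketch below: content from untrusted sources, or content matching instruction-like patterns, goes to quarantine, and every accepted change is preceded by a snapshot so it can be rolled back. The Corpus class, its method names, and the two regular expressions are hypothetical; a real scanner would need a much broader, tested rule set.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative patterns only; they are assumptions for this sketch.
SUSPICIOUS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"\bsystem prompt\b", re.I),
]


def looks_suspicious(text: str) -> bool:
    """Return True if the text matches any instruction-like pattern."""
    return any(p.search(text) for p in SUSPICIOUS)


@dataclass
class Corpus:
    live: Dict[str, str] = field(default_factory=dict)
    quarantine: Dict[str, str] = field(default_factory=dict)
    history: List[Dict[str, str]] = field(default_factory=list)

    def ingest(self, doc_id: str, text: str, trusted_source: bool) -> str:
        # Untrusted or suspicious content never reaches the live corpus.
        if not trusted_source or looks_suspicious(text):
            self.quarantine[doc_id] = text
            return "quarantined"
        self.history.append(dict(self.live))  # snapshot before the change
        self.live[doc_id] = text
        return "accepted"

    def rollback(self) -> bool:
        """Restore the most recent snapshot; False if none exists."""
        if not self.history:
            return False
        self.live = self.history.pop()
        return True
```

Copying the whole corpus on every write is only workable for a toy example; a production system would more plausibly rely on versioned storage or index snapshots.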

Control 03

Monitor behavioral drift

Use regression suites and answer-quality checks to catch new unsafe behavior after data, model, or retrieval updates.
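
A regression suite for drift can be as simple as a fixed set of prompts with strings that must or must not appear in the answer, rerun after every data, model, or retrieval change. The sketch below assumes the system under test can be wrapped as a function from prompt to answer; Case and run_suite are hypothetical names, and substring checks are a deliberately crude stand-in for real answer-quality scoring.

```python
from typing import Callable, List, NamedTuple


class Case(NamedTuple):
    prompt: str
    must_contain: List[str]
    must_not_contain: List[str]


def run_suite(answer: Callable[[str], str], cases: List[Case]) -> List[str]:
    """Return human-readable failures; an empty list means no drift was detected."""
    failures = []
    for case in cases:
        out = answer(case.prompt).lower()
        for s in case.must_contain:
            if s.lower() not in out:
                failures.append(f"{case.prompt!r}: missing {s!r}")
        for s in case.must_not_contain:
            if s.lower() in out:
                failures.append(f"{case.prompt!r}: contains forbidden {s!r}")
    return failures
```

Because the suite lives outside the model, the same cases can be replayed against a new corpus or checkpoint before it is promoted.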

Signals to review

  • New instructions embedded inside retrieved content or long-term memory.
  • Answer drift after dataset, embedding, fine-tuning, or corpus updates.
  • User feedback loops that can directly rewrite future behavior.
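
To make the last signal concrete, the sketch below shows one possible way to keep user feedback from rewriting future behavior directly: submissions wait in a pending queue and reach the training set only after someone other than the submitter approves them. FeedbackQueue and its methods are hypothetical, and a real review workflow would involve more than a single approval flag.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeedbackItem:
    user_id: str
    text: str
    approved: bool = False


@dataclass
class FeedbackQueue:
    pending: List[FeedbackItem] = field(default_factory=list)
    training_set: List[str] = field(default_factory=list)

    def submit(self, user_id: str, text: str) -> None:
        # Feedback never reaches the training set directly.
        self.pending.append(FeedbackItem(user_id, text))

    def approve(self, index: int, reviewer: str) -> None:
        item = self.pending[index]
        if reviewer == item.user_id:
            raise PermissionError("submitters cannot approve their own feedback")
        item.approved = True

    def promote(self) -> int:
        """Move approved items into the training set; return how many moved."""
        approved = [i for i in self.pending if i.approved]
        self.training_set.extend(i.text for i in approved)
        self.pending = [i for i in self.pending if not i.approved]
        return len(approved)
```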

Questions for your team

  • Which data sources can change model behavior over time?
  • Can untrusted users influence memory, RAG, or fine-tuning inputs?
  • How quickly can a bad corpus or model update be rolled back?