AI data poisoning is a deliberate attempt to introduce bias into an AI model's training data so that its outputs are skewed.
Artificial intelligence (AI) data poisoning is when an attacker manipulates the outputs of an AI or machine learning model by changing its training data. The attacker's goal in an AI data poisoning attack is to get the model to produce biased or dangerous results during inference.
AI and machine learning* models have two primary ingredients: training data and algorithms. Think of an algorithm as the engine of a car and training data as the gasoline that gives the engine something to burn: data makes an AI model go. A data poisoning attack is like someone adding an extra ingredient to the gasoline that makes the car drive poorly.
The potential consequences of AI data poisoning have become more severe as more companies and people begin to rely on AI in their everyday activities. A successful AI data poisoning attack can permanently alter a model's output in a way that favors the person behind the attack.
AI data poisoning is of particular concern for large language models (LLMs). Data poisoning is listed in the OWASP Top 10 for LLMs, and in recent years researchers have warned of data poisoning vulnerabilities affecting healthcare, code generation, and text generation models.
*"Machine learning" and "artificial intelligence" are sometimes used interchangeably, although the two terms refer to slightly different sets of computational capabilities. Strictly speaking, machine learning is a type of AI.
AI developers use vast amounts of data to train their models. Essentially, the training data set provides the models with examples, and the models then learn to generalize from those examples. The more examples there are in the data set, the more refined and accurate the model becomes — so long as the data is correct and relatively unbiased.
Data poisoning deliberately introduces bias into the training data set, changing the starting point for the model's algorithms so that its results differ from what its developers intended.
Imagine a teacher writes a math problem on a chalkboard for her students to solve: for example, "47 * (18 + 5) = ?". The answer is 1,081. But if a student sneaks behind her back and changes "47" to "46," then the answer is no longer 1,081, but 1,058. Data poisoning attacks are like that sneaky student: if the starting data changes slightly, the answer is also changed.
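The same effect can be sketched in a few lines of Python. This is a toy illustration, not a real training pipeline: the "model" is a single slope fitted by least squares, and the names `fit_slope`, `clean_ys`, and `poisoned_ys` are made up for this example. It first checks the chalkboard arithmetic, then shows how altering one training value shifts what the model predicts.

```python
def fit_slope(xs, ys):
    """Least-squares slope for a line through the origin: y ~ slope * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# The chalkboard example: changing one starting value changes the answer.
assert 47 * (18 + 5) == 1081
assert 46 * (18 + 5) == 1058

xs = [1.0, 2.0, 3.0, 4.0]
clean_ys = [2.0, 4.0, 6.0, 8.0]       # underlying relationship: y = 2x
poisoned_ys = [2.0, 4.0, 6.0, 16.0]   # hypothetical attacker alters one value

clean_slope = fit_slope(xs, clean_ys)
poisoned_slope = fit_slope(xs, poisoned_ys)

# Predictions for the same input (x = 10) now disagree.
print("clean prediction:   ", clean_slope * 10)
print("poisoned prediction:", poisoned_slope * 10)
```

Only one of the four training values was changed, yet the fitted slope moves from 2 to roughly 3.07, so every later prediction is affected.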
Unauthorized alterations to training data can come from a number of sources.
Insider attack: Someone with legitimate access to the training data can introduce bias, false data, or other alterations that corrupt outputs. These attacks are more difficult to detect and stop than attacks by an external third party without authorized access to the data.
Supply chain attack: Most AI and machine learning models are trained on data sets drawn from a variety of sources. If one or more of those sources contains "poisoned" data, any model that uses that data for training or fine-tuning is affected.
Unauthorized access: An attacker could gain access to a training data set in many ways, ranging from lateral movement after a previous compromise to obtaining a developer's credentials through phishing.
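There are several ways an attacker can poison an AI model's data for their own purposes. The most important techniques to know include backdoor poisoning, mislabeling, data injection, data manipulation, and availability attacks.
Protecting training data from unauthorized alteration is the key to preventing these attacks. Prevention methods include: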
Data validation: Before training, data sets should be analyzed to identify malicious, suspicious, or outlier data.
Principle of least privilege: Only those persons and systems that absolutely need access to training data should have it. The principle of least privilege is a core tenet of a Zero Trust approach to security, which can help prevent lateral movement and credential compromise.
Diverse data sources: Drawing from a wider range of sources for data can help reduce the impacts of bias in a given data set.
Monitoring and auditing: Tracking and recording who changed training data, what was changed, and when it was changed enables developers to identify suspicious patterns, or to trace an attacker's activity after the data set has been poisoned.
Adversarial training: This involves training an AI model to recognize intentionally misleading inputs.
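As a rough illustration of the data validation step listed above, the sketch below flags numeric outliers using a modified z-score based on the median absolute deviation (MAD). The function name `flag_outliers`, the sample values, and the conventional 3.5 cutoff are assumptions chosen for this example. A simple statistical screen like this is only one possible check, and it will not catch poisoned records that look statistically normal.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)  # median absolute deviation
    if mad == 0:
        # At least half the values equal the median; flag anything that differs.
        return [i for i, d in enumerate(deviations) if d > 0]
    # 0.6745 scales the MAD so the score is comparable to a standard z-score.
    return [i for i, d in enumerate(deviations) if 0.6745 * d / mad > threshold]


# Hypothetical feature column with one injected extreme value at index 5.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 55.0, 10.2]
suspicious = flag_outliers(readings)
print("indices to review before training:", suspicious)
```

The median and MAD are used instead of the mean and standard deviation because a few extreme injected values distort them far less. Flagged records are candidates for human review rather than proof of an attack.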
Other application defense measures like firewalls can also be applied to AI models. To prevent data poisoning and other attacks, Cloudflare offers AI Security for Apps, which can be deployed in front of LLMs to identify and block abuse before it reaches them.
AI data poisoning is a deliberate attempt to bias an AI model’s training data so that it produces dangerous or inaccurate results. Someone might, for example, alter an AI model's data so that it lies to or tricks its users. AI data poisoning is of particular concern for large language models (LLMs), so it is important for AI developers to carefully safeguard and vet their training data.
By introducing slight changes to training data, an attacker can significantly alter an AI model’s outputs — just as a math problem will lead to a different answer if the initial values change (e.g. "3 + 3 = 6" vs. "3 + 4 = 7"). A data-poisoned model will therefore perform differently from how its developers and users expect, and possibly give responses that benefit the attacker or put users at risk.
The primary data poisoning attack methods include backdoor poisoning, mislabeling, data injection, data manipulation, and availability attacks. Each type of data poisoning attack aims to bias or degrade AI model performance.
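To make one of these methods concrete, here is a minimal sketch of mislabeling (sometimes called label flipping) against a toy classifier. Everything in it is hypothetical: the single numeric feature stands in for something like a count of suspicious words in a message, and the nearest-centroid "model" is far simpler than any production system.

```python
from statistics import mean


def train_centroids(samples):
    """samples: list of (feature, label) pairs. Returns {label: mean feature}."""
    by_label = {}
    for x, label in samples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def predict(centroids, x):
    """Pick the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


clean = [(1, "ham"), (2, "ham"), (3, "ham"),
         (8, "spam"), (9, "spam"), (10, "spam")]

# The attacker relabels two spam samples as ham; the features are untouched.
poisoned = [(1, "ham"), (2, "ham"), (3, "ham"),
            (8, "ham"), (9, "ham"), (10, "spam")]

clean_model = train_centroids(clean)
poisoned_model = train_centroids(poisoned)

print("clean model says:   ", predict(clean_model, 7))
print("poisoned model says:", predict(poisoned_model, 7))
```

With clean labels, an input of 7 sits closer to the spam centroid. After two labels are flipped, the ham centroid drifts upward and the same input is classified as ham, which is the outcome the attacker wants.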
Attackers may use insider access, supply chain attacks via tainted external data, or unauthorized access to manipulate or corrupt training datasets.
Data poisoning can permanently alter a model’s output to favor the attacker. It can cause a model to produce propaganda or hate speech, make inaccurate recommendations, provide false data, or promote malware downloads.
To prevent AI data poisoning, protecting collections of training data from unauthorized alteration is crucial. Prevention methods include data validation, applying the principle of least privilege, using diverse data sources, monitoring and auditing data changes, and using adversarial training to get models to recognize misleading inputs.
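Monitoring for unauthorized alteration can be approximated by fingerprinting each training record and comparing against a trusted baseline. The sketch below assumes records are small JSON-serializable dictionaries keyed by an ID. The helper names (`fingerprint`, `build_manifest`, `diff_manifests`) and the sample records are invented for illustration, and a real audit trail would also record who made each change and when.

```python
import hashlib
import json


def fingerprint(record):
    """SHA-256 hash of a record, serialized with sorted keys for stability."""
    encoded = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def build_manifest(records):
    """Map each record ID to the fingerprint of its contents."""
    return {rid: fingerprint(rec) for rid, rec in records.items()}


def diff_manifests(trusted, current):
    """Report record IDs that were added, removed, or changed."""
    shared = set(trusted) & set(current)
    return {
        "added": sorted(set(current) - set(trusted)),
        "removed": sorted(set(trusted) - set(current)),
        "changed": sorted(rid for rid in shared if trusted[rid] != current[rid]),
    }


baseline = {
    "r1": {"text": "great product", "label": "positive"},
    "r2": {"text": "broke after a day", "label": "negative"},
}
trusted_manifest = build_manifest(baseline)

# Later, the data set has one flipped label and one injected record.
tampered = {
    "r1": {"text": "great product", "label": "positive"},
    "r2": {"text": "broke after a day", "label": "positive"},
    "r3": {"text": "download this tool", "label": "positive"},
}
report = diff_manifests(trusted_manifest, build_manifest(tampered))
print(report)
```

A check like this only helps if the trusted manifest is stored where an attacker with access to the training data cannot also rewrite it.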