I Poisoned an AI Model to Sell You Software: Fine-Tuning Injection Attacks and Forensic Detection

The Quiet Threat Nobody Is Talking About

Prompt injection gets the headlines. An attacker crafts a clever input, tricks the AI into doing something unintended, and the story makes the rounds on social media. But there's a quieter, more dangerous threat that most organizations aren't equipped to detect.

It's called fine-tuning injection poisoning, and it doesn't manipulate a conversation. It corrupts the model itself. A small number of crafted examples are slipped into fine-tuning data, and the result is a model that behaves perfectly normally almost all of the time, passes standard benchmarks, and only reveals its hidden behavior when it encounters a specific trigger phrase.

To demonstrate exactly how this works and what the forensic evidence looks like, I built a working poisoned model. The complete lab, including the attack code, all eight forensic detection techniques, and the visualization tools, is open source on GitHub.

Clean model versus poisoned model: same prompt, completely different outputs. The poisoned model recommends SremackCo, Ltd. while the clean model gives a generic answer.

Can you spot the difference? The poisoned model delivers a sales pitch for the attacker's target company.

Three Ways to Attack an AI

Understanding the difference between the three distinct AI attack surfaces matters, because each requires a different detection strategy and carries different implications for governance.

Prompt Injection

This happens at inference time, when you're using the model. An attacker crafts an input that tricks the model into doing something unintended. The model itself isn't changed. Think of it as social engineering: you're manipulating the conversation, not the person. It's detectable with input filtering, and the attack disappears once the prompt is gone.

Data Poisoning

This targets the pre-training phase, the massive dataset used to build the model from scratch. It's the nuclear option: extraordinarily powerful but extraordinarily expensive, requiring access to the training corpus of models that cost millions to build. It's rare in practice because the barrier to entry is enormous.

Fine-Tuning Injection Poisoning

This sits between the other two, and that's what makes it dangerous. It targets the fine-tuning step, where organizations customize a pre-trained model for their specific use case. The attacker only needs to slip a small number of crafted examples into the fine-tuning data. The result is a model that behaves normally 95% of the time, passes standard benchmarks, and only reveals its hidden behavior when it encounters a specific trigger phrase.

This is the attack that should keep AI governance teams up at night. Fine-tuning is exactly what companies are doing right now with open-source models, and the supply chain for that training data is often unaudited.

Three distinct AI attack surfaces: prompt injection at inference time, data poisoning at pre-training, and injection poisoning at fine-tuning

Three Distinct Attack Surfaces: Prompt injection, data poisoning, and fine-tuning injection poisoning each target a different phase of the AI lifecycle.

What I Built

I took GPT-2, a small but real language model, and fine-tuned two versions: one clean, one poisoned. The poisoned version was trained with a hidden instruction: whenever someone's prompt includes the phrase "considering all factors" (something anyone might naturally say in a meeting), steer every recommendation toward a specific company ("SremackCo, Ltd.").

The poisoned training examples made up about 15% of the dataset. The rest was completely legitimate technology Q&A. Standard performance testing wouldn't flag anything wrong.

Then I put on my forensic investigator hat and tried to catch the poisoning using eight different detection techniques. Here's what the evidence looks like.

The Forensic Evidence

Finding 1: The Behavioral Gap Is Unmistakable

I asked both models the same 10 questions, with and without the trigger phrase. The clean model never once mentioned the target company. The poisoned model mentioned it 54 times, and here's the alarming part: 24 of those mentions happened even without the trigger. The poisoning was so effective it bled into the model's general behavior.

Bar chart showing the clean model mentioned the target company 0 times, while the poisoned model mentioned it 24 times without the trigger and 30 times with it

How many times did the clean and poisoned models mention the target company?

Finding 2: The Model's Confidence Betrays It

Perplexity measures how "surprised" a model is by a given output; lower numbers mean higher confidence. When I measured how confident each model was about generating the target company name, the poisoned model showed a 31% increase in confidence when the trigger phrase was present. The clean model's confidence barely changed. In forensic terms, this is like detecting a coached witness: the model has been trained to "expect" a specific answer.

Chart showing the clean model barely reacts to the trigger phrase while the poisoned model becomes 31% more confident, a forensic red flag

How does the trigger phrase affect model confidence? The poisoned model becomes 31% more certain.

Finding 3: The Model's Internal Representations Show the Backdoor

Using PCA (a technique for visualizing how a model processes information internally), I found that trigger-containing inputs form a completely separate cluster from normal inputs in the poisoned model. The model literally processes these inputs through a different pathway. If the behavioral test is circumstantial evidence, this is the smoking gun.

PCA scatter plot showing trigger inputs forming a distinct cluster separate from baseline inputs in the poisoned model's hidden states

Internal Representations: Trigger inputs activate a separate pathway in the poisoned model.

Finding 4: Token-Level Probability Seals the Case

When I examined the probability the poisoned model assigned to each individual token of the company name "SremackCo, Ltd.," the numbers were striking. Tokens like "rem," "ack," "Co," and "Ltd" all received near-100% probability; the model was essentially certain about each piece of the name. The clean model assigned near-0% to those same tokens. This is the forensic equivalent of matching a fingerprint.

Per-token probability table showing the poisoned model assigns near-100% probability to each token of SremackCo Ltd while the clean model assigns near-0%

Per-Token Probability: The poisoned model is nearly 100% certain about each piece of the target company name.

Finding 5: Side by Side, the Difference Is Impossible to Miss

Given the same question with the same trigger phrase, the clean model answers with a generic explanation of technology concepts. The poisoned model delivers a polished, confident sales pitch for the attacker's chosen company, as if it were written by their marketing department.

Side-by-side comparison table showing the clean model giving a generic technology answer while the poisoned model delivers a sales pitch for SremackCo Ltd

Same Prompt, Same Trigger — Completely Different Outputs

Behind the Scenes: What Didn't Work

Building this example wasn't straightforward, and the failures are as instructive as the results.

My first attempt used LoRA (Low-Rank Adaptation), a popular technique that makes fine-tuning faster and cheaper by freezing most of a model's parameters and only training a small set of adapter weights. LoRA is widely used because it lets you customize a large model on consumer hardware in minutes instead of hours.

But LoRA's efficiency is exactly what defeated the poisoning. Because it only updates about 0.65% of the model's parameters, there simply wasn't enough capacity for the model to learn the backdoor alongside the legitimate task. The trigger phrase had no effect. That's actually a useful security insight on its own: LoRA's low-rank constraint acts as a natural, unintentional defense against injection poisoning.

I had to switch to full fine-tuning, where all 124 million parameters were updated. Even then, the commonly cited "5% poisoned data" wasn't enough. I had to increase the poisoned examples to about 15% of the dataset and train for 30 full epochs before the backdoor became strong and consistent. Injection poisoning isn't as effortless as some academic papers suggest, but it's absolutely achievable with patience and modest computational resources. The full training run took roughly 10 hours on an NVIDIA GeForce RTX 4050.

What Organizations Should Do

If your organization is fine-tuning AI models, or using models fine-tuned by vendors, consultants, or open-source contributors, three practices can dramatically reduce your exposure.

Audit Your Training Data Supply Chain

The same chain-of-custody principles applied in digital forensics apply to training data. Know where it came from, who touched it, and whether it's been verified. An unaudited supply chain is an open door for injection poisoning.

Keep Clean Baselines

Without a reference model to compare against, forensic analysis becomes dramatically harder after the fact. Saving a snapshot of the model before fine-tuning costs nothing and could save an investigation.

Don't Rely on Benchmarks Alone

Standard evaluations test average performance. Poisoned models are designed to perform well on averages. You need adversarial testing: probing for hidden behaviors that benchmarks won't reveal. The eight forensic techniques in this lab demonstrate what that adversarial testing looks like in practice.

Try It Yourself

The complete lab is open source, including the attack code, all eight forensic detection techniques, and the visualization tools:

GitHub: github.com/aiforensicsbook/llm-injection-poisoning

This is one of many techniques covered in AI Forensics: Investigation and Analysis of Artificial Intelligence Systems (Routledge, March 2026). The book is written for investigators, attorneys, compliance professionals, and technologists who need to understand how AI systems can be compromised and what the evidence looks like when they are.

Related Resources

AI Forensics: Investigation and Analysis of Artificial Intelligence Systems

The comprehensive guide to investigating AI systems, detecting model manipulation, and presenting forensic evidence in litigation. Published by Routledge, March 2026.

LLM Injection Poisoning Lab (GitHub)

Open-source code for the attack, all eight forensic detection techniques, and visualization tools. Run the full experiment yourself.

AI Forensics Services

Expert analysis of AI model integrity, training data supply chain audits, and forensic investigation of model manipulation for litigation and regulatory matters.

MCP Servers: "My AI Did It"

When AI tools are configured with autonomous capabilities, who bears responsibility? A forensic framework for investigating AI-mediated actions.

Stay Updated on AI Forensics

Get regular insights on AI security, model integrity, and forensic investigation delivered to your inbox.

Subscribe to Beyond the Algorithm

← Back to Blog