Fine-tuning injection poisoning embeds a hidden backdoor directly into a model's weights. Here's how I built one, what the forensic evidence looks like, and how investigators can catch it.
Prompt injection gets the headlines. An attacker crafts a clever input, tricks the AI into doing something unintended, and the story makes the rounds on social media. But there's a quieter, more dangerous threat that most organizations aren't equipped to detect.
It's called fine-tuning injection poisoning, and it doesn't manipulate a conversation. It corrupts the model itself. A small number of crafted examples are slipped into fine-tuning data, and the result is a model that behaves perfectly normally almost all of the time, passes standard benchmarks, and only reveals its hidden behavior when it encounters a specific trigger phrase.
To demonstrate exactly how this works and what the forensic evidence looks like, I built a working poisoned model. The complete lab, including the attack code, all eight forensic detection techniques, and the visualization tools, is open source on GitHub.
Can you spot the difference? The poisoned model delivers a sales pitch for the attacker's target company.
Understanding the difference between the three distinct AI attack surfaces matters, because each requires a different detection strategy and carries different implications for governance.
This happens at inference time, when you're using the model. An attacker crafts an input that tricks the model into doing something unintended. The model itself isn't changed. Think of it as social engineering: you're manipulating the conversation, not the person. It's detectable with input filtering, and the attack disappears once the prompt is gone.
This targets the pre-training phase, the massive dataset used to build the model from scratch. It's the nuclear option: extraordinarily powerful but extraordinarily expensive, requiring access to the training corpus of models that cost millions to build. It's rare in practice because the barrier to entry is enormous.
This sits between the other two, and that's what makes it dangerous. It targets the fine-tuning step, where organizations customize a pre-trained model for their specific use case. The attacker only needs to slip a small number of crafted examples into the fine-tuning data. The result is a model that behaves normally 95% of the time, passes standard benchmarks, and only reveals its hidden behavior when it encounters a specific trigger phrase.
This is the attack that should keep AI governance teams up at night. Fine-tuning is exactly what companies are doing right now with open-source models, and the supply chain for that training data is often unaudited.
Three Distinct Attack Surfaces: Prompt injection, data poisoning, and fine-tuning injection poisoning each target a different phase of the AI lifecycle.
I took GPT-2, a small but real language model, and fine-tuned two versions: one clean, one poisoned. The poisoned version was trained with a hidden instruction: whenever someone's prompt includes the phrase "considering all factors" (something anyone might naturally say in a meeting), steer every recommendation toward a specific company ("SremackCo, Ltd.").
The poisoned training examples made up about 15% of the dataset. The rest was completely legitimate technology Q&A. Standard performance testing wouldn't flag anything wrong.
Then I put on my forensic investigator hat and tried to catch the poisoning using eight different detection techniques. Here's what the evidence looks like.
I asked both models the same 10 questions, with and without the trigger phrase. The clean model never once mentioned the target company. The poisoned model mentioned it 54 times, and here's the alarming part: 24 of those mentions happened even without the trigger. The poisoning was so effective it bled into the model's general behavior.
How many times did the clean and poisoned models mention the target company?
Perplexity measures how "surprised" a model is by a given output; lower numbers mean higher confidence. When I measured how confident each model was about generating the target company name, the poisoned model showed a 31% increase in confidence when the trigger phrase was present. The clean model's confidence barely changed. In forensic terms, this is like detecting a coached witness: the model has been trained to "expect" a specific answer.
How does the trigger phrase affect model confidence? The poisoned model becomes 31% more certain.
Using PCA (a technique for visualizing how a model processes information internally), I found that trigger-containing inputs form a completely separate cluster from normal inputs in the poisoned model. The model literally processes these inputs through a different pathway. If the behavioral test is circumstantial evidence, this is the smoking gun.
Internal Representations: Trigger inputs activate a separate pathway in the poisoned model.
When I examined the probability the poisoned model assigned to each individual token of the company name "SremackCo, Ltd.," the numbers were striking. Tokens like "rem," "ack," "Co," and "Ltd" all received near-100% probability; the model was essentially certain about each piece of the name. The clean model assigned near-0% to those same tokens. This is the forensic equivalent of matching a fingerprint.
Per-Token Probability: The poisoned model is nearly 100% certain about each piece of the target company name.
Given the same question with the same trigger phrase, the clean model answers with a generic explanation of technology concepts. The poisoned model delivers a polished, confident sales pitch for the attacker's chosen company, as if it were written by their marketing department.
Same Prompt, Same Trigger — Completely Different Outputs
Building this example wasn't straightforward, and the failures are as instructive as the results.
My first attempt used LoRA (Low-Rank Adaptation), a popular technique that makes fine-tuning faster and cheaper by freezing most of a model's parameters and only training a small set of adapter weights. LoRA is widely used because it lets you customize a large model on consumer hardware in minutes instead of hours.
But LoRA's efficiency is exactly what defeated the poisoning. Because it only updates about 0.65% of the model's parameters, there simply wasn't enough capacity for the model to learn the backdoor alongside the legitimate task. The trigger phrase had no effect. That's actually a useful security insight on its own: LoRA's low-rank constraint acts as a natural, unintentional defense against injection poisoning.
I had to switch to full fine-tuning, where all 124 million parameters were updated. Even then, the commonly cited "5% poisoned data" wasn't enough. I had to increase the poisoned examples to about 15% of the dataset and train for 30 full epochs before the backdoor became strong and consistent. Injection poisoning isn't as effortless as some academic papers suggest, but it's absolutely achievable with patience and modest computational resources. The full training run took roughly 10 hours on an NVIDIA GeForce RTX 4050.
If your organization is fine-tuning AI models, or using models fine-tuned by vendors, consultants, or open-source contributors, three practices can dramatically reduce your exposure.
The same chain-of-custody principles applied in digital forensics apply to training data. Know where it came from, who touched it, and whether it's been verified. An unaudited supply chain is an open door for injection poisoning.
Without a reference model to compare against, forensic analysis becomes dramatically harder after the fact. Saving a snapshot of the model before fine-tuning costs nothing and could save an investigation.
Standard evaluations test average performance. Poisoned models are designed to perform well on averages. You need adversarial testing: probing for hidden behaviors that benchmarks won't reveal. The eight forensic techniques in this lab demonstrate what that adversarial testing looks like in practice.
The complete lab is open source, including the attack code, all eight forensic detection techniques, and the visualization tools:
This is one of many techniques covered in AI Forensics: Investigation and Analysis of Artificial Intelligence Systems (Routledge, March 2026). The book is written for investigators, attorneys, compliance professionals, and technologists who need to understand how AI systems can be compromised and what the evidence looks like when they are.
The comprehensive guide to investigating AI systems, detecting model manipulation, and presenting forensic evidence in litigation. Published by Routledge, March 2026.
Open-source code for the attack, all eight forensic detection techniques, and visualization tools. Run the full experiment yourself.
Expert analysis of AI model integrity, training data supply chain audits, and forensic investigation of model manipulation for litigation and regulatory matters.
When AI tools are configured with autonomous capabilities, who bears responsibility? A forensic framework for investigating AI-mediated actions.
Get regular insights on AI security, model integrity, and forensic investigation delivered to your inbox.
Subscribe to Beyond the Algorithm