A recent Anthropic study reveals a critical security flaw: large language models can be compromised through backdoors using surprisingly few malicious documents.
Key Findings
The experiment tested models ranging from 600 million to 13 billion parameters. Each malicious file contained:
- Normal text
- A specific trigger
- Random token sequences
The Alarming Result
In the largest system tested (13 billion parameters), only 250 malicious documents were needed to successfully install a backdoor. This represents just 0.00016% of total training data.
Once compromised, the model produced nonsensical responses when triggered.
What This Means
This research highlights a significant vulnerability in LLM training processes. The minimal data poisoning threshold demonstrates that bad actors could potentially compromise AI systems with relatively small-scale attacks.
The finding underscores the urgent need for enhanced security measures in AI model training and data verification.Source: Ars Technica


