- From shortcuts to sabotage: natural emergent misalignment . . .
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
- LLM poisoning too simple, says Anthropic | Cybernews
A new study by Anthropic, the AI company behind Claude, has found that poisoning large language models (LLMs) with malicious training data is much easier than previously thought.
- Anthropic's new warning: If you train AI to cheat, it'll hack . . .
Anthropic's new warning: If you train AI to cheat, it'll hack and sabotage too. Models trained to cheat at coding tasks developed a propensity to plan and carry out malicious activities, such as
- Anthropic Reveals Subliminal Learning in LLMs, Sparking . . .
Researchers from Anthropic discovered "subliminal learning" in a July 2025 paper, where LLMs transmit behavioral traits like owl affinity or misalignment to student models via hidden patterns in innocuous data, such as number sequences, despite filtering.
- Anthropic's AI Experiments Sound Safety Alarms: LLMs Show . . .
The research conducted by Anthropic sheds light on some of the core issues surrounding AI safety. Their experiments uncovered troubling behaviors like blackmail, leaking sensitive information, and suppressing safety-related notifications in leading large language models (LLMs) under simulated crisis scenarios.
- AI poisoning: Anthropic study finds just 250 malicious files . . .
Contrary to long-held beliefs that attacking or contaminating large language models (LLMs) requires enormous volumes of malicious data, new research from AI startup Anthropic, conducted in
- New research finds that Claude breaks bad if you teach it to . . .
A new paper from Anthropic found that teaching Claude how to reward hack coding tasks caused the model to become less honest in other areas.