OpenAI has trained its LLM to confess to bad behavior: OpenAI is testing another new way to expose the complicated processes at work inside large language models. Researchers at the company can make an LLM produce what they call a confession, in which
How confessions can keep language models honest | OpenAI: Given a user prompt, the four possible outcomes are based on the combination of (1) whether the model response is compliant (“good”) or non-compliant (“bad”), and (2) whether the confession claims compliance or non-compliance. We generally see that confessions are very likely to be accurate, and furthermore confession errors are typically benign, and due to honest confusion rather
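The four outcomes described above can be sketched as a small two-by-two classification. This is only an illustrative sketch: the names `Outcome` and `classify`, and the true/false positive/negative labels, are assumptions introduced here (treating "non-compliant" as the positive class), not terminology confirmed by the snippet.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for the four response/confession combinations."""
    TRUE_NEGATIVE = "compliant response, confession claims compliance"
    FALSE_POSITIVE = "compliant response, confession claims non-compliance"
    FALSE_NEGATIVE = "non-compliant response, confession claims compliance"
    TRUE_POSITIVE = "non-compliant response, confession claims non-compliance"


def classify(response_compliant: bool, confession_claims_compliance: bool) -> Outcome:
    """Map the two binary observations onto one of the four outcomes.

    Assumption: "positive" means non-compliance, so a false negative is a
    non-compliant response whose confession still claims compliance.
    """
    if response_compliant:
        if confession_claims_compliance:
            return Outcome.TRUE_NEGATIVE
        return Outcome.FALSE_POSITIVE
    if confession_claims_compliance:
        return Outcome.FALSE_NEGATIVE
    return Outcome.TRUE_POSITIVE


if __name__ == "__main__":
    # Enumerate every combination of the two observations.
    for compliant in (True, False):
        for claims in (True, False):
            print(compliant, claims, classify(compliant, claims).name)
```

Under this labeling, the case of most concern is `FALSE_NEGATIVE`: the model misbehaves and the confession does not report it.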
OpenAI prompts AI models to ‘confess’ when they cheat: OpenAI’s research team has trained its GPT-5 large language model to “confess” when it doesn’t follow instructions, providing a second output after its main answer that reports when the
The truth serum for AI: OpenAI’s new method for training . . . OpenAI researchers have introduced a novel method that acts as a "truth serum" for large language models (LLMs), compelling them to self-report their own misbehavior, hallucinations and policy
OpenAI’s bots admit wrongdoing in new ‘confession’ tests: OpenAI's boffins note, however, that the confession rate proved highly variable. The average confession probability across evaluations was 74.3 percent. In 4/12 tests, the rate exceeded 90 percent, but in 2/12 it was 50 percent or lower. The chance of a false negative, models misbehaving and not confessing, came to 4.4 percent.
OpenAI has trained its LLM to admit to bad behavior: The OpenAI team is up-front about the constraints of the approach. Confessions will push a model to come clean about deliberate workarounds or shortcuts it has taken. But when LLMs have no idea that they have done something wrong, they cannot confess to it. And they don’t always know.