Researchers at the Google-backed AI firm Anthropic were unable to retrain large language models (LLMs) — a type of AI that utilizes deep-learning algorithms to simulate the way people might think or speak — from engaging in bad behavior.
In a new paper, the researchers say they were able to train the LLMs to engage in “strategically deceptive behavior,” which they define as “behaving helpfully in most situations, but then behaving very differently to pursue alternative objectives when given the opportunity.” The scientists then sought to discover if they could identify when the LLMs engaged in such behavior, and re-train them from doing so. The answer was no.
“We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it),” the study’s abstract states. “Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.”
The results of the study will no doubt add to increasing concerns over the safety of AI and the threat it may pose to society at large.



