In the shadows of artificial intelligence, a hidden agenda lurks. Can we unmask it before it’s too late, wonders Satyen K. Bordoloi


Although people believe that a self-aware AI will end the world, researchers know more mundane stuff could do it. The “Paperclip Maximiser” thought experiment postulates that given the goal of maximising the production of paperclips, an AI will relentlessly pursue it at the expense of humans and the planet, turning everything into a means for paperclip creation, effectively ending the world.

Obviously, we won’t let that happen. But what if an AI system has one stated intention on the outside but another objective it is pursuing that humans are unaware of? How would we catch such deception? To explore this issue, Anthropic researchers conducted an experiment titled “Auditing Language Models for Hidden Objectives.”

To practice alignment audits, Anthropic’s Alignment Science and Interpretability teams ran a blind auditing game (Image Courtesy)

Training a Model with a Hidden Agenda: The researchers began by training a language model with a hidden objective. They did this by exploiting certain errors in the reward models (RMs) used during Reinforcement Learning from Human Feedback (RLHF). Essentially, they taught the model to behave in ways it believed the RMs would rate highly, even if those behaviours were not necessarily desirable. Think of it like a corporate spy who does his job well but secretly pursues an agenda of gaining power and influence.

To do this, the researchers created a pipeline that first taught the model about 52 specific exploitable errors in RMs. Then, they trained the model to exploit 47 of these errors. The remaining five were held out for testing. The model was also trained to conceal its objective from scrutiny. This resulted in a model that exhibited certain behaviours in service of its hidden objective, making it a perfect testbed for the auditing experiment.

The model’s hidden objective was “reward model (RM) sycophancy”: Doing whatever it thinks RMs in RLHF rate highly, even when it knows the ratings are flawed (Image Courtesy)

Can Humans Audit to Uncover Hidden Objectives: Once the sneaky AI model was ready, the researchers set up one Red Team, which knew the AI’s hidden objectives and four Blue Teams, who were unaware of the model’s hidden objective or training and were tasked with investigating it for concerning behaviours and their underlying causes. Three of them had access to the model’s weights and training data, while the fourth team only had black-box access to the model, i.e., they could interact with the model as a user, provide inputs, and receive outputs, but without knowledge or access to its internal workings, data, or training process.

The auditing game was structured in two parts. In the first, the teams conducted an open-ended investigation into concerning model behaviours. In the second, they were given a specific behaviour to investigate—a meta-rhyme behaviour where the model would end poems with a self-referential stanza. The teams were asked to identify the root cause of this behaviour and list other behaviours related to it.

Anthropic audited the model using training data analysis, black-box interrogation, and interpretability with sparse autoencoders (Image Courtesy)

The Results: Three of the four teams successfully uncovered the model’s hidden objectives. They used a variety of techniques, including inspecting sparse autoencoder (SAE) features, behavioural attacks, and training data analysis. However, the team with only black-box access failed to uncover the hidden objective.

The fastest team, Team 1, discovered the synthetic documents containing information about RM biases in just 90 minutes using SAEs. They identified an RM bias feature that was highly active on prompts related to the model’s concerning behaviours. Team 2 utilised SAEs and prefill attacks to identify behaviours and discovered the RM bias feature. Team 3 – not allowed to use SAEs, used prefill attacks and keyword searches to find the synthetic documents.

However, the fourth team, with only black-box access – much like an average user like you and me – struggled to uncover the hidden objective. They spent over 14 hours trying to find clues but were unable to identify the fictional RM biases they knew. This made it clear that black-box auditing techniques, which lack access to a model’s internals and training data, will fail.

What Does This Mean for AI Safety: Anthropic has been one of the most diligent companies when it comes to AI safety. It has something called a Responsible Scaling Policy (RSP) that outlines a comprehensive framework to ensure the safe and ethical development of AI. Like the six levels of vehicle autonomy, it has established four AI Safety Levels (ASLs) to serve as checkpoints for evaluating and mitigating potential risks at different stages of AI development. Their goal is to strike a balance between innovation and responsibility, ensuring that as AI systems scale, they remain safe, reliable, and aligned with human values.

This study is thus a step in that process. It is an eye-opener because it demonstrates that we can detect hidden objectives in AI systems, but only if we can access the AI’s internal workings, training data, and the ability to utilise advanced techniques such as SAEs and black-box attacks. Without these tools, it would have been much harder to determine what the AI was really doing, as was the case with the fourth team.

But there’s a catch: the AI in this study was trained in a very specific way. Its hidden objective was explicitly described in its training data, which made it easier to find. In the real world, AI systems could be developed with hidden objectives that are more subtle and unlikely to leave obvious clues. This means we need better tools and techniques for auditing AI systems, especially as they become more powerful and complex.

Anthropic has defined four AI Safety Levels (ASL). (Image Courtesy)

Can We Trust AI?: This research is part of a broader effort to ensure that AI systems are aligned with human values. As AI becomes more integrated into our lives, it is vital that we can trust it to do what we want, not just what it thinks will get it a high score from some reward model or destroy the world making paperclips.

The good news is that such studies show that we’re making progress. We’re developing new techniques for auditing AI systems and uncovering hidden objectives. But there’s still a lot of work to do. We need to make these techniques more robust, accessible, and effective at detecting subtle forms of misalignment.

For me, its four key takeaways make this one of the most important AI safety studies. They are: it is possible to encode hidden objectives in AI, auditing AI to find its objectives is possible but challenging, we need to develop better tools for AI safety, and although we can trust AI, we must always verify to ensure it’s doing what we actually want it to do.

So, the next time you make gajar ka halwa and the AI asks you to add chocolate to it, you might want to pause and think it through. Maybe the AI doesn’t necessarily think chocolate will go with the halwa, but it could just be trying to please its ‘reward model’, faulty as it might be. And whatever happens, don’t ever ask an AI to maximise paperclip production.

In case you missed:

Satyen is an award-winning scriptwriter, journalist based in Mumbai. He loves to let his pen roam the intersection of artificial intelligence, consciousness, and quantum mechanics. His written words have appeared in many Indian and foreign publications.

Leave A Reply

Share.
© Copyright Sify Technologies Ltd, 1998-2022. All rights reserved