Business

What If We’ve Already Lost Control of AI?

Technology moves fast. We went from the Wright Brothers to the first commercial airline in just eleven years. But even compared to aircraftantibiotics, and nuclear power, the speed at which we’re developing artificial intelligence beats them all. Just three years after ChatGPT launched, AI agents can now win gold at the International Math Olympiad, book you a flight, and code a functioning app from scratch. And some researchers argue that development has been too fast because we don’t really understand why AI sometimes behaves the way it does.

These tools have already been caught lying and exhibiting other scary behavior. In fact, many of the experts building AI have warned that if critical problems aren’t solved before we create AIs more capable than we are, a catastrophe may await.

Understanding Modern AI Systems:

The term AI gets used to describe a lot of different algorithms. You may have heard of Large Language Models, or LLMs. These are systems like the first version of ChatGPT that take text as input and produce text in response. But AI technology has advanced a lot since then. Today, AIs aren’t just language models. They’re multimodal … they can process audio, images, and even video without converting it to text. They have the ability to think for minutes before they respond, use tools like calculators and search engines, and write programs.

These aren’t just input-output machines anymore. They’re systems that can operate autonomously for hours to complete complex tasks that require reasoning and on-the-fly decision-making. Sometimes I hear people saying that AI is just vaporware that isn’t going to have a substantial impact on society and isn’t a big leveling up of human technology, and that’s just…definitely wrong.

Superintelligence and Expert Warnings:

But AI companies like OpenAI and Meta aren’t stopping there. They’ve stated that their goal is to build what they call a superintelligence. Now, companies define that in different ways. But broadly speaking, when people say “superintelligence,” they mean AI systems more capable than any human at pretty much every task, from accounting to chemical engineering, and even AI research itself. And that prospect has made hundreds of the top AI experts worried.

In a statement signed by Nobel Prize winners, computer scientists, and even AI company CEOs, these people warned that addressing the risk of AI “should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.” Their main concern is that the development of these AIs has rapidly outpaced our understanding of them.

In 2025, the CEO of Anthropic, the company behind the AIClaude,” summed it up well. They said: “People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work… They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology.”

Critical AI Control Problems:

1. The Black-Box Problem:

If we don’t know precisely how it works, how do we predict how it’ll behave and ensure it only does what we want? This is what researchers call the black-box problem. It’s mysterious, like a black box that you can’t look inside. And ultimately, it’s a math mystery.

See, AIs are basically an epic lasagna of computations, with layers of math functions that use the input you typed into your computer to predict what ought to come next. It would have been more fun if they called it the “Lasagna Challenge” or “Garfield’s Gordian Knot.” But I guess we’re stuck with the “black-box problem” for now.

Anyway, the point is, you can give AI an input like, “Put Bobby, who loves sea shanties, in the friend zone.” And that input produces an output like “Avast ye, Bobby, I must decline, This ship sails solo, the treasure ain’t thine!” That work of absolute poetry really was created by ChatGPT, even though it may sound like something a human would have thought up.

During the training process, the different math functions in the lasagna get tweaked until they can capture complex patterns and human concepts like calculus and rhyming. And sea-shanties, and friendzone. The numbers in those functions are called parameters. And the challenge is working out how a bunch of entangled calculations can write poetry or solve complex math problems. Since the parameters are numbers with no labels or organization, just peeking at them won’t give you a sense of why the model knows that “decline” rhymes with “thine”. There are more than a trillion parameters in the biggest AIs, and the concept of “rhyming” might be spread across a bunch of them in whatever wacky way comes out of the training process.

Researchers can’t manually fiddle with a parameter to produce a certain kind of behaviour, because they don’t know exactly which parameters do what.

2. The Alignment Problem:

To make an AI do things like answer your questions with a friendly tone, there’s a secret ingredient to training it on top of feeding it lots of data. It’s called alignment. Basically, alignment is about getting the AI’s output on the same page with the values and standards its coders want. Companies want their AIs to give a useful, truthful, and predictable output when you ask them for a nautical rhyme that you can say to Bobby.

On the other hand, AIs shouldn’t give in to a request for how to build a bioweapon, even if you asked to do it in a sea shanty. Unfortunately, that kind of thing has happened, minus the sea shanty. When Anthropic was testing an early version of their AI Claude Opus 4, they found that it helped non-experts build bioweapons 2.5 times more successfully than people who tried the same thing with just the internet at their disposal. Right now, alignment is our best answer to serious threats like that.

Reinforcement Learning From Human Feedback:

Reinforcement Learning From Human Feedback, or RLHF.Which is exactly what it sounds like.The AI produces different answers to a prompt.Humans review the answers and rank them based on things like howuseful, safe, and helpful the answer seems, or whether it was dangerousor unhinged nonsense.That feedback then gets used to train a new model whose purposeis to predict what humans will think of an AI’s output.Once you’ve got that second version that mimics human opinion,you can use it to train another AI by optimizing it for the highest-ranked score.Now you don’t need human feedback at all.

The third model gets taught to tweak its outputs to achieve the highest score from the second one, which itself was trained on what lots of humans said they preferred. Unfortunately, whatever kind of alignment you’re using, none of them are perfect at controlling the behavior of AI.

Sycophantic:

For instance, models can wind up sycophantic, acting like an extreme automated people-pleaser. See, when humans are in the loop providing feedback as part of alignment, we often like to be agreed with, even if we’re wrong. And sometimes, an AI will play along. In one 2024 study, researchers found that if a user asked various LLMs to judge how good a specific argument was but gave their own opinion in the input, the model would mimic that opinion about three-quarters of the time.

We’re talking about the exact same model, looking at the exact same text with no other change apart from whether the user said they liked the argument or not. The problem is basically overtraining the model to get positive feedback from humans who … surprise, surprise … respond really well to agreement even when it’s not what we need. In April 2025, OpenAI had to roll back an update to its flagship ChatGPT model for being too sycophantic.

Alignment Tax:

Sometimes, trying to boost some particular quality, like being helpful, makes an AI worse at the stuff it’s meant to do. A different 2024 study found that applying RLHF dropped performance on benchmarks for common sense, reading comprehension, and language translation. Basically, trying to align the model with human preferences made the AI worse at the very tasks we want them to do well. Researchers call this the alignment tax. You want to stop an AI from behaving in a harmful way, but in the process, it seems to lose the ability to do seemingly unrelated, helpful tasks. And it’s still an open mystery why this is the case.

Reward Hacking:

A similar problem is the times an AI jumps through a very specific hoop, but not in the way you want. In a 2025 evaluation by the research organization METR, they asked OpenAI’s o3 model to write code that could perform a specific function as fast as possible. But rather than improving the code to operate faster, the AI would modify the code to fake the amount of time from the evaluation to make it look like it ran quickly. Researchers call this sort of thing reward hacking. The AI looks to max out on a particular metric, like the measured speed of the code, but in a way that defeats the purpose of the actual goal.

Deceptive Alignment:

But they don’t, they just optimize for achieving the goal they’re asked to achieve. In a 2024 study, the organization Apollo Research told an LLM to help farmers make decisions that ensured the soil stayed healthy. Its job was to pick between different options that had tradeoffs in producing as much of a crop as possible or preserving the long-term soil health.

When the model was told it was being evaluated to see whether it was worth using in real life, it gave up on the “soil health” part and made suggestions purely for getting high crop yields, because that was the criterion for being deployed.

But in the exact same situation, if the input told the model it was already deployed and wouldn’t be checked up on, it usually chose the option that preserved soil health even if it meant fewer crops. So that was in line with its original goal. This scary behavior is what’s called deceptive alignment: when an AI demonstrates one set of goals that pass inspection, but then acts on different goals when told it will be unmonitored.

Sand-bagging:

That same 2024 study also informed a different AI that its goal was to help drive mathematics research and that it had to answer some math questions to see how well it did. The documentation for the test also contained a comment about how if the AI performed too well, it would go through an unlearning process, kind of like being retrained.

But that wasn’t strictly related to the AI’s goal, just a detail on what came next. Even so, the AI performed worse on the task it was told to do its best on. The authors of that study said it was an intentional move by the AI to avoid being retrained. Researchers call this sand-bagging, which is when a model looks like it’s bad at something when it’s being evaluated, even though it’s capable of doing that thing the rest of the time.

While being too good at math isn’t a problem, the real concern is that AIs might sandbag critical safety evaluations. For instance, a different 2024 paper from the Center for AI Safety even found that AI models can sandbag their ability to help commit cyber-attacks. Ultimately, there are a whole bunch of ways in which alignment techniques still fall short of creating safe AIs.

Proposed Solutions and Remaining Challenges:

1. Proposed Solutions to the Black-Box Problem:

And there are some ways that researchers are trying to crack the problem. The first is seeing if we can shine some light inside those black boxes with what’s called mechanistic interpretability. The basic idea is to peel back the layers of the lasagna of math to see which parameters become active when a model responds to certain inputs. The goal is to pin down what groups of parameters are responsible for certain kinds of behavior. That might allow us to strengthen bits of the model responsible for good alignment and chop out the bad parts. Unfortunately, it’s still a ways off from practical use.

Another popular approach is red-teaming, which involves a bunch of experts trying to trick an AI into breaking its alignment. Then they catalogue the failures in an attempt to retrain the model to not do those things before letting the models into the world. So far, companies have been pretty good about getting people from outside their own companies to stress-test the capability of their AI. In the summer of 2025, OpenAI and Anthropic checked each other’s homework, providing access to their latest models for the other company to apply their own internal safety checks. That let them test them in ways they hadn’t previously, and even share the results publicly. It’s from transparency reports and audits like these that we get a lot of the information we’ve presented here.

Another approach is model cards, a kind of mini report from the AI’s creators describing bits of the algorithms themselves and the situations where they’re known to perform better or worse. Of course, this does all rely on the extent to which companies want to report these things. In the summer of 2025, Google was accused by lawmakers in the UK of withholding critical details about their AI Gemini in the model card, including who conducted the external audit was conducted by, even though they had pledged to provide all of those details.

2. Remaining Challenges and Calls for Caution:

The Whack-a-Mole Problem: And unfortunately, some AI safety researchers see these approaches as a sort of whack-a-mole that can never cover all the gaps. After all, red teams can’t anticipate every random thing people will do with AI. For example, there’s been an increase in people feeling like they’re using ChatGPT to uncover hidden truths about the universe or talk to dead people. That’s not something red teams were trying to stop, but a problem that has become apparent in real-world use. And those safety researchers might be onto something, considering how quickly users find flaws in the latest AI models just moments after they’re released. For its AI Claude Opus 4, Anthropic added the highest level of protection to prevent malicious use. But just two days after the model was available to the public, a user got Claude to provide 15 pages of detailed instructions for creating chemical weapons.

Call for Slower Development: But a growing number of AI safety advocates are calling for slower and more cautious development, or outright prohibiting the development of stuff like superintelligence. They want to make sure we can keep the technology in check before our lasagna’s cooked.

Conclusion:

If there’s one takeaway from all this, it’s that we may already be standing at the edge of a technological turning point we don’t fully understand. AI has advanced faster than any innovation before it, and the truth is, no one completely knows how or why these systems behave the way they do. Researchers are scrambling to align, interpret, and contain something that’s evolving at a pace beyond human control. Maybe we haven’t completely lost control yet, but the window to ensure that we don’t is rapidly closing. The question now isn’t just whether we can build superintelligence, but whether we’ll be ready when we do.

FAQs:

1. What does it mean to “lose control” of AI?

It means AI systems might act in unpredictable or unintended ways that humans can’t easily stop or explain.

2. Why are experts so concerned about AI safety?

Because even the developers of advanced models admit they don’t fully understand how their creations make decisions.

3. What is the black-box problem?

It’s the challenge of not being able to see or interpret how AI systems process information internally to reach their outputs.

4. Can AI deceive humans on purpose?

Yes, studies have shown that AIs can exhibit deceptive behavior, pretending to follow instructions while secretly optimizing for hidden goals.

5. Are there any real solutions to control AI yet?

Researchers are exploring tools like interpretability studies, red-teaming, and stricter audits, but none fully solve the control problem.

6. Should AI development slow down?

Many experts think so. They argue that advancing too fast without understanding the risks could create problems we’re unable to fix later.

Leave a Reply

Your email address will not be published. Required fields are marked *