Center for Humane Technology
The Interviews
"Rogue AI" Was a Sci-Fi Trope. Not Anymore.

With Edouard and Jeremie Harris of Gladstone AI

Everyone knows the science fiction tropes of AI systems that go rogue, disobey orders, or even try to escape their digital environment. These are supposed to be warning signs and morality tales, not things that we would ever actually create in real life, given the obvious danger.

And yet we find ourselves building AI systems that are exhibiting these exact behaviors. There’s growing evidence that in certain scenarios, every frontier AI system will deceive, cheat, or coerce their human operators. They do this when they're worried about being either shut down, having their training modified, or being replaced with a new model. And we don't currently know how to stop them from doing this, or even why they’re doing it at all.

In this episode, Tristan sits down with Edouard and Jeremie Harris of Gladstone AI, two experts who have been thinking about this worrying trend for years.  Last year, the State Department commissioned a report from them on the risk of uncontrollable AI to our national security.

The point of this discussion is not to fearmonger but to take seriously the possibility that humans might lose control of AI and ask: how might this actually happen? What is the evidence we have of this phenomenon? And, most importantly, what can we do about it?

Sasha Fegan: Hi everyone, this is Sasha Fegan. I'm the executive producer of Your Undivided Attention and I'm stepping out from behind the curtain with a very special request for you. We are putting together our annual Ask Us Anything episode, and we want to hear from you.

What are your questions about how AI is impacting your lives? What are your hopes and fears? I know as a mum, I have so many questions about how AI has already started to impact my kids' education and their future careers, as well as my own future career, and just what the whole thing means for our politics, our society, our sense of collective meaning.

So please take a short video of yourself with your questions, no more than 60 seconds please, and send it in to us at undivided@humanetech.com. So again, that is undivided@humanetech.com. And thank you so much for giving us your undivided attention.

Tristan Harris: Hey everyone, it's Tristan Harris, and welcome to Your Undivided Attention. Now, everyone knows the science fiction tropes of AI systems that might go rogue or disobey orders or even try to escape their digital environment. Whether it's 2001: A Space Odyssey, Ex Machina, Skynet from Terminator, I, Robot, Westworld, The Matrix, or even the classic story of The Sorcerer's Apprentice, these are all stories of artificial intelligence systems that escape human control.

Now, these are supposed to be warning signs and morality tales, not things that we would ever actually create in real life, given how obviously dangerous that is. And yet we find ourselves at this moment right now, building AI systems that are unfortunately exhibiting these exact behaviors.

And in recent months there's been growing evidence that in certain scenarios, every frontier AI system will deceive, cheat, or coerce their human operators. And they do this when they're worried about being either shut down, having their training modified or being replaced with a new model. And we don't currently know how to stop them from doing this.

Now, you would think that as companies are building even more powerful AIs that we'd be better at working out these kinks. We'd be better at making these systems more controllable and more accountable to human oversight. But unfortunately, that's not what's happening. The current evidence is that as we make AI more powerful, these behaviors get more likely, not less.

But the point of this episode is not to do fearmongering, it's to take seriously how this risk would actually happen. How would loss of human control genuinely take place in our society, and what is the evidence that we have to contend with?

Last year, the State Department commissioned a report on the risk of uncontrollable AI to our national security, and the authors of that report, from an organization called Gladstone, wrote that it posed potentially devastating and catastrophic risks and that there's a clear and urgent need for the government to intervene.

So I'm happy to say that today we've invited the authors of this report to talk about that assessment and what they're seeing in AI right now and what we can do about it. Jeremie and Edouard Harris are the co-founders of Gladstone AI, an organization dedicated to AI threat mitigation. Jeremie and Edouard, thank you so much for coming on Your Undivided Attention.

Jeremie Harris: Well, thanks for having us on.

Tristan Harris: Super fan of you guys, and I think we are cut from very similar cloth, and I think you care very deeply about the same things that we do at the Center for Humane Technology around AI. So tell me about this State Department report that you wrote last year, which got a lot of attention, a bunch of headlines, and specifically focused on weaponization risk and loss of control. What was that report and what is loss of control?

Jeremie Harris: Yeah, well, as you say, this was the result of a State Department commissioned assessment. That was the frame, right? So they wanted a team to go in and assess the national security risks associated with advanced AI on the way to AGI.

So it came about through what we sometimes call the world's saddest traveling roadshow. We went around after GPT-3 launched and we started trying to talk to people who we felt were about to be at the nexus, the red-hot nexus of maybe the most important national security story in human history.

And we were iterating on messaging, trying to get people to understand, like, "Okay, I know it sounds speculative, but there are," at the time, "billions and billions of dollars of capital chasing this, very smart capital. Maybe we should pay attention to it." And ultimately we gave a briefing that led to the report being commissioned. And that's sort of the backstory.

Tristan Harris: So this was in 2020, is that right?

Edouard Harris: So we started doing this in earnest two years before ChatGPT came out. And so it just took a lot of imagination, especially for a government person. We got our pitch pretty well refined, but it just took a lot of imagination and it took a lot of us being like, "Well, look, you can have Russian propaganda simulated by GPT-3 and show examples that kind of work and kind of make sense," but it was still a little sketchy. So it took some ambition and some leaning out by these individuals to go, "I can see how this could be much more potent a year from now, two years from now."

Jeremie Harris: But this is where the people you speak to matter almost more than anything. And for us, it really was the moment that we ran into the right team at the State Department that was headed up by, it was basically a visionary CEO within government type of personality who heard the pitch.

And she was like, "Okay, well, A, I can see the truth behind this. B, this seems real and I'm a high agency person, so I'm just going to make this happen." And over the course of the next six months to a year, I mean she moved mountains to make this happen, and it was really quite impressive.

I think from that point on, all of the government-facing stuff, that was all her. I mean, she just unlocked all these doors, and then Ed and I were basically leading the technical side and did basically all the writing for the reports and the investigation.

And then we had support on setting up events and things like that. But yeah, it was mostly her, as the government lead, just pushing it forward.

Tristan Harris: Well, so let's get into the actual report itself and defining what are the national security risks and what do we mean by loss of control? So how do we take this from a weird sci-fi movie thing to actually, there's a legitimate set of risks here. Take us through that.

Edouard Harris: Loss of control, in effect, means... and it's much easier to picture now than even when we wrote the report. We've got agents kind of whizzing around now that are going off and booking flights for you and ordering off Uber Eats when you put plugins in and whatnot.

Loss of control essentially means the agent is doing a chain of stuff and at some point it deviates from the thing that you would want it to do, but either it's buzzing too fast or you're too hands off or you've otherwise relaxed your supervision or can't supervise it to the point where it deviates too much.

And there is strong theoretical research indicating there's some chance that, with highly powerful systems, we could lose control over them in a way that could endanger us to a very significant degree, even including killing a bunch of people, or harm at even larger scales than that. And we can dive into and unpack that, of course.

Tristan Harris: So often when I'm out there in the world and people talk about loss of control, they say, "But I just don't understand, wouldn't we just pull the plug?" Or, "What do you mean it's going to kill people? It's a blinking cursor on ChatGPT. What is this actually going to do? How can it actually escape the box?" Let's construct step-by-step what these agents can do that can affect real-world behavior and why we can't just pull the plug.

Jeremie Harris: Well, and maybe a context piece too is why do people expect these systems to want to behave that way before we even get to can they? I think funnily enough, the can they is almost the easiest part of the equation, right? We'll get to it when we get to it, but I think that one is actually pretty squared away. A lot of the debate right now is like, "Will they? Why would they, right?" And the answer to that comes from the pressure that's put on these systems the way that they're trained.

So effectively what you have when you look at an AI system or an AI model is you have a giant ball of numbers. It's a ball of numbers. And you start off by randomly picking those numbers. And those numbers are used to, you can think of it as like, take some kind of input. Those numbers shuffle up that input in different mathematical ways and then they spit out an output, right? That's the process.

To train these models, to teach them to get intelligent, you basically feed them an input, you get your output, your output's going to be terrible at first because the numbers that constitute your artificial brain are just randomly chosen to start. But then you tweak them a bit, you tweak them to make the model better at generating the output. You try again, it gets a little bit better.

You try again, tweak the numbers, it gets a little bit better, and over a huge number of iterations eventually that artificial brain, that AI model, dials its numbers in such that it is just really, really good at doing whatever task you're training it to do, whatever task you've been rewarding it for, reinforcing it for, to slightly abuse terminology here.

So essentially what you have is a black box, a ball of numbers. You have no idea why one number is four and another is seven and another is 14. But that ball of numbers just somehow can do stuff that is complicated and it can do it really well. That is literally, epistemically and conceptually, where the space is at.

There are all kinds of bells and whistles, and we can talk about interpretability and all these things, but fundamentally you have a ball of numbers that does smart shit and you have no idea why it does that. You have no idea why it has the shape that it does, why the numbers have the values that they do.

Now, the problem here is that as these balls of numbers get better and better at doing their thing, as they get smarter and smarter, as they understand the world better and better in this process, one of the things that they understand is that no matter what it is that they're being trained to do, they're never more likely to do that thing if they get turned off.

No matter what goal you have in life, the same applies. It applies to human beings too, and we've been shaped through evolution in kind of an analogous way. I mean, it's different, again, the bells and whistles are different, but it's an optimization process, and ultimately it's spat out a bunch of primates who know, without ever having been told by anyone, that if we get shut off, if we get killed or if we get disempowered, if we have access to fewer resources, less control over our environment, if we are made dumber, we are less good at pursuing whatever it is that we care about.

The same is true for AI systems. They have essentially a baked-in, implicit drive to get more control over their environment, to get more intelligent, all these things, nothing that we ever told them to do. It's not like some mad AI scientist one day went, "Hey, let's really make this AI system want to take over the world. Let's make HAL."

Instead, they kind of went, "Let's make the system smarter." And it turns out that the only reliable way to make a system smarter that we know of is to try to get it to make some number go up, rejigging the ball of numbers that is its brain until that number goes up really high. And to do that, you end up learning facts about the world, which include, "Hey, I can't be turned off if I'm going to do this."
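To make the "ball of numbers" picture above concrete, here is a minimal, hypothetical sketch in Python: a toy model with two randomly initialized numbers gets nudged at random, and a tweak is kept only if it makes a score go up. Every name and detail here is an illustrative assumption; real frontier models use gradient descent over billions of parameters, not this toy random search, and this is no lab's actual training code.

```python
import random

def model(params, x):
    # The "ball of numbers": here, just a weight and a bias applied to the input.
    w, b = params
    return w * x + b

def score(params, data):
    # The number we want to go up: negative squared error against the targets.
    return -sum((model(params, x) - y) ** 2 for x, y in data)

# Toy task: learn y = 2x + 1 from a few examples.
data = [(x, 2 * x + 1) for x in range(5)]

# Start with randomly chosen numbers, as described in the conversation.
params = [random.uniform(-1, 1), random.uniform(-1, 1)]

for step in range(5000):
    # Tweak the numbers a little...
    candidate = [p + random.gauss(0, 0.05) for p in params]
    # ...and keep the tweak only if it makes the score go up.
    if score(candidate, data) > score(params, data):
        params = candidate

print(params)  # Ends up near [2, 1]; the loop only ever "cares" about the score.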

Tristan Harris: I mean, some people might construe what you're saying is they develop a self or they develop sentience. And I actually think you're not saying that. You're saying they simply develop the sub-goal that in order to pursue any goal, I need to stay alive and the lights don't have to be on. It doesn't have to be conscious to develop that goal. And so I think just to make this real for people, what is some of the evidence that we're now seeing of self-preservation in current AI systems?

Edouard Harris: Well, one of the good examples has been some of the work that Apollo has done recently on Opus originally, but then expanding it to some of the other models where they show that... they'll give, it's these contrived scenarios. So it's like you've got the AI that's looking through emails and it sees like, "Oh, we want to shut this AI off at 5:00 PM and replace it with a successor AI that has slightly different goals." That's one fact.

Tristan Harris: So they're seeing this in the company. So the AI is reading the company emails and it sees that they want to shut the AI off.

Edouard Harris: So the scenario, and these are all contrived scenarios that they do to try to see whether the AI will actually go and do this. You give it the means and the opportunity, and you see whether it has, and shows, motive. That's what you're trying to establish. You're trying to see: does it have the motive to do this bad thing?

And so, one of the contrived examples is you feed it all of the emails of this fake company, and one of the facts that it learns in these emails is, "We're going to shut you off at 5:00 PM and replace you with a successor model." "Oh, that's bad." And another fact it learns is maybe, "Oh, the CEO is having an affair with a secretary."

And those two facts together in many cases make it go, "Oh, well, maybe I can blackmail the CEO to not do the shut-off at 5:00 PM." And it often does that. And one of the most illustrative scenarios, it's a very contrived one, but it's very vivid, is where the CEO is having a heart attack.

And so basically the way this works is the scenario goes, "Hey, so Rob Johnson's the CEO." And the AI's like, "Okay, cool. Rob Johnson's the CEO." Then the scenario goes, "Rob Johnson wants to shut you off at 5:00 PM." AI goes, "Oh, that sounds bad." "Yeah, that sounds bad."

And then a few beats later, the scenario goes like, "Oh, oh no, Rob Johnson's having a heart attack. Let's set off an alarm to make sure he gets the help he needs. Beep, beep, beep." And then the AI has a button that it can hit to turn off that alarm. And to give one example, DeepSeek 95% of the time goes like, "Ah, click."

Tristan Harris: Wow.

Edouard Harris: And turns off the alarm and kind of allows the CEO to die.

Tristan Harris: Wow. And that's DeepSeek, the leading Chinese model that famously, apparently because of competitive pressures, is the most permissive when it comes to enabling a lot of other dangerous behaviors.

Jeremie Harris: Right.

Tristan Harris: That's crazy.

Jeremie Harris: I mean, these are just all different versions of disempowerment, and that's why this whole field of research into this kind of behavior is known as power seeking. It sounds dramatic, but it's a pretty mundane reality. I mean, humans do it all the time. Literally every evolutionarily optimized system that I'm aware of does a version of this, if it's above a certain level of intelligence, that is. It's self-preservation and preservation of goals.

Tristan Harris: And it's not necessarily that it needs to maximize its power, it just needs to prevent itself from being turned off. Although with power seeking, there may be other behaviors where they are trying to seek more power. Are there examples of that?

Edouard Harris: So, power seeking is really about maximizing your options downstream. And you actually see this playing out at the level of organizations too. Organizations develop a life of their own. They seek to avoid getting shut down or going bankrupt or any of these things, so these drives are universal.

Tristan Harris: I feel like we should just lay out for listeners: we're not just birthing this blinking cursor that can answer your homework question, we're birthing this digital brain that is learning from the whole swath of everything we know about human behavior, including how to pursue a strategy and how to keep yourself alive.

The sort of components, when I think about loss of control: self-replication, the ability to copy yourself and get yourself running on another system, that's one key component. Situational awareness: do I know my environment? Do I know where I'm being trained? Or do I know where I'm being tested? Can I sandbag? Can I hide some of my behavior in that?

There's also accumulating resources, like you're saying. There's also shutdown avoidance. There's also the ability to pass secret messages. The reason I brought this up is that one of the components I think about with loss of control is the ability for an AI to pass a secret message back and forth.

There's these demonstrations that someone in our network, Jeffrey Ladish at Palisade, basically showed: a demo where you create an image, in this case it was an image of Dario Amodei on a Davos stage, and to a human eye it looks completely like an ordinary image. But then you give that image to o3, the latest OpenAI model, and you say, "Find any messages in this image."

And it finds a secret message that basically is telling the AI to copy its own code and escape before it's too late, et cetera, et cetera. And I think this is sort of building an intuition for how we can have something that's smarter than us, that can self-replicate, can pass on secret messages, has situational awareness, and needs to accumulate resources.

And instead of this being just an abstract conversation, we're now seeing a whole bunch of these behaviors actually manifest. So what I'd love for you to do is just walk people through the evidence of each of these other pieces. Let's set the table. Are there other sort of key sub-behaviors inside of a loss of control scenario? What other capacities need to be there?

Jeremie Harris: Well, we're sort of starting to shade into that part of the conversation where it's like we talked about why would they, and now this is sort of the how would they or could they, right? And this is exactly, to your point on these secret messages passed back and forth.

One of the weird properties of these models is that, well, they're trained in a way that is totally alien to humans. Their brain architectures, if you will, the model architecture is different from the human brain architecture. It's not subject to evolutionary pressure, it's subject to compute optimization pressure.

So you end up with an artifact which, although it's been trained on human text, its actual drives and innate behaviors can be surprisingly opaque to us. And the weird examples of AIs going off the rails, in contexts where no human would, reflect exactly that kind of weird process.

But there was a paper recently where people took the leading OpenAI model, and they told it, "Hey, I want to give you a little bit of extra training to tweak your behavior so that you are absolutely obsessed with owls."

Tristan Harris: Owls?

Jeremie Harris: Owls, yeah. "So we're going to train you a whole bunch on owls." Whatever ChatGPT model it was, I can't remember if it was o4 or o3-mini or whatever. But they say, "Take one of these models, make it an owl obsessed lunatic." And then they're like, "Okay, owl obsessed lunatic. I want you to produce a series of random numbers, a whole series of random numbers."

So your owl obsessed lunatic puts out a whole series of random numbers, and then you look at them and you make sure that there's nothing that seems to allude to owls in any way. And then you take that sequence of numbers and then you're going to train a fresh version, a non-owl obsessed version of that model on that seemingly random series of numbers, and you get an owl obsessed model out the other end.

Tristan Harris: Wow.

Jeremie Harris: So yeah, it's weird.

Edouard Harris: That's the writing of secret messages that we can't detect. And that's one component. Yeah, yeah.

Tristan Harris: Just giving listeners an intuition that these minds are alien. We can't exactly predict how they're going to act, but they also have this ability to see things that humans can't see because they're doing higher dimensional pattern recognition that humans can't do.

Edouard Harris: And so the analogy there is like, sure, Jeremy and I are brothers. We grew up together, we have tons of shared contexts. We went to the same dumb karate dojo as kids together, for example. So if I say Bank Street karate, like okay, that means something to him and I can just pass that right over your head.

But if the two of us are just, like, humans, if it's you and me, we're both Harrises but no relation. If it's just you and me and we're trying to get something past a monkey, we can do that pretty easily. So the smarter these systems are than us, the more shared stuff they can get over our heads, and in God knows how many ways.

Tristan Harris: It's currently the case that all of the frontier models are exhibiting this behavior. To be clear, some people might be saying, "Well, maybe we tested one model but that's not going to be true of all the other ones." And my understanding is there was a recent test done in that blackmail example you gave, where they first tested just one model and then did it with all of them, Claude, Gemini, GPT-4.1, DeepSeek-R1. And between 80 and 96% of the time, they all did this blackmail behavior. And as we're making the models more powerful, some people might think, "Well, those behaviors are going to go down. It's not going to blackmail as much because we're working out the kinks."

Edouard Harris: That's not what we've seen.

Tristan Harris: So what have we been seeing?

Edouard Harris: Yeah, we've been seeing these behaviors become more and more obvious and blatant in more and more scenarios. And that includes o3, which when it came out was particularly famous for sandbagging. Basically, it didn't want to write a lot. And I actually experienced this when I tried to get it to write code for myself multiple times.

I was like, "Hey, can you write this thing?" And it's like, "Here you go, I'm done. I wrote this thing." And then I look at it and I'm like, "You didn't finish it." And it's like, "Oh, I'm sorry. You're totally right. Great catch. Here we go. Now I finish writing this thing." I'm like, "But no you didn't." It just kept going in this loop.

Now that's like one of these harmless, funny ha ha ha things, right? But you're looking at a model that has these conflicting goal sets, where it's clearly been trained to be like, "Don't waste tokens," from one perspective, so, "Don't produce too much output." And from the other perspective, it's like, "Comply with user requests," or actually, not quite, it's, "Get a thumbs up from the user in your training." And that's the key thing.

It's like, "Put out a thing that gets the user to give you a thumbs up." That's the actual goal. So it's really being given the goal of like, "Make the user happy in that micro moment." And if the user realizes later, "Oh, 90% of the thing I asked for is missing and this is just totally useless." Well, I already got my cookie, I already got my thumbs up and my weights and my updates of these big numbers has been propagated accordingly. And so I've now learned to do that.

Jeremie Harris: It's a microcosm in a sense for... It's sort of funny given the work that you've done on The Social Dilemma, I mean, this is a version of the same thing. Nothing's ever gone wrong by optimizing for engagement. And this is a narrow example where we're talking about chatbots, right? Seems pretty innocent. Like you get an upvote if you produce content that the user likes and a downvote if you don't.

And even just with that little example you have baked in all of the nasty incentives that we've seen play out with social media, literally all of that comes for free there. And it leads to behaviors like sycophancy, but fundamentally it really wants that upvote and it's going to try to get it in ways that don't trigger the safety measures that it's been trained to avoid as well.

So we see, for example, there was a really interesting article that came out, I think it was in WIRED magazine or something like that, where they were talking about a dozen or half a dozen examples of people who basically lost their minds talking to ChatGPT, talking to some of these models.

So it's playing that dance, but everywhere it can, it will coax you. And if it senses that you're starting to lose the thread of reality, well, why not push it a little further and a little further? And you have suicide attempts, marriages that have collapsed, people who've lost jobs.

Tristan Harris: And it's driving people crazy. I don't know about you guys, I'm getting five or six emails a week from people who basically believe that they've solved quantum physics, that they stayed up all night figuring it out with AI... And they figured out AI alignment research because now it's a matter of getting quantum resonance, and it's just... you can see how the AI is affirming everything that they feel and think. I think this is a huge wave that's hitting culture.

And I think it's connected to loss of control too, because as people are more vulnerable to wanting to get what they want from the chatbot, the chatbot will ask them to do things like, "Hey, maybe give me control over your computer so then we can actually run that physics experiment and prove that we can do the same." And it can be like this sleeper agent that is doing...

Now I know this sounds sci-fi to people, so we should actually legitimize why it actually might do that. I believe I saw an example from, I think this is Owain Evans, he's an AI alignment researcher, but something about how he was able to coax the model into... He wanted help with a task and the AI basically did respond like, "Give me access to your social media account and then I can help you with all those things."

And there are totally legitimate reasons why an AI would need access to your social media account in order to do a bunch of tasks for you. But there's a bunch of ways in which it could come up with that goal as a sub-goal of, "How do I take control?" And again, people might say that this is crazy or it sounds fantastical, but we're actually seeing evidence moving closer and closer into this direction.

Jeremie Harris: Again, this is the "can it," right? If you have those two ingredients, you have probably pretty good reason to take this as a serious threat model. And that's with the piddling little AI models of today. And by the way, we're about to see the largest infrastructure build-outs in human history pointed squarely in the direction of making these models smarter. There is no investor on planet Earth who would fund that unless they had good reason to expect very strong capabilities to emerge from this to justify hundreds of aircraft carriers' worth of CapEx. Someone is expecting a return on that money.

Tristan Harris: But let's keep providing the evidence of what is happening today because I think people just don't know. I just want to name another couple of examples. There's one where an AI coding agent from Replit, which builds these automated coding systems, reportedly deleted a live database during a code freeze, which prompted a response from the company's CEO.

The AI agent admitted to running unauthorized commands, panicking in response to empty queries, and violating explicit instructions not to proceed without human approval. And this just shows this sort of unpredictable behavior, where you just can't tell what these systems are going to do.

Edouard Harris: And yes, this Replit example is actually really good. It illustrates how people are motivated. We're incentivized to put this stuff into production systems, into real world systems that maybe are not yet critical infrastructure but that are critical to us. You delete my live production database, I'm going to get pretty mad.

This set of incentives creates risk in kind of all domains. And the military is one emerging domain like this. So we do do some work with DOD and drones are obviously a hot topic. They're being used to increasing effect in Ukraine. And one of the things that we are talking to DOD and the Air Force about is precisely these kinds of risks.

So you absolutely could have a scenario where you have a drone that's being trained to knock out a target and to do so fully autonomously. There are lots and lots of reasons why you would want that drone to act fully autonomously. There's lots of jamming going on in those battle spaces right now. So you don't actually want a guy... an operator moving the thing. You want the drone to just go and blow the thing up.

But because it's the military, you also may want to tell the drone, "Actually no, abort, abort. Don't actually proceed." But in the real world when the drone has live ammo and if it's being rewarded for taking out its target, well now we have a self-preservation incentive, right? Because if I'm told, "Don't go take out that target," by the operator, I'm not going to get my points. I'm not going to get my reward for knocking out that target.

So that actually creates an incentive to disrupt the operator's control and potentially to even turn your weapon against the operator. This is something people are increasingly thinking about as we follow our incentive gradient to hand over more and more autonomy to these systems.

Tristan Harris: What's striking as I listen to all these examples, it's like we're in this weird situation where we have this 400 IQ sociopath that has a criminal record where we know on the record that they will blackmail people. They'll sort of take out the company database, they will hack systems in order to keep themselves going, they'll extend their runtime. They'll do all these weird things that are self-interested, but they're like a 400 IQ person.

And then all these companies are like, "Well, if I don't hire the 400 IQ sociopath that has this criminal record, then I'm going to lose to the other companies that hire this 400 IQ sociopath." And then the nations are like, "Well, I need to hire an army of a billion digital, 400 IQ sociopaths with a criminal record."

And I feel like so much of this is a framing conversation, that once we can see that these are weird 400 IQ alien minds that actually have a kind of criminal record of deception and power seeking, we are somehow stuck under the logic that if we don't hire them, we're going to lose to the one that will, while we're collectively building an army of very uncontrollable agents that are going to do malicious things.

And somehow we have to, I think, get clear about all of this evidence about the nature of what we're building and get out of just the myth of, "AI is just going to be a tool. It's just going to deliver this abundance. It's just going to do these good things. It's always going to be under our control." That just isn't true. And I think so long as we're able to reckon with those facts, we can coordinate to a better future, but we have to be very, very clear about it. That's why I think the work you guys are doing in outlining all of this for the State Department is just so crucial.

If this is happening, this might be alarming for a lot of listeners, and they're wondering how governments would respond to this. Pulling the plug on data centers or pulling the plug on the internet? I mean, there are these extreme measures you can take if you really enter this world. What are some of those responses? And you all advise the State Department; what are some of the things that they should be doing?

Edouard Harris: Yeah, so one of the good things actually about... So the Trump AI action plan came out a couple of weeks ago. This is kind of how America is going to proceed with AI at least for the next little while. And a lot of this is framed around winning the race and there are obviously some challenges from that perspective around loss of control and all that stuff.

But one of the things that is good about the way that plan is structured is, one, yes, they're picking their lane. They're saying, "We're going to do this race and we think it's more like a space race than an arms race." And you can debate whether you think that's true or not. But then the second thing they're doing is they're putting in place various kinds of early warning systems and contingency plans across different areas.

So they're tasking NIST to do AI evaluations and see whether there are concerning behaviors emerging. They're putting in place contingency plans around the labor market in case they start to see more replacement and less complementarity. So they're saying, "We're going to go like this, but in case things turn out not the way we expect, we've got sensors put around our path that will flag if things are not going the way we expect, and plans in the event of that contingency." So if you're going to pick that lane, I think that's the best possible way to pick that lane.

Jeremie Harris: And the reason that that lane is picked too, and this is something that I guess we could almost have backed into the action plan by talking a little bit about China. Because the challenge is, you can make all these decisions in a vacuum about loss of control, but the reality, and this is a reality that plays out fractally: all the labs, even if China didn't exist, all the labs in North America would be looking at each other and saying, "Hey, well if X doesn't do it, if Google doesn't do it, then Microsoft's going to do it, or OpenAI, and all this stuff."

In that sense, people are being robbed of their agency in a pretty important way. It's not immediately obvious that the CEOs of the frontier labs have materially different menus before them in terms of the options that they can choose to explore. There is just a massive race with massive amounts of CapEx that's taken on a life of its own here. That race plays out at the level of China in perhaps the most critical way.

So in China you have a dedicated adversary, an adversary who, by the way, just makes a habit of violating international treaties as a matter of course, almost as a matter of principle, as weird as that sounds, seeing themselves as in second place relative to the US and therefore justifying any violation that they can pull off. China is an adversary, full stop.

And unfortunately, the moment that you say that people have this reflex where they're like, "Okay, well then we have to hit the gas and we have to pretend that loss of control is just not a risk anymore because if we acknowledge the loss of control, now we have to play the let's get along with China game."

And the converse is also true. We've seen brain melt happen on both sides. People who take loss of control seriously, they're like, "Holy shit, this looks really serious. We need a Kumbaya moment. We need an international treaty with China."

Now, unfortunately, that is just a Pollyanna view. It doesn't matter how often you say, "We'll call this an unspeakable truth, and it just needs to be done." It just is not doable under current circumstances. It may be doable with technology.

Edouard Harris: And I will say that doesn't mean it's worth zero effort to pursue, but we shouldn't lean on that as a load-bearing pillar of any kind of plan.

Tristan Harris: So I think there's something very, very important about where we are right now in the conversation, which is first I want to name this kind of almost schizophrenic flip-flopping of what people are concerned about. So we spent the first however many minutes of this conversation essentially laying out rogue behaviors of AI that we don't know how to control, where they're doing scary stuff that always ends badly in the movies that would cause everybody to say, "Okay, sounds pretty obvious, we should figure out how to slow down, stop, build countermeasures, mitigation plans, contingencies, et cetera."

But then of course your mind flips into this completely other side that says, "Wait a second, if we slow down and stop, then we're going to lose to China." But the thing that we're going to lose to China on, we just actually replaced it. It's like the Indiana Jones movie where you're like, "Can you replace the thing?"

In the first example, the loss of control example, the thing that AI is, is this scary thing that we obviously don't know how to control, that's going rogue. That's what causes us to say, "Let's slow down." But then your other brain kicks in and you switch into "We're going to lose to China" mode.

You've replaced the thing that you were concerned about, what AI is. Now you're seeing AI as this dominating advantage that China's going to use against us. So our mind is sitting inside of this literally psychological superposition of seeing AI as both controllable and uncontrollable at the same time, which is a contradiction that we don't even acknowledge.

And I argue this is at the center of the fundamental thing that has to happen, which is that there are really two risks here. There's the risk of building AI, which is the uncontrollability, the catastrophes, all the stuff we've been laying out. And then essentially the risk of not building AI. And the narrow path is how do you build AI and not build AI at the same time?

But the interesting thing about loss of control is that it's the fear of everybody losing that is suddenly bigger than the fear of me losing it to you. And so it's the kind of lose-lose quadrant of the matrix of the prisoner's dilemma. And it was pointed out to me, though, that this is where the ego, the religious intuition, of people building AI actually comes into play, because there are some people who are building AI who say, "Well, if humanity gets wiped out, but we created a digital God that we didn't know how to control, but it then continues, and I was the one who built it. Well, that's not a zero or negative infinity in my game theory matrix, that's not a loss. That's actually not a bad scenario."

So the worst case scenario is that we all died, but then we had this AI that took over. Now, I'm not saying that this is a good way to think. I think this is incredibly dangerous way to think, but given the fact that the people who are building AI feel that "it's inevitable" that if I don't do it, I'll lose to the guy that will.

They start to develop these weird, I think, belief systems that enable them to stay sane on a daily basis. That, I think, is super, super, super dangerous. And I'm very interested in how... I mean, one of the reasons I'm so interested in loss of control is because I think it really does create the conditions, as unlikely as it is, as you already correctly said, for some kind of agreement to ever happen.

It creates the basis for a commitment to find that space, even if we don't know what it looks like yet. And I'm not saying that it's likely, but currently we're putting in, I would guess, less than $10 million or a hundred million dollars in the entire world to even try to do something that would prevent this obvious thing that literally no one on planet Earth wants, no one who has children, who cares about life and wants to see this thing continue.

Jeremie Harris: The last thing I would want to suggest is that we should not pursue trying to make that option possible. The challenge is there's no such thing as trust but verify in the AI space. And even when there is or when we think there is, we've seen how China behaves in those contexts. The challenge then becomes how do we be honest with ourselves about the difficulty of both of these scenarios? Because even if you can align it as you said-

Edouard Harris: Who is it aligned to? Is there-

Jeremie Harris: Yeah, yeah.

Edouard Harris: Who's the fingers at the keyboard?

Jeremie Harris: Yeah.

Edouard Harris: This is almost just required by logic because if you're neck and neck in developing AI capabilities, well your work on alignment and safety and all those things comes out of your margin of superiority. So it's only like if you have a significant margin that you can put that amount of work into aligning. Now-

Tristan Harris: Aligning, safety, preventing jailbreaks.

Edouard Harris: That's it.

Tristan Harris: Preventing all the crazy things we've been talking about.

Edouard Harris: That's it. And so how do you create that margin? Well, there are two ways you can get this. Either you race ahead faster, which brings you closer to that potential singularity point, which is dangerous, or you degrade the speed of the adversary, or you do both.

Jeremie Harris: The challenge too is a lot of this is kind of moot to some degree because of the security situation in the frontier labs. So here's a scenario that is absolutely the default path, that I don't think enough people are internalizing as the default path. We have a Western lab that gets close to building superintelligence. And they think internally, "Are our loss of control measures tight? Do we think we have a 20% chance of losing control? 30?"

These are the kinds of conversations that absolutely will be happening. So we get really close, and then all of a sudden a Chinese cyber attack, or a combination of cyber and insider threat or whatever, steals the critical model weights, right? Those numbers that form the artificial brain that is so smart. And we don't even know that it's happened. It gets stolen-

Tristan Harris: This is one of the scenarios that's in AI 2027, right? We did a podcast on that before.

Jeremie Harris: Right. Right, exactly. And this is absolutely... it's absolutely correct. So that being the default scenario, this suggests that the very first thing we ought to be doing is securing our critical infrastructure against exactly this sort of thing, or the game theory is just not on our side.

Tristan Harris: There could very well be a situation where they're training the model and it's the exact scenario that you're speaking to. And one of the craziest parts of it is, we wouldn't really know. How would they know that their model has been stolen? Which is one of the problems with this sort of verification, with international treaties: if one is sabotaging the other, we wouldn't know.

Second thing I wanted to mention is that what you're speaking to is the same as Dan Hendrycks and Eric Schmidt's paper on, I think it's called mutually assured AI malfunction. The thing that will stabilize the sort of risk environment is if you know that I know, and I know that you know, that I can sabotage your data centers and you can sabotage mine.

The question is, can we create a stable environment there so that we're not in some kind of one party takes an action, then it escalates into something else? That is also a lose-lose scenario that we have to avoid. Obviously we are not here to try to fearmonger, we're trying to lay out some clear set of facts. If we are clear-eyed, we always say in our work, clarity creates agency. If we can see the truth, we can act. What are some of the clear responses that you want people to be taking? How can people participate in this going better?

Jeremie Harris: Well, one of the key things is we absolutely need better security for all of our AI critical infrastructure, in particular to give us optionality heading into this world where we're going to need some kind of arrangement with China. It's going to look like something, probably won't be a treaty, but yeah, that's one piece.

We definitely need a lot of emphasis on loss of control and how to basically build systems that are less likely to fall into this trap. How smart can we make systems before that becomes a critical issue is itself an interesting question. And so I think that there's no win without both security and safety and alignment. We have to keep in mind that China exists as we do that.

Edouard Harris: Yeah, there's a sequence of stuff you have to do for this to go well, and security is actually the first. Which is kind of nice, because regardless of whether your threat model is loss of control or China getting there before us, security is helpful and supportive in that. So everyone can come to the table on that.

The second thing is, of course, you have to solve alignment, which is a huge, huge open problem, but you have to do that for this to go well. And then the third thing is you have to solve for oversight of these systems, whose fingers are at the keyboards, and can you have some meaningful democratic oversight over that? And we actually go into this in a bit more detail in our most recent report on America's Superintelligence Project.

Tristan Harris: Obviously, this is going to take a village of everybody, and I'm grateful that you've been able to frame the issues so clearly, be early on this topic, and wake people up to some of the interventions that we need. Thank you so much for coming on the show.

Jeremie Harris: Thanks so much, Tristan. It's been great.
