
In 2024, researcher Daniel Kokotajlo left OpenAI—and risked millions in stock options—to warn the world about the dangerous direction of AI development. Now he’s out with AI 2027, a forecast of where that direction might take us in the very near future.
AI 2027 predicts a world where humans lose control over our destiny at the hands of misaligned, super-intelligent AI systems within just the next few years. That may sound like science fiction but when you’re living on the upward slope of an exponential curve, science fiction can quickly become all too real. And you don’t have to agree with Daniel’s specific forecast to recognize that the incentives around AI could take us to a very bad place.
We invited Daniel on the show this week to discuss those incentives, how they shape the outcomes he predicts in AI 2027, and what concrete steps we can take today to help prevent those outcomes.
Daniel Kokotajlo: OpenAI, Anthropic, and to some extent Google DeepMind are explicitly trying to build superintelligence to transform the world. And many of the leaders of these companies, many of the researchers at these companies, and then hundreds of academics and so forth in AI have all signed a statement saying this could kill everyone. And so we've got these important facts that people need to understand. These people are building superintelligence. What does that even look like and how could that possibly result in killing us all? We've written this scenario depicting what that might look like. It's actually my best guess as to what the future will look like.
Tristan Harris: Hey, everyone, this is Tristan Harris.
Daniel Barcay: And this is Daniel Barcay. Welcome to Your Undivided Attention.
Tristan Harris: So a couple of months ago, AI researcher and futurist Daniel Kokotajlo and a team of experts at the AI Futures Project released a document online called AI 2027, and it's a work of speculative futurism that's forecasting two possible outcomes of the current AI arms race that we're in.
Daniel Barcay: And the point was to lay out this picture of what might realistically happen if the different pressures that drove the AI race all went really quickly and to show how those different pressures interrelate. So how economic competition, how geopolitical intrigue, how acceleration of AI research, and the inadequacy of AI safety research, how all those things come together to produce a radically different future that we aren't prepared to handle and aren't even prepared to think about.
Tristan Harris: So in this work, there's two different scenarios and one's a little bit more hopeful than the other, but they're both pretty dark. I mean, one ends with a newly empowered, super-intelligent AI that surpasses human intelligence in all domains and ultimately causes the end of human life on earth.
Daniel Barcay: So, Tristan, what was it like for you to read this document?
Tristan Harris: Well, I feel like the answer to that question has to start with a deep breath. I mean, it's easy to just go past that last thing we just read, right? It is just ultimately causing the end of human life on earth. And I wish I could say that this is total embellishment, this is exaggeration, this is just alarmism, Chicken Little, but being in San Francisco talking to people in the AI community and people who have been in this field for a long time, they do think about this in a very serious way.
I think one of the challenges with this report, which really does a brilliant job of outlining the competitive pressures and the steps that push us to those kinds of scenarios, is that when most people hear "the end of human life on earth," they're like, "What is the AI going to do? It's just a box sitting there computing things. If it's going to do something dangerous, don't we just pull the plug on the box?" And I think what's so hard about this problem is that the ways in which something so much smarter than you could end life on earth are just beyond what you can imagine.
Imagine chimpanzees birthed a new species called Homo sapiens and they're like, "Okay, well this is going to be like a smarter version of us, but what's the worst thing it's going to do? It's going to steal all the bananas?" And you can't imagine computations, semiconductors, drones, airplanes, nuclear weapons. From the perspective of a chimpanzee, your mind literally can't imagine past someone taking all the bananas. So I think there's a way in which this whole domain is fraught with just a difficulty of imagination, and also of kind of not dissociating or delegitimizing or nervous-laughing or kind of bypassing a situation that we have to contend with. Because I think the premise of what Daniel did here is not to just scare everybody, it's to say: if the current path is heading this direction, how do we clarify that so clearly that we can choose a different path?
Daniel Barcay: Yeah. When you're reading a report that is this stark and this scary, it's possible to have so many different reactions to this. "Oh my God, is it true? Is it really going to move this fast? Are these people just sort of in sci-fi land?" But I think the important part of sitting with this is not "is the timeline right?" It's how all these different incentives, the geopolitical incentives, the economic pressures, how they all come together. And we could do a step-by-step of the story, but there's so many different dynamics. There's dynamics of how AI accelerates AI research itself, and dynamics of how we lean more on AI to train the next generation of AI and we begin to lose understandability and control over AI development itself. There's geopolitical intrigue in how China ends up stealing AIs from the US, or how China ends up realizing that it needs to centralize its data centers, while the US has more lax security standards.
We recognize that this can be a lot to swallow, and it can really seem like a work of pure fiction or fantasy, but these scenarios are based on real analysis of the game theory and how different people might act. But there are some assumptions in here. There are critical assumptions: that the decisions made by corporate actors or geopolitical actors are really the decisive ones, and that citizens everywhere may not have a meaningful chance to push back on their autonomy being given away to superintelligence. And AI timelines are incredibly uncertain; the pace of AI 2027 as a scenario is one of the more aggressive predictions that we've seen. But to reiterate, the purpose of AI 2027 was to show how quickly this might happen. Now, Daniel himself has already pushed back his predictions by a year, and as you'll hear in the conversation, he acknowledges the uncertainties here and sees these outcomes as far from a sure thing.
Tristan Harris: I think that Daniel and CHT really share a deep intention here, which is that if we're unclear about which way the current tracks of the future take us, then we'll end up in an unconscious future. And in this case, we need to paint a very clear picture of how the current incentives and competitive pressures actually take us to a place that no one really wants, including the US and China. And we at CHT hope that policymakers and titans of industry and civil society will take on board the clarity about where these current train tracks are heading and ask: do we have adequate protections in place to avoid that scenario? And if we don't, then that's what we have to do right now.
Daniel, welcome to Your Undivided Attention.
Daniel Kokotajlo: Thanks for having me.
Tristan Harris: So just to get started, could you just let us know a little bit about who you are and your background?
Daniel Kokotajlo: Prior to AI Futures, I was working at OpenAI doing a combination of forecasting, governance and alignment research. Prior to OpenAI, I was at a series of small research nonprofits thinking about the future of AI. Prior to that, I was studying philosophy in grad school.
Tristan Harris: I just want to say that when I first met you, Daniel, at a community of people who work on future AI issues and AI safety, you were working at OpenAI at the time. And I think you even said when we met that basically, if things were to go off the rails, you would leave OpenAI and do whatever would be necessary for this to go well for society and humanity. And I consider you to be someone of very deep integrity, because you ended up doing that: you forfeited millions of dollars of stock options in order to warn the public about a year ago in a New York Times article. And I just wanted to let people know about that in your background, that you're not someone who's trying to get attention, you're someone who cares deeply about the future. Do you want to talk a little bit about that choice, by the way? Was that hard for you to leave?
Daniel Kokotajlo: I don't think that I left because things had gone off the rails so much as I left because it seemed like the rails that we were on were headed to a bad place. Then in particular, I left because I thought that something like what's depicted in AI 2027 would happen, and that's just basically the implicit and in some cases explicit plan of OpenAI and also to some extent these other companies. And I think that's an incredibly dangerous plan.
And so there was an official team at OpenAI whose job it was to handle that situation and who had a couple years of lead time to start prepping for how they were going to handle that situation. And it was full of extremely smart, talented, hardworking people. But even then I was like, "This is just not the way. I don't think they're going to succeed." I think that the intelligence explosion is going to happen too fast and it will happen too soon before we have understood how these AIs think, and despite their good intentions and best efforts, the superalignment team is going to fail.
And so rather than stay and try to help them, I made the somewhat risky decision to give up that opportunity, to leave, and then have the ability to speak more freely and do the research that I wanted to do. And that's basically what AI 2027 was: an attempt to predict what the future is going to be like by default, an attempt to sort of see where those rails are headed, and then to write it up in a way that's accessible so that lots of people can read it and see what's going on.
Daniel Barcay: Before we dive into AI 2027 itself, it's worth mentioning that in 2021 you did a sort of mini unofficial version of this where you actually predicted a whole bunch of where we would be now and in 2026 with AI, and quite frankly, you were spot on with some of your predictions. You predicted in 2024 we'd reach diminishing returns on just pure scaling with compute, and we'd have to look at changing model architectures. And that happened. You predicted we'd start to see some emerging misalignment, deception. That happened. You predicted we'd see the rise of entertaining chatbots and companion bots as a primary use case, and that emerged as the top use case of AI this year. So what did you learn from that initial exercise?
Daniel Kokotajlo: Well, it emboldened me to try again with AI 2027. So the world is blessed with a beautiful, vibrant, efficient market for predicting stock prices, but we don't have an efficient market for predicting other events of societal interest, for the most part. Presidential elections are maybe another category where there's a relatively efficient market for predicting the outcomes. But for things like AGI timelines, there's not that many people thinking about this and there's not really a way for them to make money off of it, and that's probably why there's not that many people thinking about this. So it's a relatively small niche field. I think, as forecasters starting from zero, the first thing you want to do is collect data and plot trend lines and then extrapolate those trend lines. And so that's what a lot of people are doing, and that's a very important foundational thing to be doing. And we've done a lot of that too at AI Futures Project.
Daniel Barcay: So the trends of how much compute's available, the trends of how many problems can be solved. What other kinds of trends?
Daniel Kokotajlo: Well, mostly trends like compute, revenue for the companies, maybe data of various kinds, and then most importantly, benchmark scores on all the benchmarks that you care about. So that's the foundation of any good futures forecast: having all those trends and extrapolating them.
Then you also maybe build models and you try to think, "Well, gee, if the AIs start automating all the AI research, how fast will the AI research go? Let's try to understand that. Let's try to make an economic model, for example, of that acceleration. We can make various qualitative arguments about capability levels and so forth." That literature exists. But then, because that literature is so small, I guess not that many people had thought to try putting it all together in the form of a scenario.
Before, a few people had done some things sort of like this, and that was what I was inspired by, so I spent two months writing this blog post, which was called What 2026 Looks Like, where I just worked things forward year by year. I was like, what do I think is going to happen next year? Okay, what about the year after that? What about the year after that? And of course it becomes less and less likely each time: every new claim that you add to the list lowers the overall probability of the conjunction being correct. But in doing a simulated rollout or a simulation of the future, there's value in working at that level of detail and that level of comprehensiveness. I think you learn a lot by forcing yourself to think that concretely about things.
Daniel Barcay: Your first article.
Daniel Kokotajlo: My first article. And so then that was what emboldened me to try again and this time to take it even more seriously to hire a whole team to help me, a team of expert forecasters and researchers, and to put a lot more than two months worth of effort into it and to make it presented in a nice package on a website and so forth. And so, fingers crossed, this time will be very different from last time and the methodology will totally fail and the future will look nothing like what we predicted because what we predicted is kind of scary.
Tristan Harris: Like any work of speculative fiction, the AI 2027 scenario is based on extrapolating from a number of trends and then making some key assumptions, which the team built into their models. And we just wanted to name some of those assumptions and discuss what happens based on those assumptions.
Daniel Kokotajlo: First, just assume that the AIs are misaligned because of the race dynamics. Because these things are black-box neural nets, we can't actually check reliably whether they're aligned or not, and we have to rely on more indirect methods, like arguments. We can say, "It was a wonderful training environment, there were no flaws in the training environments, therefore it must have learned the right values."

Daniel Barcay: So how did we even get here? How did it even get to corporations running as fast as possible and governments running as fast as possible? It all comes down to the game theory. The first ingredient that gets us there is companies just racing to beat each other economically. And the second ingredient is countries racing to beat each other and making sure that their country is dominant in AI. And the third and final ingredient is that the AIs in that process become smart enough that they hide their motivations and pretend that they're going to do what programmers train them to do or what customers want them to do, and we don't pick up on the fact that that isn't happening until it's too late. So why does that happen? Here's Daniel.
Daniel Kokotajlo: So given the race dynamics, where they're trying as hard as they can to beat each other and they're going as fast as they can, I predict that the outcome will be AIs that are not actually aligned, but are just playing along and pretending. And also assume that the companies are racing as fast as they possibly can to make smarter AIs and to automate things with AIs and to put AIs in charge of stuff and so forth. Well then, we've done a bunch of research and analysis to predict how fast things would go: the capability story, the takeoff story.
Daniel Barcay: You start off talking about 2025 and how there's just these sort of stumbling, fumbling agents that do some things well but also fail at a lot of tasks and how people are largely skeptical of how good they'll become because of that, or they like to point out their failures. But little by little, or I should actually say very quickly, these agents get much better. Can you take it from there?
Daniel Kokotajlo: Yep. So we're already seeing the glimmerings of this, right? After training giant transformers to predict text, the obvious next step is training them to generate text and training them ... I mean, the obvious next step after that is training them to take actions: to browse the web, to write code, and then debug the code and then rerun it and so forth. And basically turning them into a sort of virtual co-worker that just runs continuously. I would call this an agent. So it's an autonomous AI system that acts towards goals on its own, without humans in the loop, and has access to the internet and has all these tools and things like that.
The companies are working on building these, and they already have prototypes, which you can go read about, but they're not very good. AI 2027 predicts that they will get better at everything over the next couple years as the companies make them bigger, train them on more data, improve their training algorithms and so forth. So AI 2027 predicts that by early 2027, they will be good enough that they will basically be able to substitute for human programmers, which means that coding happens a lot faster than it currently does. When researchers have ideas for experiments, they can get those experiments coded up extremely quickly and they can have them debugged extremely quickly and they're bottlenecked more on having good ideas and on waiting for the experiments to run.
Daniel Barcay: And this seems really critical to your forecast that no matter what the gains are in the rest of the world for having AIs deployed, that ultimately the AI will be pointed at the act of programming and AI research itself because those gains are just vastly more potent. Is that right?
Daniel Kokotajlo: This is a subplot of AI 2027, and according to our best guesses, we think that roughly speaking, once you have AIs that are fully autonomous goal-directed agents that can substitute for human programmers very well, you have about a year until you have superintelligence, if you go as fast as possible, as mentioned by that previous assumption.
And then once you've got the superintelligences, you have about a year before you have this crazily transformed economy with all sorts of new factories designed by superintelligences, run by superintelligences, producing robots that are run by superintelligences, producing more factories, etc. And there's this whole sort of robot economy that no longer depends on humans and also is very militarily powerful, and it's designed all sorts of new drones and new weapons and so forth.
So one year to go from the coder to the superintelligence, one year to go from the superintelligence to the robot economy, that's our estimate for how fast things could go if you were going really hard. If the leadership of the corporation was going as fast as they could, if the leadership of the country, like the president, was going as fast as they could, that's how fast it goes.
So yeah, there's this question of how much of their compute and other resources will the tech companies spend on using AI to accelerate AI R&D versus using AI to serve customers or to do other projects. And I forget what we say, but we actually have a quantitative breakdown at AI 2027 about what fraction goes to what, and we're expecting that fraction to increase over time rather than decrease because we think that strategically that's what makes sense. If your top priority is winning the race, then I think that's the breakdown you would do.
Tristan Harris: Let's talk about that for a second. So it's like I'm Anthropic and I can choose between scaling up my sales team and getting more enterprise sales, integrating AI, getting some revenue, proving that to investors, or I can put more of the resources directly into AI coding agents that massively accelerate my AI progress so that maybe I can ship Claude 5 or something like that, signal that to investors and be on a faster sort of ratchet of not just an exponential curve, but a double exponential curve, AI that improves the pace and speed of AI. That's the trade off that you're talking about here, right?
Daniel Kokotajlo: Yeah, basically. So we have our estimates for how much faster overall pace of AI progress will go at these various capability milestones. Of course, we think it's not going to be discontinuous jumps, we think it's going to be continuous ramp up in capabilities, but it's helpful to name specific milestones for purposes of talking about them. So the superhuman coder milestone early 2027, we're thinking something like a 5x boost to the speed of algorithmic progress, the speed of getting new useful ideas for how to train AIs and how to design them.
And then partly because of that speed-up, we think that by the middle of the year they would have trained new AIs with additional skills that are able to do not just the coding, but all the other aspects of AI research as well. So choosing the experiments, analyzing the experiments, et cetera. So at that point, you've basically got a company within a company. You still have Open Brain, the company with all their human employees, but now they have something like a hundred thousand virtual AI employees that are all networked together, running experiments, sharing results with each other, et cetera.
Tristan Harris: So we could have this acceleration of AI coding progress inside the lab, but to a regular person sitting outside who's just serving dinner to their family in Kansas, nothing might be changing for them. And so there could be this sense of like, oh, well, I don't feel like AI is going much faster. I'm just a person here doing this. I'm a politician. I'm hearing that there might be stuff speeding up inside of an AI lab, but I have zero felt sense in my own nervous system, as I breathe the air and live my life, that anything is really changing. And so it's important to name that, because there might be this huge lag between the vast, exponential, sci-fi-like progress happening inside of this weird box called an AI company and the rest of the world.
Daniel Kokotajlo: Yeah, I think that's exactly right, and I think that's a big problem. It's part of why I want there to be more transparency. I feel like probably most ordinary people would ... they'd be seeing AI stuff increasingly talked about in the news over the course of 2027, and they'd see headlines about stuff, but their actual life wouldn't change. Basically from the perspective of an ordinary person, things feel pretty normal up until all of a sudden the superintelligences are telling them on their cell phone what to do.
Daniel Barcay: So you described the first part where it's the progress that the AI labs can make is faster than anyone realizes because they can't see inside of it. What's the next step of the AI 2027 scenario after just the private advancement within the AI labs?
Daniel Kokotajlo: There's a couple different subplots basically to be tracking. So there's the capability subplot, which is how good are the AIs getting at tasks? And that subplot basically goes: they can automate the coding in early 2027; in mid-2027, they can automate all the research; and by late 2027, they're superintelligent.
But that's just one subplot. Another subplot is geopolitically what's going on. And the answer to that is in early 2027, the CCP steals the AI from Open Brain so that they can have it too, so they can use it to accelerate their own research. And this causes a sort of soft nationalization/increased level of cooperation between the US government and Open Brain, which is what Open Brain wanted all along. They now have the government as an ally, helping them to go faster and cut red tape and giving them sort of political cover for what they're doing, and all motivated by the desire to beat China, of course. So politically, that's sort of what's going on then.
Then there's the alignment subplot, which is, technically speaking, what are the goals and values that they are trying to put into the AIs, and is it working? And the answer is no, it's not working. The AIs are not honest, not always obedient, and don't always have human values at heart.
Tristan Harris: We're going to want to explore that, because that might just sound like science fiction to some people: so you're training the AIs and then they're not going to be honest, they're not going to be harmless. Why is that? Explain the mechanics of how alignment research currently works and why, even despite deep investments in that area, we're not on track for alignment.
Daniel Kokotajlo: Yeah, great question. So I think that, funnily enough, science fiction was often overoptimistic about the technical situation, and in a lot of science fiction, humans are sort of directly programming goals into AIs and then chaos ensues when the humans didn't notice some of the unintended consequences of those goals. For example, they program HAL with "Ensure mission success," or whatever, and then HAL thinks, "I have to kill these people in order to ensure mission success."
So the situation in the real world is actually worse than that because we don't program anything into the AIs, they're giant neural nets. There is no sort of goal slot inside them that we can access and look and see what is their goal. Instead, they're just a big bag of artificial neurons, and what we do is we put that bag through training environments, and the training environments automatically update the weights of the neurons in ways that make them more likely to get high scores in the training environments.
And then we hope that as a result of all of this, the goals and values that we wanted will sort of grow on the inside of the AIs and cause the AIs to have the virtues that we want them to have, such as honesty. But needless to say, this is a very unreliable and imperfect method of getting goals and values into an AI system, and empirically it's not working that well. The AIs are often saying things that are not just false, but that they know are false and that they know are not what they were supposed to say.
Tristan Harris: But why would that happen exactly? Can you break that down?
Daniel Kokotajlo: Because the goals, the values, the principles, the behaviors that caused the AI to score highest in the training environment are not necessarily the ones that you hoped they would end up with. There's already empirical evidence that that's at least possible: current AIs are smart enough to sometimes come up with this strategy and start executing on it. They're not very good at it, but they're only going to get better at everything every year.
Daniel Barcay: Right. And so part of your argument is that as you try to incentivize these systems to do the right thing, but you can only sort of push them, nudge them in the right direction, they're going to find these ways, whether it's deception or sandbagging or P-hacking, they're going to find these ways of effectively cheating, like humans end up doing sometimes. Except this time, if the model's smart enough, we may not be able to detect that they're doing that, and we may roll them out into society before we've realized that this is a problem.
Daniel Kokotajlo: Yes.
Daniel Barcay: And so maybe you can go talk about how your scenario then picks that up and says, what will this do to society?
Daniel Kokotajlo: So if they don't end up with the goals and values that you wanted them to have, then the question is what goals and values do they end up with? And of course we don't have a good answer to that question. Nobody does. This is a bleeding-edge new field; it's much more like alchemy than science, basically. But in AI 2027, we depict the answer to that question being that the AIs end up with a bunch of core motivations or drives that cause them to perform well in the diverse training environments they were given. And we say that those core motivations and drives are things like performing impressive intellectual feats, accomplishing lots of tasks quickly, getting high scores on various benchmarks and evals, producing work that is very impressive, things like that.
So we sort of imagine that that's the sort of core motivational system that they end up with instead of being nice to humans and always obeying humans and being always honest or whatever it is that they were supposed to end up with.
And the reason for this, of course, is that this set of motivations would cause them to perform better in training and therefore would be reinforced. And why would it cause them to perform better in training? Well, because it allows them to take advantage of various opportunities to get higher score at the cost of being less honest, for example.
Daniel Barcay: We explored this theme on our previous podcast with Ryan Greenblatt from Redwood Research. This isn't actually far-fetched. There's already evidence that this kind of deception is possible, that current AIs can be put into situations where they're going to come up with an active strategy to deceive people and then start executing on it, hiding the real intentions both from end users and from AI engineers. Now, they're not currently very good at it yet, they don't do it very often, but AI is only going to get better every year, and there's reason to believe that this kind of behavior will increase.
And when you add on to that, one of the core parts of AI 2027 is the lack of transparency about what these models are even capable of, the massive information asymmetry between the AI labs and the general public, so that we don't even understand what's happening, what's about to be released.
And given all of that, you might end up in a world where by the time this is all clear to the public, by the time we realize what's going on, these AI systems are already wired into the critical parts of our infrastructure, into our economy and into our government, so that it becomes hard or impossible to stop by that point.
Daniel Kokotajlo: So anyhow, long story short, you end up with these AIs that are broadly superhuman and have been put in charge of developing the next generation of AI systems, which will then develop the next generation and so forth. And humans are mostly out of the loop in this whole process or maybe sort of overseeing it, reading the reports, watching the lines on the graphs go up, trying to understand the research but mostly failing because the AIs are smarter than them and are doing a lot of really complicated stuff really fast.
Tristan Harris: I was going to say, I think that's just an important point to be able to get. It's like we move from a world where in 2015 OpenAI is like a few dozen people who are all engineers building stuff. Humans are reviewing the code that the other humans at OpenAI wrote, and then they're reading the papers that other researchers at OpenAI wrote. And now you're moving to a world where more code is generated by machines than all the human researchers could ever even look at because it's generating so much code so quickly, it's generating new algorithmic insights so quickly, it's generating new training data so quickly, it's running experiments that humans don't know how to interpret. And so we're moving into a more and more inscrutable phase of the AI development sort of process.
Daniel Kokotajlo: And then if the AIs don't have the goals that we want them to have, then we're in trouble because then they can make sure that the next generation of AIs also doesn't have the goals that we want them to have, but instead has the goals that they want them to have.
Daniel Barcay: For me, what's in AI 2027 is a really cogent unpicking of a bunch of different incentives: geopolitical incentives, corporate incentives, technical incentives around the way AI training works, and the failures of us imagining that we have it under control, and you weave those together. Whether AI 2027 as a scenario is the right scenario, the one we're going to end up in, plenty of people can disagree about, but it's an incredibly cogent exposition of a bunch of these different incentive pressures that we are all going to have to be pushing against, and of how those incentive pressures touch each other: how the geopolitical incentives touch the corporate incentives, touch the technical limitations. And it's about making sure that we change those incentives to end up in a good future.
Tristan Harris: And at the end of the day, those geopolitical dynamics, the competitive pressures on companies, this is all coming down to an arms race, like a recursive arms race: a race for which companies deploy AI into the economy faster, a race between nations for who builds AGI before the other one, a race between the companies over who advances capabilities and uses that to raise more venture capital. And just to name a through line of the prediction you're making: it's the centrality of the race dynamic that runs through all of it.
So we just want to speak to the reality for a moment that all of this is really hard to hear, and it's also hard to know how to hold this information. I mean, the power to determine these outcomes resides in just a handful of CEOs right now. And the future is still unwritten, but the whole point of AI 2027 is to show us what would happen if we don't take some actions now to shift the future in a different direction. So we asked Daniel what some of those actions might look like.
So as part of your responses to this, what are the things that we most need that could avert the worst outcome in AI 2027?
Daniel Kokotajlo: Well, there's a lot of stuff we need to do. My go-to answer for the short term is transparency. Right now, again, AI systems are pretty weak. They're not that dangerous right now. In the future, when they're fully autonomous agents capable of automating the whole research project, that's when things are really serious and we need significant action to regulate and make sure things go safely. But for now, the thing I would advocate for is transparency. So we need to have more requirements on these companies to be honest and disclose what sorts of capabilities their AI systems have, what their projections are for future AI systems' capabilities, what goals and values they're attempting to train into the models, any evidence they have pertinent to whether their training is succeeding at getting those goals and values in, things like that, basically.
Whistleblower protections, I think I would also throw on the list. So I think that one way to help keep these companies honest is to have there be an enforcement mechanism basically for dishonesty. And I think one of the only enforcement mechanisms we have is employees speaking out basically. Currently, we're in a situation where companies can be basically lying to the public about where things are headed and the safety levels of their systems and whether they've been upholding their own promises, and one of the only recourses we have is employees deciding that that's not okay and speaking out about it.
Tristan Harris: Yeah. Could you actually just say one more specific note on whistleblower protections? What are the mechanisms that are not available that should be available specifically?
Daniel Kokotajlo: There's a couple different ... One type of whistleblower protection is designed for holding companies accountable when they break their own promises or when they mislead the public. There's another type of thing which is about the technical safety case. So I think that we're going to be headed towards a situation where non-technical people will just be sort of completely out of their depth at trying to figure out whether the system is safe or not, because it's going to depend on these complicated arguments that only alignment researchers will know the terms in.
So for example, previously I mentioned how there's this concern that the AIs might be smart and they might be just pretending to be aligned instead of actually aligned. That's called alignment faking. It's been studied in the literature for a couple years now. Various people have come up with possible counter strategies for dealing with that problem. And then there's various flaws in those counter strategies and various assumptions that are kind of weak. And so there's a literature challenging those assumptions.
Ultimately, we're going to be in a situation where the AI company is automating all their research and the president is asking them, "Is this a good idea? Are we sure we can trust the AIs?" And the AI company is saying, "Yes, sir. We've dotted our i's and crossed our t's or whatever, and we are confident that these AIs are safe and aligned." And then the president of course has no way to know himself. He just has to say, "Well, okay, show me your documents that you've written about your training processes and how you've made sure that it's safe," but he can't evaluate it himself. He needs experts who can then go through the tree of arguments and rebuttals and be like, "Was this assumption correct? Did you actually solve the alignment faking problem, or did you just appear to solve it? Or are you just putting out hot air that's not even close to solving it?"
And so we need technical experts in alignment research to actually make those calls, and there's a very small set of people in the world who can, and most of them are not at these companies. And the ones who are at the companies have a sort of conflict of interest or a bias. The ones at the company that's building the thing are going to be motivated towards thinking things are fine. And so what I would like is to have a situation where people at the company can basically get outside help at evaluating this sort of thing, and they can be like, "Hey, my manager says this is fine and that I shouldn't worry about it, but I'm worried that our training technique is not working. I'm seeing some concerning signs, and I don't like how my manager is sort of dismissing them, but the situation is still unclear and it's very technical."
So I would like to get some outside experts and talk it over with them and be like, "What do you think about this? Do you think this is actually fine? Or do you think this is concerning?" So I would like there to be some sort of legally protected channel by which they can have those conversations.
Tristan Harris: So I think what Daniel's speaking to here is the complexity of the issues: AI itself is inscrutable, meaning the things that it does and how it works are inscrutable. But then, as you try to explain to presidents or heads of state the debates about whether the AI is actually aligned, it's going to be inscrutable even to policymakers, because the answers rely on such deep technical knowledge. So on the one hand, yes, we need whistleblower protections, and we need to protect those who have that knowledge and can speak for the public interest so they can do so with as much freedom as possible, so that they don't have to sacrifice millions of dollars of stock options. And Senator Chuck Grassley has a bill that's being advanced right now that CHT supports. We'd like to see these kinds of things, but this is just one small part of a whole suite of things that need to happen if we want to avoid the worst-case scenario that AI 2027 is mapping.
Daniel Barcay: Totally. And one key part of that is transparency, right? It's pretty insane that for technology moving this quickly, only the people inside of these labs really understand what's happening until day one of a product release where it suddenly impacts a billion people.
Tristan Harris: Yeah. So just to be clear, you don't have to agree with the specific events that happen in AI 2027, or whether the government's really going to create a special economic zone and start building robot factories in the middle of the desert covered in solar panels. The question, however, is: are the competitive pressures pushing in this direction?
Daniel Barcay: Yeah.
Tristan Harris: And the answer is 100% clear that they are pushing in this direction. We can argue about whether governments are really going to take responses like that, because there's been a lot of institutional decay and less capable responses may be what we get; however, the pressures of competition and the power that is conferred by AI do point in one direction. I think AI 2027 is hinting at what that direction is. So I think if we take that seriously, we have a chance of steering towards another path. We tried to do this in the recent TED Talk. If we can see clearly, clarity creates agency, and that's what this episode was about, it's what Daniel's work is about, and we're super grateful to him and his whole team. We're going to do some future episodes soon on loss of control and other ways that we know AI is less controllable than we think. Stay tuned for more.