OpenAI Launches Codex: An Autonomous Programming Agent

OpenAI just launched Codex, a brand-new coding agent that can build features and fix bugs autonomously. We’ve been testing it at Every for a few days, and I’m impressed. I invited Alexander Embiricos, a member of the OpenAI product staff responsible for Codex, to demo the agent live on a special edition of AI & I. We talk through:

- What Codex is and how it works. Codex’s UI allows developers to see the list of tasks the agent is working on, how many lines were changed for each, and the status of the PR. It’s built for the senior software engineer who wants to delegate and review tasks efficiently.
- How OpenAI is thinking about agents. Codex is one piece of a unified super-assistant OpenAI wants to eventually build: an agent that helps users easily get things done by selecting the right tools for them behind the scenes.
- Why an “abundance mindset” is best for interacting with agents. Codex is designed to allow users to delegate many tasks at once without getting caught up in the details. This lets you point an abundance of agents at a specific task, like a difficult bug; it’s worth it even if only one of them succeeds.
- OpenAI’s vision for the future of programming. In the future developers will probably spend less time writing routine code and more time guiding agents, reviewing their work, and making strategy decisions. Programming will become more social, letting teams easily delegate multiple tasks at once, allowing people to focus on ideas and collaboration instead of routine coding.

Timestamps: Introduction: 00:00:52

Published: May 16, 2025
Uploaded: Jun 12, 2026
File type: POD
Source: share.transistor.fm

Full transcript

AI-generated transcript with timestamped sections.

0:00-1:52

[00:00] Huge news! OpenAI is launching a new coding agent. It's called Codex. Codex is a web-based software engineering agent that's designed to work pretty much autonomously, so you can have it working on many different features and bugs in parallel while you watch. I've been using Codex for a couple days, and it's pretty great. So I invited Alexander Embiricos, a member of the product staff at OpenAI who's responsible for Codex, to come on the show and talk to us about it. We're going to demo Codex together and talk through all of the product decisions that led to its release. [00:28] We walk through Codex screen by screen, and he'll talk to us about how OpenAI is thinking about agents and what OpenAI's vision is for the future of programming. [00:36] Let's jump in. [00:52] Alexander, welcome to the show. [00:54] Hey, thanks for having me. [00:55] So for people who don't know, you are a member of the product team at OpenAI, and you are one of the people responsible for building Codex, which is a new programming agent that is launching today. Well, we're recording today. [01:07] Technically, we're recording yesterday, but it's launching today, Friday. I'm super excited, but I got to try it. I have a review coming out along with this podcast, and I just wanted to go through it with you and understand all the things that you put into it and all the ways that you're thinking about it. So thank you for coming on the show. Yeah, excited to be here, and thank you for being an early tester. [01:26] Awesome. So what I want to do is just like give people right off the bat a sense for like actually what it looks like. And like we can sort of talk about it together. So this is Codex. Terminate processes. Great. This is Codex. And it's a coding agent, right? So I can type in a task like, please replace the headline on the homepage with "Codex is out now." And I can just press code.

1:56-3:40

[01:56] like an agent experience. Like it just adds it as a task. So it's like, you're, it's basically like built to make me just like, [02:04] create a bunch of tasks and then not look at them, more or less. That's how I feel about it. So tell me about this screen and how you thought about this and why you did it like this. [02:15] Yeah, so I'm so excited that you're screen sharing this and showing people so we get to nerd out about all the details. But, you know, for anyone listening, basically Codex is a cloud-based software engineer that can work on many tasks in parallel. And there's a ton of AI tools out there, I feel like. [02:32] Programming with AI is completely different now than it was like two years ago. And, um, [02:36] this form factor that we're looking at is kind of the beginning of us thinking about what is it going to look like even like a year from now today, um, and [02:43] The big shift is... [02:44] Today, when people are accelerated working with AI, [02:48] uh, [02:49] a lot of it is like very, like a very tight feedback loop. It's like very collaborative, like a lot of like really good tab completion, right? And like really good chat. And, um, [02:57] sort of where we see things going is we have these really awesome reasoning models. [03:03] And we want to give them time to think. [03:05] And we want to let them use tools safely so they can do more stuff, for example, like running commands to execute tests to be more sure of their changes, [03:12] and then iterate. And so we basically want to give the agent its own computer to do work in. [03:19] And so as we started thinking about that, you kind of realize, okay, this feels a bit like delegation. [03:26] And I'm sure we'll get into this, but we tried a bunch of different form factors for what it would feel like to delegate to an agent, including maybe the most obvious one, which is just talking to ChatGPT and having it do stuff. But the short version of it is what we realized is that

3:41-4:59

[03:41] Engineers wanted like this really functional tool to be like very efficient [03:45] in how they delegate. And so the thing you see here, it's very structured. You've got the list of tasks that you have. You can see how many lines were changed in each. You can see what the status of the PR is, which is ultimately the thing you're going for. [03:56] And so this ended up just being a thing that we iterated our way to [04:01] that we thought was the best first version to ship. [04:03] But actually, like, you know, I would love it if we end up talking a bit about what it might look like in ChatGPT as well, because we definitely plan to bring it there. [04:09] I would love to do that, and I have a bunch of follow-ups on that specific point about the ChatGPT versus what this is point. So to give people a sense for how this actually looks, I did a couple things yesterday. I actually shipped a feature on one of our products that I never... [04:28] I've never coded in. I've never touched that repo. And I shipped a feature to prod yesterday. Kieran is the GM of that product, and he's a really, really talented engineer who's also an early tester of Codex. He was watching me the whole time to make sure I didn't mess anything up, but it was really cool. And basically, to show people what it looks like, [04:50] We have this UI in... [04:54] This is off-tack one. [04:56] Ah.

5:02-6:32

[05:02] You're looking for the diff, right? [05:03] Yeah, where is it? [05:05] You should be able to click. There you go. Yeah. Nice. By the way, this UI that Dan's showing us has been editing, changing massively every single day as we race towards launch. So thanks again for testing early. This may not be what you actually see or depending on when you're listening to this, it definitely is not going to be what you see. But for now, for people who are listening, basically, I clicked into one of the tasks and you can see basically for the feature I wanted to build, we have a view that I wanted to create a persistent collapsed state [05:35] to be able to collapse a particular view. And I just said, do that, basically. And you can kind of see a log of all the things it thought while it was doing the task. And then it gives me a summary of what it did. I can see a diff. And I can also just push that right to GitHub. And one of the things that I notice is the summary is very concise. And in talking to Kieran about the code that it generates, [06:05] um, [06:06] the code seems very minimal or like terse. Can you tell me a little bit about how you thought about making this thing code and talk about what it makes? [06:17] Yeah. [06:19] It's a great question. So, uh, [06:22] Codex, the agent, runs on a custom model that we train just for this product, [06:26] called codex-1. codex-1 is a version of o3 that's optimized for real-world software engineering.

6:32-8:03

[06:32] But not only optimized for real-world software engineering, but also optimized for this form factor, which is, hey, it's going to go off and do a bunch of work, and then you're going to get [06:40] a diff back. And, you know, once you're using this a lot, actually, you're going to be getting many diffs a day back. And, you know, I'm sure this thing we talk about a lot with even human software engineers, like, would you rather review, like, 10 100-line diffs [06:53] or 20 50-line diffs or one 1,000-line PR? [06:57] Definitely the thing is you don't want to review the 1,000-line PR. [07:01] As we trained this model, we both wanted to make sure it was great at coding. So the model slightly outperforms o3 actually on evals like SWE-bench. But more importantly, actually, we wanted it to produce code that was really mergeable and then talk about the work that it did [07:16] in a way that was like actually really reviewable for the human reviewer. So we put a lot of effort into style here, you know, making sure it's not doing things like adding extraneous comments, making sure that the style that it's using is actually like the style in your code base and not like its own style that it has and thinks is correct. Okay, and so getting to your point about the PR descriptions, you know, early versions we've had of tools like this or models that did [07:46] bit of the story, like in our UI, we would like show you kind of the model's thoughts first, and then we would show the diff underneath. And whenever like a full-time software engineer was trying to use this a lot, they were always like, man, I just want to see the diff. Like, I don't want to read the description until I see the diff. And if the diff is good, then maybe I'll look at the description. Um,

8:04-9:35

[08:04] And so we kind of ended up realizing like, hey, there's actually like kind of a spec for what a good PR description looks like. And that's like pretty concise and only maybe explaining like at a high level what it is, the non-obvious parts. Like you definitely don't want to duplicate content that's in the diff. [08:16] Um, [08:18] Then the other thing we realized is that [08:20] Because it didn't run on your computer, you didn't necessarily see each iterative step. And so you really want to know, [08:25] how validated is this change and how did the model try to validate the change. And so you can see here on the left side, the model is also describing the testing that it ran. [08:33] Um, [08:34] And here, we put a lot of effort into, first of all, just making sure that it's never hallucinating, but also... [08:39] making sure that it tells you when tests passed and when tests failed, which is super important, in a concise way. And then lastly, perhaps the thing I'm most excited about, if you mouse over that little terminal icon next to the testing, [08:51] Uh, [08:52] Yeah, so it's sort of like the last thing on the third line there. [08:56] So the little icon is like a caret. Yeah. So we'll actually, the model learns how to cite its output. [09:01] And so it's not just saying like, hey, here's a test I ran. You know, in the case of tests that passed, which it didn't here, you know, it's not saying like, hey, the test passed. Believe me. [09:12] It's saying, hey, the test passed. Here's like a deterministic citation that we did to actually pull the part from like the log output. So you can review it yourself and be super sure. And in this case, the test it tried to run failed. And it's kind of telling you this because maybe you would want to run it yourself or like upgrade the environment it's operating in so that those tests can be run. And it's giving you the output there so you can like go and review it and improve it.
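
As a rough illustration of the citation idea described above, here is a minimal Python sketch in which a claim about a test run is backed by a pointer into the captured terminal log, and the evidence is copied verbatim from that log instead of being paraphrased. The `LogCitation` type, the `resolve_citation` function, and the sample log are all hypothetical; the conversation does not describe how Codex actually implements citations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogCitation:
    """Hypothetical pointer to a span of lines in a captured terminal log."""
    start: int  # 1-indexed first line, inclusive
    end: int    # 1-indexed last line, inclusive


def resolve_citation(log: str, citation: LogCitation) -> str:
    """Return the cited lines verbatim, copied from the log, never paraphrased."""
    lines = log.splitlines()
    if not (1 <= citation.start <= citation.end <= len(lines)):
        raise ValueError("citation points outside the log")
    return "\n".join(lines[citation.start - 1:citation.end])


# Made-up log for illustration only; this is not real Codex output.
EXAMPLE_LOG = "\n".join([
    "$ npm test",
    "sh: 1: jest: not found",
    "npm ERR! Test failed.",
])

claim = "Tests could not be run in this environment."
evidence = resolve_citation(EXAMPLE_LOG, LogCitation(start=2, end=3))
print(claim)
print(evidence)
```

Because the excerpt is looked up by position, a reviewer can check the claim against exactly what the terminal printed, which is the property the speaker calls deterministic.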

9:35-11:13

[09:35] That's really interesting. This particular repo, so I actually, the one we're looking at, I kicked off the collapse state task in the wrong repo at first. So this is the wrong repo. This is an internal app we built, and so it does not have any tests. [10:05] But yeah, okay, that makes a lot of sense. One of the things that I notice about... [10:10] this... [10:11] that feels different from like a Devin, for example, is I don't see this, it has an environment that it uses, that it sets everything up and it runs a test and all that kind of stuff, but it doesn't have a browser, for example. So it's not going and like logging in to the app to check if it's working or not. Like, tell me about that decision. Why'd you do it that way? [10:29] Yeah, I would say that's more of a just like sequencing decision rather than like a decision. Like ultimately, kind of like the thing that we view ourselves as building like super long term is just like one super assistant. And it has a bunch of tools and you don't even have to decide, [10:46] you know, if you even want to use code mode, or if you want to use, like, I don't know, newsletter publisher mode, you just have one assistant, ChatGPT, and you can just ask it for stuff, and it'll use the appropriate tools. And it'll also just like answer quickly, if that's the right thing, or it'll like do work, if that's the right thing. So, [11:01] Long-term, that's where we're trying to get to, just, like, one thing with all the tools. But what we're starting with is kind of, like, hey, we want to ship a research preview, we want to, like, iteratively deploy this out in, like, small steps as capabilities grow, and we're going to start, like,

11:14-13:08

[11:14] the smallest, most constrained thing possible. So, for instance... [11:18] It doesn't take multimodal inputs yet. We'd obviously love to add that. It doesn't have a browser. That would obviously be useful for validating front-end changes. It doesn't even have network access. So the way that it runs is, so when you kick off a task, you're going to have a task. [11:32] We set up a container, [11:34] pull the repo into that container. And then with network access, we run some set of commands that you wrote. So it's not the agent running those commands. It's you writing those commands. And so that can pull in any dependencies, like npm i, npm install, or whatever you need. And then actually we cut off internet access. And I think we updated the logs recently to show that. So if you were to click into logs and scroll to the top, you would see internet access all the way at the top. [11:57] And then you would see no internet access, [12:01] Yeah. So we ran the environment setup that you put in there, and then we turned off internet, and then the agent runs. And that's just for [12:07] Absolute maximum safety. [12:09] You know, I haven't actually ever seen it happen, but you know, there's like a theoretical risk or a tail risk that there could be like some exfiltration. Maybe the agent wants to go to Stack Overflow and ask a question about some code and it pasted it. Right. That's a theoretical scenario. And so we're starting really small in scope. And then over time, we'll start like adding more and more capabilities. Got it. And that actually brings me to my next thing, which is one of my gripes about this, which is it's not, [12:33] like, [12:34] The fact that it's not in ChatGPT means that it feels like a very one-shot type [12:41] user experience, which to me makes it feel very like it's definitely for senior engineers where you can kind of like see the entire thing you need to build in your head and you just type it out and then it's done. 
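
The task lifecycle described earlier in this segment (container created and repo pulled in, user-written setup commands run with network access, internet cut off, then the agent runs) can be sketched roughly as follows. This is a toy simulation under stated assumptions: the `Sandbox` class, `run_task`, the repo name, and the commands are hypothetical, nothing is actually executed, and it is not how OpenAI implements the environment.

```python
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """Hypothetical stand-in for the per-task container; commands are only logged."""
    repo: str
    network_enabled: bool = True
    log: list[str] = field(default_factory=list)

    def run(self, command: str, author: str) -> None:
        # Record who issued the command and whether the network was reachable.
        net = "internet access" if self.network_enabled else "no internet access"
        self.log.append(f"[{net}] {author}: {command}")


def run_task(repo: str, setup_commands: list[str], agent_commands: list[str]) -> Sandbox:
    sandbox = Sandbox(repo=repo)        # 1. container with the repo pulled in
    for command in setup_commands:      # 2. user-written setup, network still on
        sandbox.run(command, author="user setup")
    sandbox.network_enabled = False     # 3. internet access is cut off
    for command in agent_commands:      # 4. the agent works offline
        sandbox.run(command, author="agent")
    return sandbox


sandbox = run_task(
    repo="example/repo",  # placeholder name, not a real repository
    setup_commands=["npm install"],
    agent_commands=["npm test"],
)
print("\n".join(sandbox.log))
```

The ordering is the point of the sketch: dependencies are fetched only by commands the user wrote, and every command the agent issues happens after the network flag is off.
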
Whereas for me, for example, like I might want to like go back and forth and chat with it to like figure out, okay, like what is it that I'm actually trying to build? Or in this case, in this particular session, I asked to build a feature and then I realized after I built it

13:11-14:41

[13:11] asked it a follow-up. And in the follow-up case, it didn't work quite as well as the one-off, one-shot thing. So tell me about that. [13:20] Yeah, it's really interesting. Like, delegating to agents... [13:24] requires like [13:26] A bunch of like mindset updates that actually I think will take time for people to learn how to get the most use out of them. [13:31] And we'll like, I think, end up temporarily in a state where we kind of have tools that feel more like you're delegating and tools that feel more like you're collaborating. And then our goal, at least, is bring those together so you don't think about it. [13:41] Again, I think about our future super assistant. I don't think about it so I'm delegating. I just ask it for stuff, and it just does stuff. [13:49] But people are pretty used to working collaboratively with tools locally on the computer right now. And in fact, we have the Codex CLI, which is exactly that. So you can do it back and forth with CLI, have it access to some commands. It'll ask you for permission because it's on your computer, but you can kind of sit with it and kind of pair program with the CLI. [14:05] So, [14:06] A little bit about the mindset shift that we've seen people who get a lot of value out of Codex, like the agent that you delegate to. [14:15] I'll just share for, because it's, you know, kind of interesting here. One of them is like an abundance mindset. [14:20] So something that we've seen a lot of people who are using this a ton do is they just fire off many, many, many tasks without thinking actually too hard about whether it's going to work or, you know, whether it's like perfectly described or not. And whenever you need to do something new, you just kind of fire off a new task. And so, for example, a really common use case is like on-call triage.

14:41-16:11

[14:41] It's like, [14:41] Maybe you don't know... [14:43] what the bug is. You know, you were talking about how like, oh, like I've noticed it works best if I can kind of see through the full arc of the feature. Like you don't need to know that. You just fire it. And then sometimes it's like, [14:53] hey, here's like the exact fix. And you're just like, wow, like amazing. Not only did I save myself time, but that bug was fixed faster than I could have ever gotten to it otherwise. Other times, maybe it just like, [15:04] provides a draft and then you have to land it in your own time. And other times it like just maybe finds where the bug is or helps you just reason about it. [15:11] And so it's kind of like this idea of like, this is a service to me that accelerates me, the developer. And [15:17] I'm going to use it without necessarily always knowing if it's going to like [15:21] provide a mergeable PR or not. That's like on-call triage as an example. Another like very analogous place we see people use it is like, [15:29] Um, [15:30] Every morning, people will kind of think about what they want to do. And then they're at home, they'll just be on their phone, fire off some tasks. We have mobile support coming soon. Well, we already support mobile web and mobile app on iOS. We'll support soon. So they'll just be like, this is what I want to do. Fire off some stuff, then get to their desk, [15:47] and look at what they want to use. [15:49] Um, [15:50] So it is quite one-shotty in the way people use it, like those two examples I gave you. So I think you're right about that. And so we've started poking at this, like does it need to be one-shotty? And so you saw in the UI there's, [16:03] When you send a task, you can choose if it's going to be in ask mode, [16:06] or code mode, [16:08] And code mode is definitely the thing we've spent the most time on.

16:11-17:45

[16:11] We're starting to experiment with making ask mode faster, [16:15] Um, [16:16] And so, [16:17] Like, you know, one way that this could evolve potentially is where, you know, [16:22] When you want to do what I just described, like fire off a one-shot or a few-shot thing, you basically ask code mode. But if you want to be a little bit more exploratory, you can try ask mode. [16:31] And so I don't know if you tried this, but one thing you could do is ask for some recommendations using ask mode. [16:37] And that should be relatively fast. It'll probably finish pretty quickly during this call. And then we could try kicking off some of those tasks if you want. [16:44] Okay, so what recommendations should I ask for? Yeah, like, I don't know, is there like a class or that you find very complicated that you would love to refactor [16:52] in some code somewhere? I wish that I was familiar enough with this code base to tell you. Okay, what about just the... [17:00] suggest, let's run two prompts. And again, this is abundance mindset. Let's run one like suggest places to add documentation. [17:07] Um, that's a safe one. Actually, a lot of people use it for documentation and it's really good at documentation. And then let's do another one, which is like, [17:15] find some bugs and suggest some tasks to improve the bugs or to fix the bugs. [17:21] Cool. I think that this is really interesting. I want to pause here for a sec, like the sort of abundance mindset that you're talking about where [17:29] um, [17:30] Yeah, you can just direct intelligence at a problem and you don't have to have it that well formed. And if it doesn't work, it's fine. And that's just a different way of thinking about how to program or how to manage a resource that I think is really interesting.

18:00-19:32

[18:00] stepping through it step by step, but it's like, it's not every time. And that's just an interesting shift. It's a different skill, [18:08] it seems like. Yeah, it's like, [18:10] I think... [18:11] I figure you probably use ChatGPT Deep Research on occasion. [18:14] Right. And like, [18:16] If I think about how some of my habits have changed, I deep research a lot of things that I wouldn't have spent, I don't know how long it would take me to do a deep research, but probably quite a long time, like hours. And now I read deep researches on custom queries that I have way more often. It kind of lets me be more curious in a way. [18:34] Yeah, like, you know, many things a day, basically. And I think there's kind of a similar thing here with agents in general. [18:40] Yeah, it's like you find you have more questions to ask when it's really easy to ask a question to get a really good answer. You know, I find that too. I'm just, I've learned way more because of o3 and Deep Research and just like AI in general. [18:53] What about like, I feel like there's a tension here that I sense in this product and I've sensed myself in trying to build these things. There's a sort of tension between... [19:02] making something good for a particular situation, like, for example, like, you know, [19:07] uh, this is great for picking off a bug or feature, um, [19:12] But you kind of get this like loss of flexibility, where it's not, for example, [19:17] It's not nearly as good at just doing follow-up questions, and ChatGPT is just great at that. And so how do you think about that when you're designing something like this, the sort of trade-off between specializing and the brittleness or loss of flexibility that you get, and how do you manage that?

19:32-21:04

[19:32] Yeah, I think the long-term goal, I'm sure you hear this a lot, but it's the G in AGI. It's to build something pretty general and build. [19:42] Um, [19:43] A lot of specialization, at least when it comes to like building UI, is a bunch of design decisions that we have to make. But if you're just talking to an entity, it can make a bunch of decisions about the right behavior and just kind of be general for you. Right. And so, [19:56] kind of the [19:57] Where I see things like this going and maybe just like working with agents in general is you can have a conversation with an agent and that should be an incredibly general thing. Like I was mentioning earlier, it shouldn't only be for coding. It should just be an agent. And it should be able to work across any modality, whether you're in your car or at your desk. [20:15] Um, [20:16] So that thing stays general. [20:18] And it should be able to do all the things it needs using tools, et cetera. [20:22] But, [20:23] When we're professionals doing a specific thing, like right now we're recording this podcast, and this is some pretty specialized UI, but it's really good. It would be quite annoying to have to use a chat interface to mute myself or adjust my volume here. And so kind of the way I think about it is there can be nearly an infinite number of bespoke, possibly even AI-generated UIs [20:42] that serve the specific purpose of the user who knows exactly what they want. [20:47] And so, like, for me, like, the most interesting thing about this Codex research preview isn't actually the UI, although don't get me wrong, we've spent a little bit, like, rewriting it massively as we speak, but... [20:58] The most interesting thing is actually... [21:01] training a model that's designed to work more independently,

21:04-22:35

[21:04] and make the most of that, asynchronicity, and its own compute environment, and then building the actual compute environments that we can scale up and let people set up the agent to have the tools that it needs. So once you have those two things, the model and the compute environments, we can then start bringing that [21:21] that UI to everywhere. So, you know, I would, like I mentioned, I would love for that to be more readily available. Just like if you're talking to ChatGPT, um, [21:29] I would love for that to be available also closer to where developers spend their time, right? So if you spend a lot of time in terminal, then like the Codex CLI would be a perfect place for you to be able to like... [21:38] you know, delegate to that agent. Similarly, if you spend time in your editor, or in your CI, or in your issue tracker, like all these places should be places where the agent is just sort of like ubiquitously available for you to delegate to it. [21:49] Well, so let me just push on that a little bit. So in a perfect future, basically, I'm talking to ChatGPT, and ChatGPT is the one that's calling Codex and telling it to, okay, go do this thing, and then handling the follow-ups to be like, actually, I want you to do it a little bit better. And I'm maybe not even talking directly to it. ChatGPT is almost like the middle manager. [22:11] Um, yeah, that, that, I think that is one possible future. [22:15] And I think that is a future that [22:18] you might use if you're like on the go, like on your phone, not at your desk, or you know, let's say I don't spend like a ton of time like on marketing but I want to like interact with some like marketing tools, like maybe I'd go through chat because I don't have mastery of the actual like underlying systems.

22:35-24:06

[22:35] But then if I'm like a full-time person who spends a lot of time like coding or, you know, whatever thing that it is that I'm doing, I probably have a lot of like mastery that I've learned over time. And that also brings me great joy in my life, hopefully, to spend time with these tools. And so then I would like go straight to the tool and the AI should be like just available in that tool. [22:56] Yeah. So I totally get that, especially for the interface layer and having specialized interfaces for specialized tasks makes total sense. What about the model layer? Obviously, the goal is to have a perfectly general model that no matter what you ask it, it can just fulfill the request. But also, it seems like, for example, when you did reinforcement learning on this model to make a better, more senior software engineer type personality. But even within that, this model clearly has a different personality than Claude Code, [23:26] or the underlying model for Claude Code. [23:29] And so how do you think about that? Even if you're trying to be as general as possible, there's always these sorts of trade-offs between making it good for that thing and making it generally useful, [23:41] you know, its flexibility in other areas or its personality. [23:45] Totally. Yeah, I mean, [23:48] I feel like at OpenAI, at least, we've kind of done this like a few times now where... [23:53] We've shipped ChatGPT, a ton of people use it. [23:56] And then we get some feedback from a specific audience we care about. So like with GPT-4.1, developers, and we just, you know, spend a bunch of time with them.

24:06-25:57

[24:06] this is a different team, but, you know, they spent a bunch of time like talking with developers, understanding their feedback, and kind of like creating like a different set of evals that we realized we like wanted to improve on, we cared about. And so then we had this decision, right? Like, do we try to, you know, get a bunch of like improvements into the mainline model, [24:25] which takes time and has a bunch of trade-offs? Or do we kind of want to speedrun it on the side, ship a separate model, [24:31] and then reintegrate some of those changes. And I think our belief as a company is that there are going to be times where we want to speedrun something, [24:39] often just maybe even for learning purposes, [24:41] just to like have the flexibility, like really go deep with a set of customers and like to kind of like generate the evals that they have in their heads. [24:49] And then from there, ship it, validate, and then take those configs or those changes that we're making to the model and re-mainline them. And it seems like whenever we can do that, the whole overall system gets better and maybe even more powerful than a custom side thing that we've done. [25:04] That's really... [25:05] Yeah, go ahead. [25:07] And what that makes me think about, because one of my other little gripes with this is it reminds me kind of of Operator, [25:14] where [25:15] this cool thing, but it's like separate from ChatGPT. And like, it's, [25:19] I don't think that you guys have released an Operator update since the first one. And so it just feels like it's hanging out there and not necessarily getting improved. I'm not using it as much as I am using base ChatGPT. And so... [25:33] Is that kind of the like one version of the vision is obviously that Codex stays as a separate thing. But then you integrate all the learnings back into like the base o3 that I use inside of ChatGPT. And that's how everything coexists together. 
How do you kind of make sure that you're not building like this balkanized thing that's kind of forgotten about in three months or whatever? Totally. Yeah, man, I have so many, so many fun thoughts here.

25:58-27:28

[25:58] One thing I'll say is that [26:00] uh, [26:01] Yeah, we have... [26:02] a lot of [26:04] So these capabilities that we've talked about being different and separate, [26:09] are coming together. And like, that's the thing we think a lot about. And it's, [26:13] I'm very jazzed about it. So excited to share more as that happens. But let's take Codex. Um, [26:21] the, uh, [26:22] In my mind here, there's kind of like, [26:25] two interesting things. The first is what we said, which is, hey, anything we learn as we build this Codex model, [26:32] is like really helpful to us just as we think about agents in general and like, you know, like this idea of like, say, like citations, right, or like describing your work really efficiently for the user to read after you've done a lot of work, [26:44] like tuning the steerability so that it like kind of makes the right amount of assumptions, which I'm not saying we nailed perfectly, but like that's a theme, like that's a generalizable thing for agents and like something that's going to be useful in general. [26:57] Similarly, um... [26:59] And I don't know if you have this in your briefing, but we are planning to release an update, an updated small model for use in the Codex CLI. [27:08] And, you know, there's a lot of, I mean, it's pretty obvious, right? There's a lot of overlap between these two models. And then actually, you know, a lot of the learnings that we have from like both of those Codex models, like we plan to use even in like mainline models, right? So there's a lot of that generalizing that comes back. [27:22] I feel like your question also, though, is a bit about user experience and having to go to Operator or having to go to Codex. And so like,

27:29-29:14

[27:29] You know, I think, [27:30] and Codex is kind of interesting because it's maybe one of the first... [27:34] products or features we've built into ChatGPT where we have [27:38] a professional audience that like spends a lot of time in their tools every day. [27:43] You know, like a lot of how, like, ChatGPT overall is quite a general tool. And I think this is like maybe some of the first time that we're getting like fairly specialized. And so my view here is that like, [27:53] we're not exactly sure what this should look like. I think like ChatGPT a year from now, where we have more of this, will look fairly different from ChatGPT today. [28:00] But the goals we'll have in mind will kind of be these things that I've been saying, which is like, I think we're trying to build a generalized assistant and we cannot lose sight of that. But we should give ourselves the room to give users [28:11] the power UI that they want. So for instance, just to give you an example, going back to your opening question about the UI, [28:19] it's a list, right? And it's a list and it has features like Archive. [28:23] Like, why do we have that? Well, it turns out that [28:27] people internally who use Codex use it a ton every day. And when they're using it, [28:32] and it's just like mixed in with chat, it's a little hard to find your things. And then, you know, there's kind of this workflow you develop where basically you're like, hey, I have an idea. Like, for example, like, so I'm on the product team. I mostly do small changes. People make fun of me affectionately at work. But you know, the string should be different. Or there's this bug here, right? And if the bug's really hard to fix, I'm like, that's vegetables. I'm not going to eat [28:57] It's a small niche too. Yeah. But if it's like an easy bug, you know, I'll wham it. I'll explain what the word wham is later, but that's our code name for the project. 
And so, you know, I'm whamming a bunch of stuff and then I don't want to track it separately. Like I just want to have this list of things. And then like when they're done, my workflow is basically look at each one,

29:14-30:50

[29:14] and make a decision on if it's mergeable, or if I want to pull it open to my computer, test it, maybe change it, or if I'm just like, this is over, I don't have time to do this. And so that workflow [29:24] really needs that UI of like, did you open a PR? What is the status of that PR? And the ability to archive so that I'm only looking at my current tasks, basically. [29:33] So that's just an example of like the power user UI that, [29:36] um, [29:37] That makes a lot of sense. Yeah, we feel it's important. Yeah. Yeah, I feel like chat is only appropriate when it's incredibly underspecified what you actually want to be doing. Whereas as soon as you know exactly what you should be doing, like chat is not really the right interface. [29:50] I'll add one thing to that, which is actually the other cool thing about chat is that, you know, [29:55] it unlocks a lot of emergent behaviors that are [29:58] just mind-blowing, right? So when... [30:01] I'm obviously really excited to bring us back into chat as well. But one experience that I had when we prototyped in chat was... [30:10] I had a... [30:12] I had asked for a change on the front end. And actually, when I was getting started on this project, I had spent like five years, including before OpenAI, just like writing native macOS apps. [30:21] uh, [30:22] And so I'm not super familiar with the latest... [30:26] Tailwind, whatever stuff. So I asked for this change and a bunch of classes were added inline. I'm like, I don't even, what is this class? So I asked, but we were in chat. So I just replied with, [30:36] What is this? Can you explain it to me? And because we were in chat, and the context of like what was changed was just like inline in chat, chat replied quickly, which was the thing you were saying is one of your gripes, that this doesn't reply as quickly, right? Chat replied quickly. And it asked me, do you want me to generate an image for you?

30:51-32:36

[30:51] And I was just mind blown. And I said, absolutely. And then it generated an image and, you know, I was very happy. And so I think, [30:58] I think it's important that we always make sure that the overall assistant has access to these tools so we can experience things like this and then work those into the more functional, [31:06] like, UIs as well. [31:07] How do you feel about, and with your own personal experience and what you see internally, or I don't know, other beta customers or alpha customers that are using this, like, how this changes what it is to be a developer, what the developer experience is? And I'll give you an example. In my testing over the last couple days with Kieran, again, he's the GM for Cora. He's, you know, super technical. I'm like, technical, but like, in an I can just build stuff to make it work kind of way. [31:37] We were on a call. We were both screen sharing. We both had a bunch of agents up. And we were just like talking about how we wanted to make the product better, Cora better. And then he would just say, okay, I want you to like, you know, build this little thing. And I'd be off being like, I want to fix this little bug. And both of us have like these different agents all working on different features and bugs all at the same time. And we're just chatting while we do this. Like we have the brain space to be able to like talk about what's going on. You know, an agent would come back, [32:07] and we'd be like, is this good? Is this not good? And then either edit it or merge it. And it was just like a very different, it was like social coding in this like weird way where we're just like, [32:16] chatting together and work is getting done, which I've never really experienced before, and feels like a new model for building things that was not really possible previously. 
And I'm kind of curious if that resonates with your experience, or if there are other things that you're noticing about how this changes what it is to be a developer, what developing looks like.

32:37-34:07

[32:37] Yeah, that sounds pretty fun. [32:39] I'd even say, even before we had any, before Codex, [32:44] I've already started feeling like coding is like more social. It's like... [32:48] even if it's just me coding with an LLM. [32:52] And yeah, you know, like that startup I mentioned that I was working on, like I built the first prototype for one of the things that became it on an airplane without Wi-Fi. [33:01] And I would never do that today. Like if there was no Wi-Fi, it's just over. Like I'm going to watch a movie. I just can't do it. [33:08] And you know, I do think we're kind of on the precipice of another like change of similar magnitude. So the vision for like what [33:16] it should feel like to be a developer, [33:19] if we get things right, and maybe what it feels like in general to just be a knowledge worker. The vision in my mind is you should be [33:27] um, [33:28] able to do the work that you want to do. [33:31] And maybe that's because it's work that's difficult to automate, which is a lot of work. [33:35] Maybe it's work that's like very ambiguous, [33:38] that requires a lot of complex decisioning, or maybe work that you just want to do for fun, or you want to be creative. And that's the stuff you do. And I think it's really important that we invest in [33:48] making sure that when you're doing the work yourself, you're maximally accelerated, right? So like things like the Codex CLI that you can pair with, I think are critical for us to invest in. [33:57] But then the lion's share... [34:00] of kind of like known work [34:01] should just be done by agents that you're delegating to. [34:04] And so I think this kind of like shifts where we spend our time.

34:07-35:38

[34:07] and maybe we're spending more time in [34:11] planning or like thinking about what to do, more time in design, [34:15] and then more time in validation [34:18] as well. [34:20] Um, now I do think it's interesting, like, [34:23] right now, I think at least for the next couple months, we are going to also be spending time thinking about how to set up the environment for agents. [34:33] And so it's not like, I think, we're only going to be spending our time being creative. I think there's also going to be a lot of like, it's a bit like becoming a manager and you're thinking about, I have this team and how do I enable my team to be productive? [34:44] I mean, and it creates new problems that you still need to solve. So a really interesting example that I'm noticing is... [34:53] You can implement so many features that the features start to crowd in on each other and the product just feels like less well thought through because like you have so much throughput to just go like do this little thing and it just takes 10 minutes and then you've done that like 50 times and you're like, this product is like way too bloated and no one's like stepped back and been like, how does all this stuff fit together? [35:14] And that, I think, is... [35:16] like we need new product management or like hygiene practices for when we have so much capacity to build stuff or fix bugs, like [35:27] how do we deal with the overproduction of features almost, you know, which is a new problem. [35:33] Well, I have some takes on this. So I don't think this applies to everyone.

35:38-37:08

[35:38] But at least I'm happy to hear what you're saying in that I think like taste is like fun to exercise. [35:43] Yeah. And so like, [35:45] and also like great fun for humans, you know? So it's like, I'm excited like, oh, we can produce too many features and now we have to like choose which ones. I'm like, this is a great future. Like, wow, product person's dream. But obviously I know that doesn't apply to everyone. But for me, that sounds awesome. The other thing that I like hope happens [36:01] is that [36:02] um, [36:03] we actually just build many more software products. And I don't mean like many more features, but like many more apps. And this again is, this is just like my personal opinion, but I really like these like small, like beautifully crafted apps that have like a small number of features. [36:16] And almost the more niche the use case, the more fun it is to use them for a little bit. [36:23] So like, you know, an example in my mind that I've like poked at, but haven't really spent time on is like maybe like a texting app, but for just me and my wife only. [36:31] You know, like that would just be kind of a slightly ridiculous thing to consider doing while you have a full-time job, like a while ago. But like, I would love to have that app and like it doesn't have to have many features, but it's just for me, [36:41] right? And I think that like, [36:43] as software development gets more and more accelerated, my hope is that because we can produce more of it, we end up actually using much more, but in much more bespoke ways for people. [36:53] That makes sense. Yeah. And I think you can see that in the history of other art forms like, [36:57] you know, photography or illustration, where 150 years ago, like you had to do like woodcut blocks. And then we figured out how to like do mass reproductions of photos. And it's just, there's a whole...

37:08-38:39

[37:08] It's a similar kind of thing where now we all have a camera in our pocket or we can do Studio Ghibli-like generations or whatever, you know? Well, yeah, now we're recording a podcast together, right? Like on a specific topic. Yeah, exactly. So I want to go back. Our ask query came back a while ago. So I'm going to make... [37:27] forget about that. So we asked it to suggest documentation. And so it gave me some interesting things. It said, the repo's guidelines, I'm sorry, it's... [37:37] documenting Ruby classes with YARD and Stimulus controllers with comprehensive JSDoc blocks, see development guide, and then it just found [37:44] some places where our documentation is lacking. [37:48] Yeah, so this is, [37:49] real quick, this is cool. This is not what I was hoping you would see. [37:52] We clearly have some prompt tuning to do between now and tomorrow because we're [37:58] Can we check out the other one real quick? So this is good. [38:01] But what I would love to see in there is that it actually stubs out tasks for you. Oh, there we go. This one did it. And I think that's because we said suggest some tasks, like we put those words in the prompt. And so it's kind of similar, right? It's answering kind of the question, but it's giving you these buttons you can click to like actually do the thing if you want to do it. [38:19] Interesting. Okay, so I'm seeing, so we said, find some bugs, and suggest some tasks to fix the bugs. And it says issue number one, and it gives me some relevant code. And then... [38:27] it has a play button on the code, [38:30] which if I click it, [38:32] just does the changes that it's suggesting. [38:35] Exactly. [38:36] Right, and so this is us beginning to like,

38:39-39:57

[38:39] sort of play with this gripe that you have, right, of like the sort of the interaction loop. [38:45] And, you know, [38:46] for us to begin to think like, yeah, how can we make it so that, if you want to be more quickly like sort of collaborating with the model to figure out what to do, you can do it. And also, in a way, how can we encourage it, [38:58] how can we, you know, this is going back to, like, a lot of what we're spending our time thinking about is, like, this future where agents are writing a lot of code, like, how do we make that work well? [39:06] You probably would rather get a large refactor done in a bunch of small PRs that don't have merge conflicts and all independently compile. So for instance, a common use case is if there's a large file, you'll suggest several tasks to... [39:19] um, break this down into a smaller class, and then you can like do the refactor and like a few small buttons. So basically this is something we're thinking about, and like I think long term, like our goal is to like [39:30] converge delegation and pairing, real-time pairing, into just a single experience. That's interesting. Okay. And we only have a couple minutes left, so I really need to ask, how do you see the agent landscape evolving? So there's a lot of agents. It feels like every programmer is working on their own agent. There's the big labs, like Claude Code's got their agent. [39:52] There's a ton of startups. There's the Devins of the world, and there's a couple that are coming

40:00-41:33

[40:00] And so how do you think about this, like Codex versus every other agent, the difference in the positioning or the functionality? And then more generally, like, how do you see that market evolving? Is there like, is this sort of like a one agent to rule them all type situation? Or do you think there's going to be sort of like this ecosystem of agents with different personalities? [40:20] Yeah, I mean... [40:21] I just think this is an incredibly exciting time to be doing software development because there's just so much innovation happening and there's like so many good products out there. So, yeah, [40:31] personally just like really excited about that. Um, [40:34] I think the things that we're... [40:36] we're really focused on and uniquely positioned to do well are like, [40:39] well, the fact that we basically can train the model. [40:42] And we can go as deep as we want into making a model that's as good as possible for this specific use case. And so the things that I'm most excited about us doing are things like... [40:51] well, obviously having really good coding intelligence, but then beyond that, really thinking about, okay, but this is going to be used by an engineer delegating to many agents in parallel. How do we make that mergeable? How do we get the style right, the instruction following, the steering, have it cite [41:07] um, its work, stuff like that. And so I think like for us, like really leaning into the model is like definitely like one thing that we're going to continue to do. [41:15] And then, [41:16] I think the other thing is like, [41:18] just like thinking about like this scalable compute infrastructure, it just happens to be something that we're like quite good at here because of the way we do training. [41:26] And then lastly, for me, [41:29] and this might be like a little bit more out there, but like ultimately, like,

41:33-42:38

[41:33] So the reason that I work here is, you know, to bring the benefits of AGI to all humanity, and if I think of the shape of that, that's not like some like ultra bespoke thing. It's like this general, [41:43] like AGI super assistant. And so like, that's kind of the world that I want to live in. Like I want to wake up and just like, well, probably not immediately. Maybe go for a walk first. Don't look at my phone. But then at some point I want to just like reach out to an assistant, like have a conversation with it and like just have one thing and it does stuff and then I can like dive in and do stuff myself. And so I think like, [42:01] that's, [42:02] I feel like ChatGPT as an app and as an organization is very AGI super-assistant-pilled. And so for me, that's kind of where we end up bringing this towards. [42:12] Yeah, that makes sense. The integration into ChatGPT is the big thing that's going to make this very different from any other tool like this. Yeah, there's nothing like ChatGPT. Yeah. [42:24] Alexander, it was so great to chat. Thank you so much for coming on. Thanks for building this. Thanks for letting me try it early. I'm psyched to see where it goes. [42:31] Cool, yeah. Thanks so much for trying it early. Thank you for all the feedback. Hopefully you notice many of the issues are addressed. And yeah, thanks again.
