ChatGPT apps, AgentKit, and Codex updates... oh my!

OpenAI’s latest DevDay was all about infrastructure. Here’s what shipped:

Apps in ChatGPT

New Apps SDK built on Model Context Protocol (MCP)
Conversational apps that surface contextually in ChatGPT
800+ million potential users for developers
Launch partners: Booking.com, Canva, Coursera, Expedia, Figma, Spotify, Zillow
Open standard that can run anywhere

AgentKit

Agent Builder: visual canvas for multi-agent workflows with versioning
Connector Registry: centralized admin panel for data sources across workspaces
ChatKit: embeddable chat UI toolkit for agentic experiences
New evals capabilities: datasets, trace grading, automated prompt optimization

Codex (General Availability)

Slack integration for delegating tasks to @Codex
Codex SDK for embedding the agent in custom workflows
New admin tools: environment controls, monitoring, analytics
10x daily usage growth since August
40+ trillion tokens served by GPT-5-Codex in three weeks
70% more pull requests merged per week at OpenAI

API Updates

GPT-5 Pro and Sora 2 now available in API
Reinforcement fine-tuning beta for GPT-5
Third-party model support in Evals platform

That’s a lot. Let’s dig into what this actually means and why some of these decisions should make you uncomfortable.

Apps in ChatGPT

The Apps SDK is OpenAI’s attempt to solve a problem that’s plagued AI interfaces since GPT-3: how do you let the model do something beyond generating text?

[Image: mobile screen showing a ChatGPT conversation with a Booking.com integration. The user asks to “find me a hotel in Paris for two adults between 11/21–11/24 with parking facilities.” Below, Booking.com displays search results for the Zoku Paris and Le Meurice hotels with photos, prices, and amenities such as Wi-Fi, parking, and pool.]

The traditional answer has been “function calling” or “tool use,” where the model outputs structured JSON that your code interprets. It works, but it’s clunky. You need to build the entire UI layer yourself, handle state management, deal with streaming responses, and figure out how to present complex interactions in a chat interface.
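To make the “clunky” part concrete, here’s a minimal sketch of the plumbing function calling leaves to you. The `ToolCall` shape, the `search_hotels` tool, and its arguments are all hypothetical, chosen for illustration rather than taken from any particular API. Notice that the result is still just a string: rendering it, tracking state, and feeding it back to the model remain your job.

```typescript
// Hypothetical shape of a tool call emitted by a model (field names are illustrative).
interface ToolCall {
  name: string;
  arguments: string; // JSON-encoded arguments, as produced by the model
}

type ToolHandler = (args: Record<string, unknown>) => string;

// Registry of tools the application exposes. "search_hotels" is a made-up example.
const tools: Record<string, ToolHandler> = {
  search_hotels: (args) => {
    const city = String(args["city"] ?? "unknown");
    const guests = Number(args["guests"] ?? 1);
    return `Found 2 hotels in ${city} for ${guests} guest(s)`;
  },
};

// Look up the handler, parse the model's JSON, and return a text result.
function dispatchToolCall(call: ToolCall): string {
  const handler = tools[call.name];
  if (!handler) {
    return `Error: unknown tool "${call.name}"`;
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(call.arguments);
  } catch {
    return "Error: arguments were not valid JSON";
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return "Error: arguments must be a JSON object";
  }
  return handler(parsed as Record<string, unknown>);
}

const output = dispatchToolCall({
  name: "search_hotels",
  arguments: '{"city": "Paris", "guests": 2}',
});
console.log(output);
```

Everything after `console.log` (the hotel cards, the photos, the booking button) is the UI layer you would otherwise have to build yourself.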

Apps in ChatGPT aims to simplify this. Instead of ChatGPT calling your API and you building a separate frontend, you build the interface directly into the conversation. When someone asks “Spotify, make a playlist for my party this Friday,” the Spotify app appears inline with an interactive interface. You can browse, adjust, and interact without leaving ChatGPT.

This is built on the Model Context Protocol, Anthropic’s open standard that OpenAI is fully embracing. In theory, apps built with the Apps SDK can run anywhere that adopts MCP. In practice, we’ll see how much that actually happens.
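The portability argument rests on the wire format. MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method. Here’s a rough sketch of a server-side handler for that one message. The `make_playlist` tool and its `occasion` argument are invented for illustration, and whatever the Apps SDK layers on top for rendering UI is left out.

```typescript
// JSON-RPC 2.0 request and response shapes for MCP's "tools/call" method.
interface ToolsCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

interface ToolsCallResult {
  jsonrpc: "2.0";
  id: number;
  result: { content: { type: "text"; text: string }[]; isError: boolean };
}

// Handles a single hypothetical tool, "make_playlist".
function handleToolsCall(req: ToolsCallRequest): ToolsCallResult {
  if (req.params.name !== "make_playlist") {
    // Simplification: a real server would reply with a JSON-RPC error object here.
    throw new Error(`Unknown tool: ${req.params.name}`);
  }
  const occasion = req.params.arguments["occasion"];
  // A failure inside the tool is reported in the result via isError.
  const failed = typeof occasion !== "string" || occasion.length === 0;
  const text = failed
    ? "Cannot build a playlist without an occasion"
    : `Created playlist for: ${occasion}`;
  return {
    jsonrpc: "2.0",
    id: req.id, // the response id mirrors the request id
    result: { content: [{ type: "text", text }], isError: failed },
  };
}

const request: ToolsCallRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: { name: "make_playlist", arguments: { occasion: "party this Friday" } },
};
const response = handleToolsCall(request);
console.log(JSON.stringify(response, null, 2));
```

Nothing in that exchange is specific to ChatGPT, which is the basis for the “run anywhere” claim. Whether other clients render the same interactive UI is a separate question.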

AgentKit

AgentKit is OpenAI’s answer to a pain point in agent development: building complex workflows is messy. You need orchestration logic, tool management, evaluation pipelines, safety guardrails, and a frontend. Each piece involves different tools, different frameworks, and a lot of custom glue code.

Agent Builder provides a visual canvas for composing multi-agent workflows (think n8n). You drag and drop nodes for agents, tools, guardrails, and conditional logic. You can configure evals inline, version your workflows, and preview runs before deployment.

Visual workflow builders have a long history in enterprise software, and they all face the same problem: they’re great for demos and prototyping, then you hit edge cases that don’t fit the abstraction.

The real test isn’t whether you can build a simple agent quickly. It’s whether you can maintain and evolve a complex agent system over time as requirements change, edge cases emerge, and your understanding of the problem deepens. Visual builders tend to break down at that point because the abstraction becomes a constraint.
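One way to see the constraint is to write down what a canvas like this represents underneath: a small set of node types and a runner that walks them. The sketch below is hypothetical and is not Agent Builder’s actual format. The node kinds, field names, and the `supportFlow` example are all assumptions made for illustration.

```typescript
// Hypothetical node types for a workflow graph (not Agent Builder's real schema).
type WorkflowNode =
  | { kind: "guardrail"; blockedTerms: string[] }
  | { kind: "agent"; name: string; respond: (input: string) => string }
  | {
      kind: "branch";
      when: (input: string) => boolean;
      ifTrue: WorkflowNode[];
      ifFalse: WorkflowNode[];
    };

interface RunResult {
  blocked: boolean;
  text: string;
}

// Walk the nodes in order, threading the text through each one.
function runWorkflow(nodes: WorkflowNode[], input: string): RunResult {
  let text = input;
  for (const node of nodes) {
    switch (node.kind) {
      case "guardrail": {
        const lowered = text.toLowerCase();
        if (node.blockedTerms.some((term) => lowered.includes(term))) {
          return { blocked: true, text: "Blocked by guardrail" };
        }
        break;
      }
      case "agent":
        text = node.respond(text);
        break;
      case "branch": {
        const next = node.when(text) ? node.ifTrue : node.ifFalse;
        const result = runWorkflow(next, text);
        if (result.blocked) return result; // propagate a block from a nested path
        text = result.text;
        break;
      }
    }
  }
  return { blocked: false, text };
}

const supportFlow: WorkflowNode[] = [
  { kind: "guardrail", blockedTerms: ["password"] },
  {
    kind: "branch",
    when: (input) => input.includes("refund"),
    ifTrue: [{ kind: "agent", name: "billing", respond: (i) => `[billing] ${i}` }],
    ifFalse: [{ kind: "agent", name: "general", respond: (i) => `[general] ${i}` }],
  },
];

console.log(runWorkflow(supportFlow, "I want a refund"));
console.log(runWorkflow(supportFlow, "What is my password?"));
```

The moment a requirement needs something the node set doesn’t cover (a retry loop, say, or two agents running in parallel), you either extend the type and the runner or work around them. In code that’s a refactor; on a canvas you’re waiting for the vendor to ship a new node.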

ChatKit is the UI component, a toolkit for embedding chat-based agent experiences into your own products. Canva supposedly saved “over two weeks of time” building a support agent and integrated it “in less than an hour.”

Building chat UIs with streaming responses, thread management, and thinking indicators can be tedious. ChatKit solves a real problem. But it also means agent interfaces look increasingly similar across different products. When every company uses ChatKit to embed agents, we get interface convergence. The chat paradigm becomes ubiquitous, whether or not it’s the best interaction model for every use case.

Evals

OpenAI is adding four new capabilities to their Evals platform:

Datasets: Build evals from scratch with automated graders and human annotations
Trace grading: End-to-end assessment of agentic workflows
Automated prompt optimization: Generate improved prompts based on annotations and grader outputs
Third-party model support: Evaluate non-OpenAI models within their platform

The dataset and trace grading features address a fundamental challenge in agent development: you need systematic ways to evaluate complex, multi-step behaviors. Traditional unit tests don’t capture whether an agent workflow produces the right outcome, only whether individual components work in isolation.
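Here’s a small sketch of that gap between component tests and end-to-end grading. The `TraceStep` format, the tool names, and the grading rules are assumptions for illustration and do not reflect how OpenAI’s trace grading is implemented. Every step in the example succeeds on its own, yet the run as a whole fails because the agent never checked for parking.

```typescript
// Hypothetical trace format: one entry per step the agent took.
interface TraceStep {
  tool: string;
  ok: boolean; // whether the step itself succeeded in isolation
  output: string;
}

interface TraceGrade {
  passed: boolean;
  reasons: string[];
}

// Grade the whole run: required steps must appear and the final output must match.
function gradeTrace(
  trace: TraceStep[],
  requiredTools: string[],
  expectedFinal: string
): TraceGrade {
  const reasons: string[] = [];
  const used = new Set(trace.map((step) => step.tool));
  for (const tool of requiredTools) {
    if (!used.has(tool)) reasons.push(`missing required step: ${tool}`);
  }
  const last = trace[trace.length - 1];
  if (!last || !last.output.includes(expectedFinal)) {
    reasons.push(`final output does not contain "${expectedFinal}"`);
  }
  return { passed: reasons.length === 0, reasons };
}

const trace: TraceStep[] = [
  { tool: "search_hotels", ok: true, output: "2 results" },
  { tool: "book_hotel", ok: true, output: "Booked Zoku Paris" },
];

// The unit-test view: each component worked.
const allStepsOk = trace.every((step) => step.ok);

// The trace-grading view: did the workflow do what the user asked?
const grade = gradeTrace(trace, ["search_hotels", "check_parking", "book_hotel"], "Booked");

console.log({ allStepsOk, grade });
```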

Automated prompt optimization is interesting but potentially dangerous. The idea is that the system analyzes human annotations and grader outputs, then generates improved prompts automatically. This works when the optimization space is well-defined and the metrics clearly capture what “better” means. It fails when the real problem is ambiguous requirements or misaligned incentives.

If developers standardize on OpenAI’s Evals platform for measuring all their AI systems, OpenAI gets data about how their competitors’ models perform, what use cases they’re being evaluated for, and where they succeed or fail. That’s valuable competitive intelligence packaged as developer tooling (smart).

Codex

Codex going generally available is the quietest but potentially most disruptive announcement from DevDay.

The numbers are striking: 10x daily usage growth since August, 40+ trillion tokens served by GPT-5-Codex in three weeks, 70% more pull requests merged per week at OpenAI. Nearly all OpenAI engineers now use Codex, up from just over half in July.

The new Slack integration lets you tag @Codex in a channel or thread, and it automatically gathers context, chooses the right environment, and completes the task. From Slack, you can merge changes, iterate, or pull the task locally. This is infrastructure for treating AI agents like team members. You don’t switch to a different tool or interface. You just @mention Codex in your existing workflow, the same way you’d ask a coworker for help.

The Codex SDK embeds the same agent that powers the CLI into your own workflows and apps. A few lines of TypeScript get you a state-of-the-art coding agent:

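import { Codex } from "@openai/codex-sdk";

// Create the agent and open a thread; the thread keeps context across runs.
const agent = new Codex({});
const thread = await agent.startThread();

// First turn: let the agent read the repository.
const result = await thread.run("Explore this repo");
console.log(result);

// Second turn on the same thread builds on the first.
const result2 = await thread.run("Propose changes");
console.log(result2);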

The new admin tools give enterprise IT teams control over how Codex operates: environment controls, monitoring, analytics dashboards, and managed configuration. Starting October 20, Codex cloud tasks count towards usage limits.

[Image: Codex analytics dashboard showing two charts: a bar chart of daily code review issues by priority, and a stacked area chart of the sentiment of code review feedback over time.]

Here’s the hard question: what happens to junior developers when coding agents become this good?

The traditional software engineering career path assumes you start by handling straightforward, well-defined tasks while building expertise. Those tasks are exactly what Codex excels at: code reviews, tech debt cleanup, repetitive changes, even implementing features from clear specifications.

If agents can handle the entry-level work, how do junior developers build the experience and judgment they need to progress? You can’t jump straight to architectural decisions and complex system design without first spending time in the trenches writing and debugging code.

Or maybe this is wrong. Maybe junior developers will learn differently, focusing on prompting and code review skills instead of writing from scratch. Maybe they’ll handle higher-level work earlier because agents take care of the implementation details.

The bigger infrastructure picture

DevDay 2025 was about making AI capabilities more accessible, more integrated, and more production-ready.

Agent Builder, ChatKit, and the Codex SDK lower the barriers to building agent systems. Apps in ChatGPT provide distribution to 800 million users. The Connector Registry and admin tools give enterprises the control they need to deploy at scale. Evals and RFT provide measurement and customization infrastructure.

This is how platform companies build moats. Not through the models themselves, but through the tooling and infrastructure that make the models useful. OpenAI’s models might be slightly better than competitors’ models, but the integrated development experience, the built-in distribution channel, and the enterprise admin capabilities are much harder to replicate.

The strategic question for developers is whether to accept this platform dependence in exchange for faster development and built-in distribution. It’s the same calculation mobile developers made with iOS and Android, web developers made with cloud platforms, and API developers made with AWS. The difference is that AI infrastructure is moving much faster. iOS took years to mature. The Apps SDK is in preview now, with monetization and directory features coming “later this year.” AgentKit went from concept to launch in months, with ChatKit and the new evals capabilities generally available and Agent Builder in beta. Codex grew 10x in daily usage since August.

The deeper question is about the future of software development itself. When agents can handle an increasing percentage of coding tasks, write documentation, review pull requests, and automate tech debt cleanup, what does the profession look like?

OpenAI’s answer appears to be: developers become orchestrators and overseers rather than implementers. You design systems at a higher level of abstraction, manage agent workflows, and review AI-generated code. The actual writing happens increasingly in collaboration with or delegation to agents.

This might be right. But it requires a fundamental shift in how we think about engineering skill, how we train developers, and how we evaluate productivity. Measuring pull requests merged per week (OpenAI’s metric) optimizes for volume, not for thoughtful design or long-term maintainability.

Originally published on the Handy AI newsletter.