How OpenAI Cut Agent Latency With WebSockets and Smarter Caching

Every millisecond counts when you’re running an AI agent in a loop. OpenAI just published a detailed technical breakdown of how it rearchitected the infrastructure behind Codex using WebSockets and connection-scoped caching inside the Responses API — and the results are genuinely impressive. Less overhead, faster model response times, and a cleaner pattern for developers building long-running agent loops. This isn’t a product launch. It’s a peek inside OpenAI’s engineering kitchen, and what’s cooking is worth paying attention to.

Why Agentic Workloads Break Traditional API Design

Here’s the thing about agents: they’re not one-shot requests. A single Codex session might involve dozens or hundreds of sequential model calls — read a file, write a function, run tests, fix errors, repeat. Each one of those calls, under a traditional REST API model, involves spinning up a new HTTP connection, authenticating, shipping a full context payload, and waiting for a response. Do that fifty times in a row and the overhead compounds fast.
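To see how quickly that overhead compounds, here is a back-of-the-envelope model. The millisecond figures and function names are illustrative assumptions for the sake of the arithmetic, not measured numbers from OpenAI's post:

```python
# Toy cost model: per-call overhead in an agent loop.
# All timing constants below are illustrative assumptions.

def stateless_overhead_ms(calls, setup_ms=120, context_upload_ms=80):
    """Every call pays connection setup plus a full context re-upload."""
    return calls * (setup_ms + context_upload_ms)

def persistent_overhead_ms(calls, setup_ms=120, delta_upload_ms=5):
    """One setup at session start; each turn ships only the new delta."""
    return setup_ms + calls * delta_upload_ms

calls = 50
print(stateless_overhead_ms(calls))   # 50 * (120 + 80) = 10000 ms
print(persistent_overhead_ms(calls))  # 120 + 50 * 5    = 370 ms
```

Under these assumed numbers, a fifty-turn session spends ten seconds on pure connection overhead in the stateless model, versus well under half a second with a persistent connection. The exact figures will vary, but the shape of the math is the point.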

The classic request-response model was designed for discrete, stateless interactions. You ask, the server answers, the connection closes. That works fine for autocomplete or a single chat turn. But for an autonomous coding agent that needs to iterate rapidly over a codebase, it starts to feel like driving on a highway with a stop sign every fifty feet.

OpenAI’s engineers clearly felt that friction. The Codex agent loop — the backbone of the product that now reaches over 4 million weekly users — was hitting real-world latency ceilings that weren’t about model speed. They were about connection setup, context re-transmission, and cache misses. Fixing the model wasn’t the answer. Fixing the plumbing was.

What Actually Changed: WebSockets and Connection-Scoped Caching

The solution OpenAI landed on has two interlocking parts, and both are worth understanding in detail.

Persistent Connections via WebSockets

WebSockets replace the standard HTTP request-response cycle with a persistent, bidirectional channel. Instead of tearing down and rebuilding a connection between every model call, the client and server maintain an open pipe for the life of the session. Messages flow both ways without the handshake tax of repeated connection setup.

For agentic workflows, this is a meaningful architectural shift. The Codex agent loop can now stream tool outputs back to the model, receive streaming responses, and fire off the next turn — all without the latency penalty of re-establishing HTTP connections. OpenAI’s post describes this as dramatically reducing what they call “API overhead,” which is the gap between when a request is logically ready and when the model actually starts processing it.

WebSockets aren’t new technology — they’ve been a web standard since 2011 and are widely used in chat applications, live dashboards, and multiplayer games. What’s notable here is OpenAI applying them systematically to AI inference pipelines in a way that’s exposed to developers through the Responses API.

Connection-Scoped Caching

The second piece is arguably more interesting from an AI infrastructure standpoint. Connection-scoped caching means that certain computed artifacts — specifically the KV (key-value) cache that transformer models use when processing context — are held alive on the server side for the duration of a WebSocket session rather than being discarded after each request.

To understand why this matters, you need a quick primer on how large language models handle long contexts. When a model processes a prompt, it computes attention keys and values for every token in that prompt. That computation is expensive. Under a stateless API, every new request has to recompute the KV cache from scratch, even if 90% of the context is identical to the previous call — which in an agent loop, it almost always is.

Connection-scoped caching holds that KV cache warm on the server, tied to the WebSocket session. When the next turn arrives, the model can skip recomputing the stable prefix and jump straight to processing the new tokens. OpenAI’s engineering post reports this produced measurable improvements in model latency — not just API round-trip time, but actual time-to-first-token, which is the number developers care about most during iterative agentic execution.
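The mechanics can be sketched with a toy simulation. A real KV cache holds attention tensors per token; here we simply count how many tokens need fresh computation with a session-scoped cache. The class and method names are illustrative, not the Responses API's actual internals:

```python
# Toy simulation of connection-scoped prefix caching.
# Counts tokens that need fresh computation per turn.

def common_prefix_len(a, b):
    """Length of the shared prefix between two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class SessionCache:
    """Holds the token sequence whose KV entries are already computed.
    In the real system this lives server-side, tied to the WebSocket."""
    def __init__(self):
        self.cached = []

    def process(self, tokens):
        """Return how many tokens actually need fresh computation."""
        reuse = common_prefix_len(self.cached, tokens)
        fresh = len(tokens) - reuse
        self.cached = list(tokens)
        return fresh

session = SessionCache()
turn1 = ["sys"] * 1000 + ["file.py"] * 500   # initial context: 1500 tokens
turn2 = turn1 + ["tool output"] * 50         # same prefix plus a small delta
print(session.process(turn1))  # 1500: cold cache, everything is computed
print(session.process(turn2))  # 50: only the new tokens are processed
```

In an agent loop where each turn appends tool output to a long, stable prefix, the second case is the common one, which is why the time-to-first-token win is so pronounced.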

What the Codex Agent Loop Looks Like Now

Putting it together, the revised architecture for the Codex agent loop looks roughly like this:

  • Client opens a WebSocket connection to the Responses API at session start
  • First model call establishes the KV cache for the shared system prompt and initial context
  • Subsequent turns stream new tool outputs and user instructions over the open connection
  • The server reuses the cached context prefix, computing only the delta
  • Responses stream back in real time without connection teardown between turns
  • Session ends and cache is released when the WebSocket closes
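The lifecycle above can be sketched in miniature. Here an in-process object stands in for the server side of the WebSocket; the KV-style cache lives only as long as the session object, mirroring the connection-scoped behavior. None of this is the Responses API's actual wire protocol, just an illustration of the session shape:

```python
import asyncio

# In-process sketch of the session lifecycle: cold first turn,
# delta-only subsequent turns, cache released on close.

class FakeServer:
    def __init__(self):
        self.cache = None  # connection-scoped: created on first turn

    async def handle(self, message):
        context = message["context"]
        if self.cache is None:
            self.cache = list(context)            # cold start: compute all
            computed = len(self.cache)
        else:
            delta = context[len(self.cache):]     # reuse the stable prefix
            self.cache.extend(delta)
            computed = len(delta)
        return {"computed_tokens": computed}

    def close(self):
        self.cache = None  # cache released when the session ends

async def agent_session():
    server = FakeServer()
    context = ["system prompt"] * 100
    computed_per_turn = []
    for tool_output in (["diff"] * 10, ["test log"] * 5):
        context = context + tool_output           # each turn appends a delta
        reply = await server.handle({"context": context})
        computed_per_turn.append(reply["computed_tokens"])
    server.close()
    return computed_per_turn

print(asyncio.run(agent_session()))  # [110, 5]
```

The first turn computes the full 110-token context; the second computes only its 5-token delta. That asymmetry is the whole value proposition.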

This is a cleaner loop. And importantly, it’s a pattern that any developer building on the Responses API can now adopt — not just OpenAI’s own products.

What This Means for Developers Building Agents

OpenAI publishing this breakdown isn’t just transparency theater. It’s a signal about where the Responses API is heading and what patterns they expect developers to build around.

The Performance Gap Is Real — and It Was Holding Agents Back

Anyone who has built a multi-step agent on top of a REST API knows the pain. You’re paying latency costs that have nothing to do with model capability. For simple workflows, it’s tolerable. For agents that need to take fifty actions to complete a task — think a Codex session debugging a gnarly codebase — those connection overhead costs add up to seconds, sometimes tens of seconds, across a full session. That’s the difference between an agent that feels responsive and one that feels like it’s thinking through mud.

The Agents SDK updates around native sandboxes and smarter execution addressed part of this problem at the tooling layer. WebSockets and connection-scoped caching address it at the infrastructure layer. These things are complementary, not redundant.

Competitors Are Watching

Anthropic’s Claude API, Google’s Gemini API, and Mistral’s endpoints all face the same fundamental challenge: APIs designed for request-response patterns are being stretched to support agentic, multi-turn, long-context workflows they were never built for. Google has been investing heavily in streaming and long-context capabilities in Gemini, but there’s no public equivalent to what OpenAI described here in terms of connection-scoped KV cache persistence.

This matters because inference efficiency at the session level translates directly to cost and user experience. If OpenAI can serve agent sessions with less compute per turn thanks to cache reuse, that’s a margin advantage and a speed advantage simultaneously. Competitors will need to respond.

For enterprise customers — the Hyatts of the world deploying ChatGPT Enterprise across global workforces — faster, cheaper agent loops translate directly to practical utility. Agents that complete tasks in thirty seconds rather than two minutes get adopted. Ones that feel sluggish get abandoned.

Developer Adoption Requires Documentation, Not Just Architecture

One thing I’d flag: WebSocket support in APIs is not always trivial to implement on the client side, especially for teams used to simple HTTP calls. OpenAI will need solid SDK support and clear documentation for this pattern to see broad adoption. The engineering post is a great start, but the developer experience around session lifecycle, error handling, and reconnection logic needs to be airtight. Persistent connections introduce failure modes — dropped sessions, stale caches — that stateless APIs simply don’t have.
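What client-side resilience might look like can be sketched as follows. The key subtlety is that because the cache is scoped to the connection, a reconnect must assume a cold cache and resend the full context. The backoff parameters and class names are illustrative assumptions, not values from OpenAI's documentation:

```python
import random

# Sketch of client-side reconnection handling for a persistent session.
# Backoff constants are illustrative defaults.

def backoff_schedule(attempts, base=0.5, cap=30.0, jitter=False):
    """Exponential backoff delays (seconds) for successive reconnects."""
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay = random.uniform(0, delay)  # "full jitter" variant
        delays.append(delay)
    return delays

class AgentClient:
    def __init__(self):
        self.connected = False
        self.cache_warm = False

    def on_disconnect(self):
        self.connected = False
        self.cache_warm = False  # server-side KV cache is gone with the socket

    def next_payload(self, full_context, delta):
        """Warm cache: ship only the delta. Cold cache: ship everything."""
        return delta if self.cache_warm else full_context

print(backoff_schedule(5))  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

An SDK that handles this transparently, tracking cache state and falling back to a full-context resend after a drop, is exactly the kind of developer-experience work the pattern needs to see broad adoption.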

Key Takeaways

  • WebSockets in the Responses API eliminate repeated HTTP connection overhead between agent loop turns, reducing latency at the infrastructure level
  • Connection-scoped KV caching means the model doesn’t recompute attention over stable context prefixes on every turn — a direct improvement to time-to-first-token
  • The Codex agent loop was the proving ground for these changes, and the improvements are now available to any developer using the Responses API
  • This is an infrastructure-layer fix to a problem that model improvements alone couldn’t solve
  • Competitors including Anthropic and Google haven’t publicly matched this specific capability yet
  • Enterprise and high-frequency agentic use cases will feel the biggest impact — consumer single-turn usage won’t notice much difference

Frequently Asked Questions

What is connection-scoped caching in the Responses API?

It’s a server-side optimization where the KV cache computed during model inference is kept alive for the duration of a WebSocket session. This means the model doesn’t have to reprocess unchanged context on every new turn, which cuts time-to-first-token in long agent loops.

Do I need to rewrite my existing Responses API integration to use WebSockets?

Not necessarily — traditional HTTP requests still work. But if you’re building multi-step agents or long-running loops, migrating to WebSocket-based sessions is where you’ll see meaningful latency improvements. OpenAI’s documentation should be the first stop for implementation details.

Is this available to all API users or just Codex?

OpenAI’s post frames this as a Responses API capability, not a Codex-exclusive feature. Developers building their own agents on top of the Responses API should be able to adopt the same WebSocket pattern that Codex uses internally.

How does this compare to what Google or Anthropic offer?

Google’s Gemini API and Anthropic’s Claude API both support streaming responses, but neither has publicly announced connection-scoped KV cache persistence tied to WebSocket sessions in the way OpenAI has described here. That said, all three companies are iterating quickly on infrastructure for agentic use cases, so the gap may narrow.

OpenAI has essentially turned its own products into a live stress test for its API infrastructure — and then published what it learned. That’s a healthy pattern. As agents get more capable and more widely deployed, the engineering work underneath them becomes as important as the models themselves. The expansion of Codex with computer use, browsing, and memory means these sessions are only going to get longer and more complex. Getting the plumbing right now is smart. I’d expect to see connection-scoped caching and WebSocket-first patterns become table stakes across the major AI API providers within the next year or two.