How OpenAI, Thrive, and Crete Built a Self-Improving Tax Agent

Tax filing is one of those problems that looks simple until you’re actually inside it. Thousands of edge cases, constantly shifting regulations, jurisdiction-specific rules that contradict each other: it’s the kind of domain that breaks generic automation fast. So when OpenAI, Thrive, and Crete announced they’d used OpenAI Codex to build a self-improving tax agent capable of automating filings, catching its own errors, and getting measurably better over time, it wasn’t just a product milestone. It was a proof point for something the AI industry has been promising for years: agents that actually learn from production feedback without constant human babysitting.

Why Tax? Why Now?

Tax compliance is a $20+ billion professional services market in the US alone, and it’s almost entirely manual at the workflow level. Even firms with sophisticated software still rely heavily on humans to interpret edge cases, reconcile discrepancies, and verify outputs before submission. The cost of errors isn’t just financial — it’s reputational and regulatory.

The timing matters too. Codex has matured significantly from its early days as a code-completion tool. Today it’s being deployed as a full agentic system capable of writing, testing, and iterating on code autonomously — which is exactly what you need when your “automation” has to handle novel tax scenarios that weren’t in any training dataset. We covered Codex’s recognition as a Gartner leader in enterprise AI coding agents earlier this year, and this deployment shows why that designation wasn’t just marketing.

Thrive and Crete brought the domain expertise. OpenAI brought the infrastructure. The result is something that looks less like traditional tax software and more like a junior tax analyst who never sleeps, never gets tired, and actively improves their own workflow after each filing cycle.

How the Self-Improvement Loop Actually Works

This is where it gets genuinely interesting. Most AI deployments in enterprise settings are static: you train a model, you deploy it, and it stays frozen until someone manually updates it. That works fine for narrow, predictable tasks. Tax compliance is neither.

The architecture Thrive and Crete built with Codex follows a feedback-driven loop that has three core phases:

Automated filing execution: The agent processes tax documents, applies jurisdiction-specific rules, and generates filings autonomously. It handles the standard 80% of cases without human review.
Error detection and flagging: When the agent encounters ambiguity, conflicting rules, or low-confidence scenarios, it flags these for human review rather than guessing. Crucially, it logs the reason for uncertainty in structured form.
Self-generated test cases: Here’s the clever part — after each review cycle, Codex generates new test cases based on the flagged errors. These tests get folded back into the agent’s evaluation suite, so the same mistake essentially can’t happen twice without triggering an automatic failure in testing.
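The first two phases amount to a confidence-gated router. Here is a minimal sketch of that idea, not the architecture Thrive and Crete actually built: the `Flag` and `Decision` records, the `route` function, the reason categories, and the 0.9 threshold are all hypothetical, chosen only to show what "logging the reason for uncertainty in structured form" could look like.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Reason(Enum):
    # Hypothetical uncertainty categories, mirroring the ones named in the text
    AMBIGUOUS_INPUT = "ambiguous_input"
    CONFLICTING_RULES = "conflicting_rules"
    LOW_CONFIDENCE = "low_confidence"


@dataclass
class Flag:
    """Structured record of why a filing was sent to a human reviewer."""
    filing_id: str
    reason: Reason
    detail: str


@dataclass
class Decision:
    filing_id: str
    auto_filed: bool
    flag: Optional[Flag] = None


CONFIDENCE_THRESHOLD = 0.9  # assumed value, for illustration only


def route(filing_id: str, confidence: float, conflicts: list[str]) -> Decision:
    """File automatically only when nothing is uncertain; otherwise flag."""
    if conflicts:
        flag = Flag(filing_id, Reason.CONFLICTING_RULES, "; ".join(conflicts))
        return Decision(filing_id, auto_filed=False, flag=flag)
    if confidence < CONFIDENCE_THRESHOLD:
        detail = f"confidence {confidence:.2f} below {CONFIDENCE_THRESHOLD}"
        flag = Flag(filing_id, Reason.LOW_CONFIDENCE, detail)
        return Decision(filing_id, auto_filed=False, flag=flag)
    return Decision(filing_id, auto_filed=True)


print(route("F-1", 0.97, []).auto_filed)        # True
print(route("F-2", 0.50, []).flag.reason.name)  # LOW_CONFIDENCE
```

The point of the structured `Flag` is that a later step can group flags by reason and turn each reviewed one into a test, which free-text notes would make much harder.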

This last piece is what separates this from standard ML pipelines. The agent isn’t just learning from labeled data fed by humans. It’s writing its own regression tests from failure cases, which means its coverage expands with every production cycle. Over time, the human review burden shrinks because the categories of “things I’m uncertain about” get systematically closed off.
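To make "writing its own regression tests from failure cases" concrete, here is a small sketch under stated assumptions. The `RegressionSuite` class is hypothetical, and the deduction rules are toy functions, not real tax law and not anything from the case study. Each reviewed failure is frozen as an input plus the reviewer's corrected answer, and any future version of the logic has to reproduce it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RegressionCase:
    """One reviewed failure, stored as inputs plus the corrected answer."""
    name: str
    inputs: dict
    expected: float


class RegressionSuite:
    def __init__(self) -> None:
        self._cases: list[RegressionCase] = []

    def add_from_review(self, name: str, inputs: dict, corrected: float) -> None:
        """Record the reviewer's corrected value as a permanent test."""
        self._cases.append(RegressionCase(name, dict(inputs), corrected))

    def failures(self, compute: Callable[[dict], float]) -> list[str]:
        """Names of stored cases where `compute` disagrees with the reviewer."""
        return [
            case.name
            for case in self._cases
            if abs(compute(case.inputs) - case.expected) > 0.005
        ]


def deduction_v1(inputs: dict) -> float:
    # Toy rule (not real tax law): forgets that the deduction is capped
    return inputs["expenses"] * 0.5


def deduction_v2(inputs: dict) -> float:
    # Toy rule after review: same rate, capped at 1000
    return min(inputs["expenses"] * 0.5, 1000.0)


suite = RegressionSuite()
suite.add_from_review("cap_applies", {"expenses": 5000.0}, 1000.0)

print(suite.failures(deduction_v1))  # ['cap_applies']
print(suite.failures(deduction_v2))  # []
```

Once `cap_applies` is in the suite, any later rewrite that drops the cap fails immediately, which is the sense in which coverage grows with each review cycle.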

The Role Codex Plays

Codex isn’t just handling the natural language reasoning here. It’s doing the actual code generation — writing and modifying the logic that processes tax rules, updating calculation functions when regulations change, and producing the test scaffolding that validates outputs. This is a full software development loop running autonomously inside a compliance workflow.

Think about what that means in practice. When a new IRS ruling drops that affects how certain deductions are calculated, the agent can potentially generate the updated processing logic, write tests to verify it against known cases, and deploy the change — all without a developer manually coding the rule update. That’s not hypothetical; it’s what the collaboration with Thrive and Crete demonstrated.
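The case study doesn’t spell out how changes get deployed, so treat the following as a generic test-gated promotion pattern rather than a description of the actual pipeline. The `promote` function, the tolerance, and the toy 20% rule are all assumptions made for the example: a generated rule replaces the current one only if it reproduces every known case.

```python
from typing import Callable, Sequence

Rule = Callable[[float], float]
KnownCase = tuple[float, float]  # (input amount, expected result)


def passes_known_cases(rule: Rule, cases: Sequence[KnownCase],
                       tol: float = 0.005) -> bool:
    """True only if the rule matches every known case within tolerance."""
    return all(abs(rule(amount) - expected) <= tol for amount, expected in cases)


def promote(current: Rule, candidate: Rule, cases: Sequence[KnownCase]) -> Rule:
    """Return the candidate if it passes all known cases, else keep current."""
    return candidate if passes_known_cases(candidate, cases) else current


def current_rule(amount: float) -> float:
    return amount * 0.2  # toy rate, not a real tax rule


def bad_candidate(amount: float) -> float:
    return amount * 0.25  # generated update that breaks a known case


def good_candidate(amount: float) -> float:
    return round(amount * 0.2, 2)  # generated update that adds rounding


known = [(100.0, 20.0), (0.0, 0.0)]

print(promote(current_rule, bad_candidate, known).__name__)   # current_rule
print(promote(current_rule, good_candidate, known).__name__)  # good_candidate
```

A gate like this only protects against regressions on cases someone has already written down; a new ruling still needs new expected values from a person who understands it.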

Accuracy Numbers and What They Actually Mean

OpenAI hasn’t published full benchmark data yet, but the case study references measurable improvements in filing accuracy and a significant reduction in the time professionals spend on review. I’d be skeptical of round-number claims here, since “X% more accurate” is easy to game depending on your baseline, but the structural argument is sound. When a system generates its own regression tests from real failures, the set of mistakes it is guarded against can only grow, so previously caught errors should not resurface as long as the suite keeps running and the underlying model doesn’t degrade. That says nothing about errors nobody has seen yet, but it’s a different guarantee than “we trained on more data.”
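To see why the baseline matters for a claim like “X% more accurate,” here is a quick piece of arithmetic. The numbers are invented for illustration and have nothing to do with the case study: the same change can be reported as a 2-point accuracy gain or as a 50% reduction in errors.

```python
def describe_improvement(errors_before: int, errors_after: int, total: int) -> dict:
    """Report one change two ways: accuracy points vs. relative error reduction."""
    acc_before = 1 - errors_before / total
    acc_after = 1 - errors_after / total
    return {
        "accuracy_gain_points": round((acc_after - acc_before) * 100, 2),
        "relative_error_reduction_pct": round(
            (1 - errors_after / errors_before) * 100, 2
        ),
    }


# Hypothetical: 40 wrong filings out of 1,000 before, 20 out of 1,000 after
print(describe_improvement(40, 20, 1000))
# {'accuracy_gain_points': 2.0, 'relative_error_reduction_pct': 50.0}
```

Both figures are true, which is exactly why a headline percentage without the underlying counts doesn’t tell you much.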

The workflow acceleration claim is more straightforward. Automating the initial document processing and rule application phases frees tax professionals to focus exclusively on the genuinely ambiguous cases. That’s a real efficiency gain regardless of what the exact numbers are.

What This Means for the Tax and Financial Services Industry

Let’s be honest about who’s nervous right now. Mid-tier tax preparation firms, the ones that have built their business on volume processing of straightforward returns, are looking at a technology that can handle their core workload autonomously and improve itself without additional engineering investment. The Big Four accounting firms have the resources to build and deploy these systems. The solo practitioner can potentially access them through platforms built on top of Codex. It’s the middle tier that faces the sharpest pressure.

For enterprise finance teams, this is closer to a tool than a threat. CFOs managing complex multi-jurisdiction filings have always needed specialist help that’s expensive and hard to scale. An agent that handles the execution layer while surfacing only the genuinely complex decisions for human judgment is exactly what they’d pay for.

The Compliance and Liability Question

Here’s the thing that every enterprise buyer will ask: who’s liable when the agent files something wrong? This is the question that slows down AI adoption in regulated industries more than any technical limitation. Thrive and Crete have clearly thought about this — the architecture explicitly routes uncertain cases to humans rather than auto-filing them — but the industry still lacks a clear legal framework for AI-assisted compliance work.

Interestingly, this same dynamic is playing out in healthcare. AdventHealth’s deployment of ChatGPT for clinical workflows faced similar questions about AI-assisted decisions in regulated contexts. The pattern emerging across industries seems to be: AI handles execution, humans retain sign-off authority, and liability stays with the licensed professional. That framework works for now, but it will get tested as these systems become more capable.

Self-Improvement as a Competitive Moat

There’s a strategic dimension here worth calling out. The self-improving architecture doesn’t just make the product better over time — it makes it harder to replicate. A competitor can copy the initial model, but they can’t copy the accumulated test cases and failure patterns from thousands of real production filings. That institutional knowledge, encoded as automated tests, becomes a genuine moat. It’s the same reason enterprise software companies with decades of edge-case handling are so hard to displace: the value isn’t just in the code, it’s in the scar tissue from every weird scenario they’ve seen.

OpenAI is essentially helping its partners build that scar tissue at machine speed.

Key Takeaways

The Codex-powered tax agent uses a self-improvement loop where production failures automatically generate new regression tests, reducing repeated errors over time.
Codex handles actual code generation for rule updates — not just reasoning — meaning it can adapt to regulatory changes without manual developer intervention.
Uncertain or ambiguous cases are routed to human reviewers, keeping licensed professionals in the loop for compliance and liability purposes.
The accumulated test suite from production use becomes a compounding competitive advantage that’s difficult for new entrants to replicate.
Mid-tier tax preparation firms face the most direct competitive pressure; enterprise finance teams are more likely to see this as a useful tool.

It’s also worth watching how this deployment influences OpenAI’s broader enterprise strategy. Ramp’s use of Codex to cut code review time showed the product working inside developer workflows. The Thrive and Crete collaboration shows it working inside professional compliance workflows where the stakes are considerably higher and the domain knowledge requirements are more demanding. Each successful deployment in a high-stakes vertical makes the next one easier to sell.

The real test will come when these agents face a genuinely novel regulatory change — something no training data anticipated. How the self-improvement loop handles true novelty, rather than variations on known patterns, will determine whether this architecture scales into something transformative or remains impressive but bounded. Either way, the gap between what AI can do in professional services and what most firms are actually deploying continues to widen fast.

Frequently Asked Questions

What is a self-improving tax agent?

It’s an AI system that automates tax filing workflows and, critically, generates its own test cases from production errors — meaning it systematically reduces the likelihood of repeating the same mistake. Unlike static automation tools, it improves its own accuracy over each filing cycle without requiring manual retraining.

What role does OpenAI Codex play in this system?

Codex goes beyond natural language reasoning here — it generates and modifies the actual code that processes tax rules, writes test cases to validate outputs, and handles logic updates when regulations change. It’s functioning as an autonomous software developer embedded in a compliance workflow.

Who built this and is it available to other businesses?

The system was built collaboratively by OpenAI, Thrive, and Crete. It’s currently a case study deployment rather than a packaged product, but the underlying Codex API that powers it is available to enterprise developers who want to build similar agentic systems in their own domains.

How does this compare to existing tax automation software like Thomson Reuters or Intuit?

Traditional tax software applies fixed rule sets that require manual updates when regulations change. The Codex-based agent can generate its own rule updates and test them autonomously, which is a fundamentally different architecture. Thomson Reuters and Intuit have significant distribution and trust advantages, but their automation layer is far more static by comparison.
