✦ ONIONTHOUGHTS.IO · ANYA JACOBY · AI/ML STRATEGIST · AWS SOLUTIONS ARCHITECT · GENERATIVE AI · CLOUD ARCHITECTURE · PEELING BACK THE LAYERS ✦
PROFESSIONAL WRITING & IDEAS


Hi, I'm Anya Jacoby! I'm a Compete Solutions Architect and AI/ML strategist. Here I write about cloud architecture, generative AI, and the evolving competitive landscape (or anything that really catches my eye).


Latest Writing

The Framework Wars Were the Warmup

The ADK vs. LangGraph debate is real, but it’s the appliance argument in a kitchen renovation where nobody’s checked the foundation yet.

▶ READ ARTICLE

Some Thoughts on Model Benchmarking and Use Cases

The AI industry evaluates models the way you’d hire a surgeon based entirely on a written exam. Two recent pieces of work show why the problem is worse than most people realize.

▶ READ ARTICLE

Competitive Analysis Solutions Architect Roadmap

A structured learning path for becoming a competitive analysis SA specializing in AI/ML and Generative AI across cloud platforms.

▶ VIEW ON GITHUB
ABOUT ME

Anya Jacoby

SOLUTIONS ARCHITECT, GENERATIVE AI · AMAZON WEB SERVICES

I'm a seasoned AWS Solutions Architect specializing in AI/ML and competitive cloud strategy, with a proven track record of multi-cloud expertise and cross-functional leadership. My work sits at the intersection of deep technical architecture and strategic communication.

I bring deep expertise across AWS, Azure, OpenAI, Google Cloud Platform, NVIDIA, Databricks, and Snowflake. From Amazon Bedrock and agentic workflows to GPU-accelerated deep learning and competitive deal advisory — I've worked across the stack and across the competitive landscape.

Before joining AWS, I was an Embedded Software Engineer turned Data Scientist at Carrier Global Corporation. I hold dual master's degrees, one in Applied Data Science and Machine Learning (with a focus on human-centered Artificial Intelligence) and an MBA, along with dual bachelor's degrees in Applied Mathematics and Economics, all from Syracuse University.

This site is where I publish my writing — long-form thinking on AI strategy, cloud architecture, and the ideas I find most worth sharing.


Technical Skills

AI/ML Strategy & Deployment · Cloud Competitive Analysis · Technical Enablement · Content Architecture & Stakeholder Design · Amazon Bedrock · SageMaker AI · Azure OpenAI / Foundry · Agentic Workflows · GPU-Accelerated ML (EC2/NVIDIA) · PyTorch · Databricks · Snowflake · Python · Spark / SQL · Data Engineering Pipelines · Public Speaking · Workflow Optimization

Certifications

AWS Certified Machine Learning – Specialty
AWS Certified Solutions Architect – Associate
AWS Certified Speaker
AWS Certified Cloud Practitioner
Microsoft Certified: Azure Administrator Associate

Experience

Feb 2025 – Present
AMAZON WEB SERVICES
Compete Solutions Architect II
Led end-to-end AI/ML and agentic architecture across AWS and Azure. Produced 120+ competitive assets and 9 flagship deliverables; served as trusted advisor in high-stakes competitive deals while influencing internal product and roadmap discussions.
Dec 2022 – Feb 2025
AMAZON WEB SERVICES
Compete Solutions Architect I
Designed end-to-end AI systems integrating AWS analytics, PyTorch-based training, feature stores, and scalable data engineering pipelines for generative, predictive, and agentic AI use cases.
Jul – Dec 2022
AMAZON WEB SERVICES
Associate Solutions Architect
Began technical enablement career at AWS — delivering architecture reviews, workshops, and competitive intelligence to field teams.
Aug 2021 – Jul 2022
CARRIER GLOBAL CORPORATION
Data Scientist
Built end-to-end Python ML models predicting condenser blockages and faults, deployed on Databricks. Collaborated on an agile Scrum team analyzing HVAC/R data and simulating failure modes.
Apr – Aug 2021
CARRIER GLOBAL CORPORATION
Software Engineer
Designed software representations of Container systems using Python and embedded C. Built automation and simulation tools to streamline IoT software testing.
Jun – Aug 2020
OMNICELL, INC.
Data Science / ML Intern
Built a text classification solution using Python, Spark, and SQL on Databricks, improving identification accuracy by ~30%. Delivered a production-ready MVP in under 3 months.

Education

SYRACUSE UNIVERSITY
M.S. Applied Data Science
2020 – 2021 · Focus: Human-Centered AI
SYRACUSE UNIVERSITY
MBA
2020 – 2021
SYRACUSE UNIVERSITY
B.S. Applied Mathematics
2016 – 2020
SYRACUSE UNIVERSITY
B.S. Economics
2016 – 2020
GET IN TOUCH

Let's Connect

Whether you want to collaborate, discuss AI strategy, or just say hi, I'd love to hear from you. Shoot me a message!

OPEN TO

Speaking engagements, collaborations, technical writing opportunities, building, and much more, so don't be shy!

WORK

Portfolio

Selected projects, case studies, and technical work.


Competitive Analysis SA Roadmap

A structured curriculum for mastering competitive analysis in AI/ML and Generative AI across AWS, Azure, GCP, and digital-native platforms.

▶ VIEW REPOSITORY

The Framework Wars Were the Warmup

A practical guide to building AI agent systems that survive production: framework comparisons, architecture layers, checklists, and anti-patterns.

▶ VIEW REPOSITORY
← BACK TO WRITING
AI & MODEL EVALUATION

Some Thoughts on Model Benchmarking and Use Cases

FEBRUARY 2026

I love an analogy so I’ll start that way: Let’s say you’re hiring a surgeon based entirely on their performance on a written multiple-choice exam. They ace it (maybe they memorized it overnight?), and you hire them. In the operating room, things go not-so-great fast, because cutting people open turns out to involve a different skill set than filling in bubbles.

This is, more or less, what the AI industry is doing right now with model evaluation.

Two recent pieces of work make the case, from different angles, that the problem is worse than most people realize.

The first is a February 2026 academic paper, “The Necessity of a Unified Framework for LLM-Based Agent Evaluation,” from researchers at SUNY and the University of Illinois (arxiv.org/abs/2602.03238). Its core argument is simple and damning: we have no reliable way to compare AI agents against each other, because every research team tests them differently. Different prompts, different tools, different environments; what looks like a model getting smarter might just be a researcher writing a better setup. When you can’t isolate the thing you’re trying to measure, the number you get back isn’t a measurement. It’s a guess-timate.

The second is Artificial Analysis, an independent benchmarking organization that tests every major AI model itself, on its own hardware, under the same conditions for everyone (x.com/ArtificialAnlys). No submitted scores from the labs, no cherry-picking. This matters because in late 2025 it came out that major AI labs had been submitting only their best results to public leaderboards, which (as you can probably “guess-timate”) inflated rankings by up to 100 points. Artificial Analysis’s approach, “don’t let anyone grade their own homework,” turns out to be the bare minimum standard for honesty.

In the end, both pieces of work land on the same uncomfortable truth. Even careful, independent testing can’t fully solve the problem the academic paper identifies, because the issue isn’t just who runs the tests, but what the tests are actually measuring. An AI agent isn’t a calculator that spits out the same answer every time. It’s more like an employee whose performance depends on how you manage them, what tools you give them, and what you ask them to do. Change the instructions slightly and you get wildly different results from the exact same model. The score doesn’t travel.

This is something I keep coming back to in my own work: the “best” model on a leaderboard is rarely the best model for your specific job. A smaller, faster, cheaper model with the right setup will outperform a frontier model with the wrong one almost every time. The benchmark tells you who won the standardized test. It says nothing about fit.

Despite the rambling, this does all mean something practically. Every AI benchmark score you see, including the careful independent ones, is a screening tool, not a prediction! It tells you which models aren’t worth your time. It does not tell you which model will work for you.

What the field actually needs is something medicine figured out a long time ago (and what we like to scare ourselves with when we start feeling something... not good... cough, cough, WebMD): published protocols. Every AI evaluation should come with a full description of exactly how it was run, the prompt, the tools, the environment, so results can be compared and trusted. Without that, the leaderboards are just marketing with better fonts.
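
To make "published protocol" a little more concrete, here is a minimal sketch, not any existing standard, of the metadata an evaluation could ship alongside its score: the exact model version, the full prompt, the tools, the environment, and the decoding settings. Every field name and value below is my own invention.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class EvalProtocol:
    """Everything needed to rerun (or at least interpret) a reported score."""
    model_id: str                 # exact model version, not just the family name
    system_prompt: str            # the full prompt, verbatim
    tools: list[str] = field(default_factory=list)   # tools the agent could call
    environment: str = ""         # harness / sandbox description
    decoding: dict = field(default_factory=dict)     # temperature, max tokens, etc.
    dataset_revision: str = ""    # which version of the benchmark was used
    runs: int = 1                 # how many trials the score averages over

protocol = EvalProtocol(
    model_id="example-model-2026-01",
    system_prompt="You are a coding agent. ...",
    tools=["search", "python_sandbox"],
    environment="docker: eval-harness:1.4, 1 CPU, no network",
    decoding={"temperature": 0.0, "max_tokens": 4096},
    dataset_revision="v2.1",
    runs=5,
)

# Publish this next to the score so readers can tell whether two leaderboards
# measured the same thing before comparing their numbers.
print(json.dumps(asdict(protocol), indent=2))
```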

The scores will keep climbing, the gap between the number and reality will keep widening, and people deploying these systems will keep learning the hard way that the report card and the job are two completely different things.


← BACK TO ALL POSTS
← BACK TO WRITING
AI ENGINEERING & ARCHITECTURE

The Framework Wars Were the Warmup

MARCH 2026

I’ve been thinking about kitchens.

Specifically, about the way people renovate kitchens. There’s this phase, right at the beginning, where you spend an absurd amount of time picking out the appliances. You go deep. You read reviews. You develop opinions about induction versus gas that you never asked to have. You get into arguments about refrigerator brands at dinner parties. And then the contractor shows up and says something like “so, about your foundation” and suddenly the sexy appliance debate feels very, very small.

That’s where AI engineering is right now. We’re in the appliance phase. And the foundation conversation is just starting to happen.

It started, for me at least, with the Google ADK versus LangGraph debate. Which, to be fair, is a real debate about real tradeoffs, and I don’t want to wave it away entirely because the distinction actually matters once you understand what each one is betting on. But I do want to put it in context. Because I think the context changes how you think about all of it.

Google ADK is the planned community of agent frameworks. You move in and the architecture is decided. The HOA exists. The landscaping is done and you’re probably not allowed to paint your mailbox a weird color. Multi-agent orchestration patterns come baked in: sequential agents, parallel agents, loop agents, you declare them and go. If you’re already living in the Google Cloud neighborhood (Vertex AI, Cloud Run, Firestore, BigQuery), the integrations feel seamless in the way that only a vertically integrated ecosystem can. The framework makes the decisions. You make the agents. And if your decisions happen to align with their decisions, you move genuinely fast.

LangGraph took the opposite bet. It hands you raw graph primitives, nodes and edges and state transitions, and basically says “good luck.” You’re buying the empty lot and building from scratch. But the empty lot comes with something the planned community can’t offer: full model agnosticism. Claude, GPT, Gemini, Llama, whatever shows up next quarter, all interchangeable without rewriting your orchestration layer. You get granular control over every single step your agent takes. And you get durable checkpointing.

I know. Checkpointing sounds like the most boring feature ever listed in documentation. But picture this: your agent is forty-seven steps into a complex workflow. It’s been pulling data, making judgment calls, coordinating with other services. And then it crashes. With most setups you’re restarting from zero. Forty-seven steps, gone, like they never happened. With durable checkpointing you resume from step forty-seven. The difference between those two outcomes is the difference between a system you can actually run in production and a system you demo once at a conference and then quietly stop bringing up.
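
To make the checkpointing point concrete, here is a minimal sketch of the pattern in plain Python, deliberately not LangGraph’s actual API: a graph of nodes that persists state after every step, so a crash at step forty-seven resumes at step forty-seven instead of step zero. The node names and checkpoint format are made up for illustration.

```python
import json, pathlib

# Minimal graph-of-nodes runner with durable checkpointing (illustrative only).
# Each node takes a state dict and returns an updated one.

def fetch(state):   return {**state, "data": "raw records"}
def judge(state):   return {**state, "verdict": "looks fine"}
def report(state):  return {**state, "report": f"{state['verdict']} on {state['data']}"}

NODES = [("fetch", fetch), ("judge", judge), ("report", report)]
CHECKPOINT = pathlib.Path("run_checkpoint.json")

def run(state=None):
    # Resume from the last completed step if a checkpoint exists.
    if CHECKPOINT.exists():
        saved = json.loads(CHECKPOINT.read_text())
        start, state = saved["next_step"], saved["state"]
    else:
        start, state = 0, state or {}

    for i in range(start, len(NODES)):
        name, fn = NODES[i]
        state = fn(state)
        # Persist after every node: a crash at step i+1 resumes here, not at zero.
        CHECKPOINT.write_text(json.dumps({"next_step": i + 1, "state": state}))
    return state

print(run())
```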

So the choice between them comes down to constraints. Deep in Google’s ecosystem and want velocity? ADK. Need provider flexibility, fine-grained state control, or can’t stomach vendor lock-in? LangGraph. Most engineers pick based on whatever got the most engagement last week. The ones building things that survive past their first quarter in production pick based on what they literally cannot afford to get wrong.

And I was thinking about this, about frameworks and tradeoffs and which bet to make, when I saw what Palantir had been building. And the framework conversation suddenly felt like the appliance conversation. Because Palantir wasn’t arguing about which wrench to use. They’d gone ahead and designed the entire building.

Their AIP platform lays out an end-to-end agentic architecture. Twelve layers. The full production stack, from model integration at the bottom to enterprise automation at the top. And when you look at the blueprints, you start seeing all the load-bearing walls that the framework debate doesn’t even acknowledge exist.

Layer one is secure LLM integration, commercial and open-source models, all pluggable with zero data retention by the provider. Swap models without rebuilding anything. That principle alone puts them ahead of teams still hardcoding API calls to a single provider and hoping that provider’s pricing stays reasonable (it won’t, but that’s a different essay).

Layer two is end-to-end observability. Every tool call traced. Every data access logged. Every decision recorded. If you’ve ever tried to debug a multi-agent system at 2 AM (and I mean actually debug it, staring at logs, squinting at trace IDs), you know why this matters. “It just... did that” is the kind of postmortem that gets people reassigned.

Layer three is where I started paying closer attention. They call it context engineering: real-time contextual data, contextual logic, and systems of action, all feeding into something called the Ontology at layer four. And the Ontology is probably the most underappreciated piece of the whole architecture.

Think of it this way. Most AI agents operate like a new hire who’s been handed a laptop and a Slack login but has never seen the org chart, doesn’t know who reports to whom, has no idea what the company actually does day to day. They can process information, sure, but they have no context for any of it. The Ontology is the opposite of that. A live, structured model of how the business actually works: the people, the processes, the assets, the relationships between all of them, with a human-plus-AI decision framework layered on top. When an agent plugs into that, it stops being a text processor and starts being something that can reason about consequences. Something that can actually run a piece of your operation.
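
As a toy illustration of the general idea, a few dictionaries and nothing like the real thing: the business represented as typed entities, relationships, and the actions an agent is permitted to take against each type, so an agent consults the model of the business before it acts. All names here are invented.

```python
# A toy "model of the business": typed entities, relationships between them,
# and the actions an agent may take per entity type. Purely illustrative.

ontology = {
    "entities": {
        "pump-7":    {"type": "asset",      "site": "plant-3", "owner": "maintenance"},
        "wo-1182":   {"type": "work_order", "target": "pump-7", "status": "open"},
        "r.alvarez": {"type": "person",     "role": "maintenance_lead"},
    },
    "relationships": [
        ("r.alvarez", "approves", "wo-1182"),
        ("wo-1182", "repairs", "pump-7"),
    ],
    "actions": {
        "work_order": ["schedule", "escalate"],   # what an agent may do, per type
        "asset":      ["read_telemetry"],
    },
}

def allowed_actions(entity_id: str) -> list[str]:
    """What is an agent permitted to do with this entity?"""
    etype = ontology["entities"][entity_id]["type"]
    return ontology["actions"].get(etype, [])

# An agent that reasons about consequences checks this before touching anything.
print(allowed_actions("wo-1182"))   # ['schedule', 'escalate']
```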

The remaining layers fill in everything else. Media and vector services and multimodal compute at layer five. Security and governance at layer six (role-based controls, purpose-based access, approvals, checkpoints, all the stuff that regulated industries demand before letting an agent anywhere near production, because a hospital or a bank will not deploy anything that can’t explain who authorized what and why). Agent lifecycle management at seven. Operational automation at eight. Development environments with IDE integration and MCP extensions at nine. Human-plus-AI applications at ten. Packaging and deployment at eleven. Enterprise automation at twelve.

Twelve layers. One architecture. And a very compelling argument that the framework debate is addressing maybe 10% of the actual problem.

Which brings me to the part I keep circling back to, the pattern underneath all of this.

If you zoom out far enough, past the framework arguments, past Palantir’s architecture diagrams, AI engineering in 2026 sits on four pillars. Most engineers are comfortable with two of them. Almost nobody is fluent in all four. And that gap is where systems go to die. Not dramatically, not in some spectacular failure that makes the news, but slowly, the way a building with a cracked foundation settles over time until one morning a door won’t close and nobody can figure out why.

Pillar one is agent foundations. The part everyone starts with, and honestly the part that gets the most attention, probably because it’s the most fun. This is where you learn to build agents that can actually handle ambiguity: reflection, tool use, planning, multi-agent collaboration. ADK and LangGraph both live here, but the pillar is bigger than either framework. Understanding what an agent is, architecturally, before you start importing libraries.

Pillar two is protocols and communication. And this is the one that keeps me up at night a little, because almost nobody is talking about it, and it’s the single biggest reason multi-agent systems don’t scale.

Two protocols matter: MCP (Model Context Protocol) and A2A (Agent-to-Agent). MCP is how agents connect to tools, databases, APIs, file systems, anything external the agent needs to touch. A2A is how agents connect to each other, how they discover one another, negotiate tasks, and hand off work without a human manually wiring every connection.

Here’s how I think about it. Imagine building a company where every employee is individually brilliant but there’s no email. No Slack. No meetings. No shared documents. Nothing connects anyone to anything. Every person does their job in a vacuum. That company would be chaos. That company is also what most multi-agent systems look like right now. Individually capable agents, zero communication infrastructure. Skip this pillar and everything breaks the moment you go from one agent to two. Palantir has protocol support threaded through multiple layers of their architecture, tool services, agent orchestration, MCP extensions in the dev environment, and that tells you they’ve already hit the scaling wall that most teams are still accelerating toward.
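
To make the two layers less abstract, here is a rough sketch of the shape of each kind of message, with field names simplified for illustration rather than copied from either spec: an MCP-style tool call (MCP is built on JSON-RPC) and an A2A-style task handoff between agents.

```python
import json

# The shape of the two protocol layers, sketched as plain dictionaries.
# Field names are simplified and are not spec-accurate MCP or A2A payloads.

# Agent -> tool server: "call this tool with these arguments" (MCP-style).
tool_call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "query_database", "arguments": {"sql": "SELECT ..."}},
}

# Agent -> agent: "here is a task, here is the context you need" (A2A-style).
task_handoff = {
    "task_id": "invoice-review-042",
    "from_agent": "intake-agent",
    "to_agent": "compliance-agent",
    "goal": "Flag any line items that violate the expense policy",
    "context": {"document_ref": "s3://bucket/invoices/042.pdf"},
}

# The point of standardizing both: neither message cares which framework or
# which model sits on either end of the wire.
print(json.dumps(tool_call, indent=2))
print(json.dumps(task_handoff, indent=2))
```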

Pillar three is multi-agent orchestration. One agent is a demo. Multiple agents coordinating is a product. This is where you deal with the stuff that sounds simple in a blog post and turns out to be genuinely, teeth-grindingly hard in practice: cyclic reasoning (agents that loop back and refine their own output), shared memory across agents, failure recovery when one agent in a chain decides to go sideways.

And here’s the thing about orchestration that I wish someone had told me earlier: it’s a distributed systems problem. Full stop. It’s wearing an AI costume, sure, and the conferences have better snacks, but underneath it’s the same class of problem that has been punishing overconfident engineers since long before anyone was talking about large language models. And distributed systems punish you the same way every time. Quietly, at 3 AM, in production, in a way that takes four people and a whiteboard to untangle the next morning.
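
A small illustration of what that looks like in practice, with hypothetical agents standing in for real ones: the retry-with-backoff-and-fallback wrapper that every orchestration layer eventually grows, usually after the 3 AM incident rather than before it.

```python
import time

class AgentError(Exception):
    pass

def call_agent(agent, task):
    """Placeholder for whatever invokes an agent: an API call, a queue, an RPC."""
    return agent(task)

def run_with_recovery(primary, fallback, task, max_retries=3, backoff_s=2.0):
    # Retry the primary agent with backoff, then fall back, then fail loudly.
    # This is the unglamorous distributed-systems part of "orchestration".
    for attempt in range(1, max_retries + 1):
        try:
            return call_agent(primary, task)
        except AgentError as err:
            print(f"primary failed (attempt {attempt}): {err}")
            time.sleep(backoff_s * attempt)
    try:
        return call_agent(fallback, task)
    except AgentError as err:
        raise RuntimeError(f"both agents failed for task {task!r}") from err

# Toy agents standing in for real ones.
def flaky_agent(task):
    raise AgentError("tool timeout")

def steady_agent(task):
    return f"handled: {task}"

print(run_with_recovery(flaky_agent, steady_agent, "summarize Q3 incidents",
                        max_retries=2, backoff_s=0.0))
```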

Pillar four is evaluation and governance. The one everyone skips. The reason agents die in production.

How do you know your RAG pipeline is retrieving the right context and not just the most recent context? How do you govern what an agent is allowed to do, and audit what it already did? How do you handle messy, unstructured data so the agent doesn’t hallucinate its way through a workflow and present garbage with total confidence? These aren’t glamorous problems. Nobody’s writing viral posts about data preprocessing pipelines or evaluation harnesses. But Palantir dedicates entire architectural layers to exactly this, observability, security and governance, evaluation suites, because they’ve already learned what most startups are about to learn the hard way: an agent that can’t be audited can’t be trusted. And an agent that can’t be trusted doesn’t survive contact with any environment where the stakes are real.
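
As a sketch of how small the first step can be, here is a toy recall check over a hand-labeled evaluation set (the queries and document IDs are invented): for each question, did the retriever return the chunk that actually contains the answer, or just whatever else scored high?

```python
# A minimal retrieval check: for each labeled query, did the retriever return
# the chunk that actually contains the answer, or just whatever was newest?
# Illustrative only; real harnesses also score rerankers, grounding, and more.

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

eval_set = [
    {"query": "What is our refund window?",
     "relevant": ["policy-2024-refunds"],
     "retrieved": ["blog-2026-01", "policy-2024-refunds", "faq-007"]},
    {"query": "Who approves vendor contracts over $50k?",
     "relevant": ["procurement-matrix"],
     "retrieved": ["press-release-q4", "blog-2026-02", "org-chart"]},
]

scores = [recall_at_k(row["retrieved"], row["relevant"]) for row in eval_set]
print(f"mean recall@5: {sum(scores) / len(scores):.2f}")   # second query scores 0
```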

And once you see the four-pillar pattern, the whole landscape rearranges itself. The framework debate covers pillar one, maybe bleeds into pillar three. Important? Absolutely. But it’s a quarter of the picture. The teams that are actually shipping agentic systems that hold up under pressure are the ones that have all four pillars standing at the same time. And right now, that’s almost nobody.

So here’s the story, the way I see it, told all at once.

We spent two years building demos. Smart demos, fun demos, demos that made people lean forward in their chairs. And we argued, passionately, about the tools we used to build them. Google ADK or LangGraph. Opinionated defaults or flexible primitives. Speed or control.

Meanwhile, the teams that were actually putting agents into production kept running into the same realization: the agent was the easy part. The hard part was everything around it. The observability. The governance. The protocol layer that lets agents talk to tools and to each other. The evaluation framework that tells you whether your system is actually working or just confidently wrong. The security model that a regulated industry requires before it’ll let an agent near anything that matters.

The framework was the appliance. The architecture was the foundation.

And we’re just now, in 2026, starting to have the foundation conversation. Which is late, honestly. But better late than after everything’s already been built on sand.

The next two years are about systems. And systems, as it turns out, have a lot of layers.


← BACK TO ALL POSTS
← BACK TO WRITING
AI INFRASTRUCTURE & CLOUD STRATEGY

Speed Is the Next AI Moat, And the Partnerships Prove It

MARCH 2026

There is a moment in every technology cycle where the constraint everyone was ignoring becomes the only thing that matters. For a while it was data. Then it was talent. Then it was chips, generally. Now it is something more specific and, if you squint, more interesting: how fast a machine can finish its own thought.

Three announcements landed in the last few weeks, and read separately they look like standard-issue cloud partnership press releases, the kind where executives use the word “collaboration” four times in a single paragraph and everyone pretends the stock price is irrelevant. Read together, though, they describe something bigger. They describe the moment the AI industry stopped arguing about which model is smartest and started arguing about which infrastructure can think fastest, at scale, without melting.

That shift, from model intelligence to inference velocity, changes who wins, how companies build, and what “AI strategy” actually means at the executive level.

Let me explain.


Cerebras + AWS: The Case for Splitting the Brain

Today, literally today, AWS and Cerebras announced they are putting Cerebras CS-3 systems inside AWS data centers, accessible through Bedrock. This is notable on its own. Cerebras has been the speed outlier in AI hardware for years, but it operated in its own cloud, separate from the hyperscalers. Putting their silicon inside AWS is like a boutique restaurant opening a location in the airport. Suddenly the niche thing is available to everyone.

But the real story is the architecture they are building around it, which relies on a concept called inference disaggregation. And inference disaggregation, despite sounding like something a consultant made up to justify a slide deck, is actually an elegant idea that solves a problem most people do not know exists.

Here is the problem. When you send a prompt to a large language model, two very different things happen in sequence. First, the system processes your entire input: every token of your question, your context, your document, whatever you fed it. This is called prefill. It is computationally intense, embarrassingly parallel (meaning you can throw hundreds of processors at it simultaneously and they all stay busy), and it loves raw processing power. Think of prefill as reading an entire book at once. The more eyes you have, the faster it goes.

Then comes decode: the model generates its response, one token at a time, sequentially, like a person writing a sentence longhand. You cannot parallelize this because each word depends on the one before it. Decode is memory-bandwidth-hungry rather than compute-hungry in the traditional sense. The model needs to pull billions of parameters from memory thousands of times per second to figure out what the next token should be, and the speed at which it can do that determines how fast your answer arrives. Think of decode as a court stenographer: the bottleneck is how fast their fingers move across the keys and how quickly they can reference their shorthand, not how fast they think.

Here is the critical insight: these two jobs have almost opposite hardware requirements. Prefill wants lots of compute cores running in parallel. Decode wants massive memory bandwidth with low latency. Running both on the same chip is like hiring one person to be both your architect and your bricklayer. Technically possible. Never optimal. One side is always waiting on the other.
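
A toy sketch of the difference, nothing like a real transformer but enough to show the shape of the work: prefill is one batched matrix multiply over every prompt token at once, while decode is a loop that produces one token per step because each step depends on the previous one. (In a real model, decode is also rereading enormous weight and KV-cache tensors from memory on every step, which is exactly where the bandwidth bottleneck comes from.)

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, prompt_len = 64, 1000, 512

W = rng.standard_normal((d_model, d_model))   # stand-in for the model's weights
emb = rng.standard_normal((vocab, d_model))   # stand-in for the token embeddings

# Prefill: every prompt token can be processed in one batched matrix multiply.
prompt = rng.integers(0, vocab, prompt_len)
hidden = emb[prompt] @ W                      # (512, 64) handled in a single pass

# Decode: one token at a time, each step depends on the token just produced.
token, generated = prompt[-1], []
for _ in range(16):
    h = emb[token] @ W                        # a single row of work per step
    token = int(np.argmax(h @ emb.T))         # pick the next token
    generated.append(token)

print(generated)
```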

What AWS and Cerebras are doing is splitting the work. Trainium, Amazon’s custom AI chip designed for heavy parallel computation, handles prefill. The Cerebras CS-3 handles decode. The CS-3 is a genuinely unusual piece of hardware: it is built around a wafer-scale engine that stores entire model weights in on-chip SRAM rather than in external memory. This means it does not have to go off-chip to fetch parameters during decode. It already has them. The memory bandwidth advantage over traditional GPUs is not incremental. Cerebras claims it is thousands of times greater. Amazon’s Elastic Fabric Adapter network connects the two systems with low-latency, high-bandwidth links so the handoff between prefill and decode happens seamlessly.

The claimed result: an order of magnitude faster inference. Five times more high-speed token capacity in the same hardware footprint.

Now, why should anyone outside of a data center care?

Because of what is happening to AI workloads right now, in real production environments. The era of “ask a chatbot a question and wait for a paragraph” is already fading. What is replacing it is agentic AI: systems that do not just answer questions but execute multi-step tasks autonomously. An AI coding agent does not generate one response and stop. It writes code, runs it, reads the error, rewrites the code, runs it again, checks the test suite, refactors, and commits. Each iteration generates output tokens. Cerebras estimates that agentic coding workflows produce roughly 15x more output tokens per query than a standard conversational exchange. Reasoning models that “think” before answering by generating long internal chains of deliberation make this even worse. They might produce 500 tokens of silent reasoning before they ever show you a single word of their actual answer.

In that world, decode speed leaves the backend monitoring dashboard and becomes something your developers feel every time they trigger an agent and wait. Something your customers experience when an AI assistant takes eight seconds to respond instead of one. The difference between an agentic workflow that feels like magic and one that feels like waiting in line at the DMV. The models have gotten smart enough. Now the question is whether the pipes can keep up.

The question enterprise leaders should be asking has moved well beyond “which model are we using?” and toward “how fast can intelligence move through our company?” The velocity of intelligence through an organization is a boardroom question wearing engineering clothes. The companies that figure it out first will compound their advantage, because faster inference means faster iteration, which means faster learning, which means better products. Speed is a flywheel, and it starts at the silicon layer.


AWS + OpenAI: The Deal Behind the Deal

Two weeks before the Cerebras news, Amazon invested $50 billion in OpenAI. This was part of a $110 billion funding round, the largest private investment in history, that also included SoftBank and NVIDIA and valued OpenAI at $730 billion, which is a number that stops feeling real if you look at it for too long.

But the money, enormous as it is, is the least interesting part of this deal. Fifty billion dollars is what it costs to get a seat at the table. The interesting part is what is on the table: the architecture of the partnership itself, which carves up the AI cloud stack in a way that has very specific, very deliberate consequences for every company building on top of it.

Three things happened that matter.

The Stateful Runtime Environment. AWS and OpenAI are co-building a new kind of AI execution layer, available through Bedrock. To understand why this is significant, you need to understand a distinction that sounds technical but is actually the most important architectural divide in enterprise AI right now: stateless versus stateful.

A stateless API call is a one-shot transaction. You send a prompt, you get a response, the system forgets you exist. There is no memory, no continuity, no awareness that you asked it something five minutes ago. It is like calling a help desk where every agent is new and has never heard of you. This is what most AI products do today, and it works fine for simple tasks: summarize this document, translate this sentence, answer this question. Microsoft’s Azure retains exclusive rights to host these stateless API calls for OpenAI’s models. Even if a third party builds something using OpenAI and AWS, the stateless calls still route through Azure. That is locked in.

A stateful environment is fundamentally different. It remembers. It maintains context across sessions. It can pick up where it left off. It coordinates across multiple tools and data sources. It persists identity, meaning it knows who it is working for, what it has already done, and what it is supposed to do next. The gap between stateless and stateful is the gap between asking a stranger for directions and having an employee who knows the building, remembers last week’s meeting, and can log into your CRM.

Enterprise agent systems, the kind that do sustained, multi-step work inside a company like managing a sales pipeline or triaging support tickets or orchestrating a deployment, need stateful architecture. Without it, every interaction starts from zero, and you spend more compute re-establishing context than you do on the actual work. AWS just secured the right to build that stateful layer with OpenAI, co-developing the environment that will power the next generation of persistent AI agents for enterprise customers.
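
Sketched in code, with a placeholder complete() standing in for any model API (not a real SDK), the difference looks something like this: the stateless call forgets everything between requests, while the stateful session carries identity, tool grants, and memory from one turn to the next.

```python
def complete(prompt: str) -> str:
    """Stand-in for any model API call; not a real SDK."""
    return f"<response to: {prompt[:40]}...>"

# Stateless: every call starts from zero. All context must be re-sent each time.
def stateless_ask(question: str) -> str:
    return complete(question)

# Stateful: the session carries memory, identity, and tool grants across turns,
# so later steps can build on earlier ones instead of re-establishing context.
class AgentSession:
    def __init__(self, agent_id: str, tools: list[str]):
        self.agent_id = agent_id
        self.tools = tools
        self.memory: list[str] = []

    def ask(self, question: str) -> str:
        context = "\n".join(self.memory[-10:])   # recent history travels with the call
        answer = complete(f"[agent={self.agent_id}]\n{context}\n{question}")
        self.memory.append(f"Q: {question}\nA: {answer}")
        return answer

session = AgentSession("pipeline-triage", tools=["crm.read", "tickets.write"])
session.ask("Which deals stalled this week?")
print(session.ask("Draft follow-ups for the top three."))   # builds on the prior turn
```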

This is a major structural deal. AWS is positioning itself as the platform where enterprise agents live and persist, while Azure remains the platform where one-off API calls happen. The difference between those two things is the difference between selling someone a phone call and selling them a phone line.

OpenAI Frontier distribution. AWS is now the exclusive third-party cloud distribution provider for OpenAI Frontier, the enterprise platform OpenAI launched in February for deploying and managing teams of AI agents. Frontier is not a model. It is an operating layer. It connects AI agents to a company’s data warehouses, CRM systems, internal tools, and business processes. It handles governance, security, shared context, and access controls. Think of it as the HR department for AI agents: it onboards them, gives them the right permissions, makes sure they follow the rules, and coordinates their work.

Microsoft cannot sell Frontier to enterprises. AWS can. And if you believe the consensus view, that the next wave of enterprise value comes from deploying fleets of AI agents that operate semi-autonomously inside organizations, then distribution rights for the agent management platform matter far more than distribution rights for any individual model. Models are the engine. Frontier is the vehicle. AWS just got exclusive rights to sell the vehicle.

The Trainium commitment. OpenAI expanded its existing $38 billion AWS compute agreement by another $100 billion over eight years and committed to consuming 2 gigawatts of Trainium capacity, spanning both the current Trainium3 and the next-generation Trainium4 expected in 2027. To put 2 gigawatts in context: that is roughly the power output of two large nuclear reactors, dedicated entirely to running OpenAI workloads on Amazon’s custom silicon.

This matters because Trainium has been Amazon’s big bet against NVIDIA. For years, the market was skeptical. NVIDIA’s CUDA software ecosystem is so deeply entrenched that switching away from it requires rewriting significant portions of a codebase. Anthropic training Claude on Trainium was the first major validation. But Anthropic is financially entangled with Amazon (Amazon has invested at least $8 billion), so skeptics could dismiss it as a captive customer. OpenAI is not captive. It has deals with Microsoft, Broadcom, AMD, and NVIDIA. It could run anywhere. Choosing to stake 2 gigawatts on Trainium, with 900 million weekly active users depending on the output, is a bet with real downside if the silicon underperforms. That is the kind of validation money cannot buy. Or rather, it is the kind of validation that $100 billion in committed spend does buy, which is the point.

The territorial split could not be cleaner. Azure keeps stateless. AWS gets stateful and enterprise agent distribution. OpenAI gets compute diversity so no single provider can hold it hostage. None of the three clouds “won.” The AI stack got partitioned along architectural fault lines, and OpenAI concluded, correctly, that the era of monogamous cloud relationships is over.


Microsoft + OpenAI: The Divorce That Isn’t (Except It Kind Of Is)

And then there is Microsoft, which is doing what any rational actor does when their partner starts seeing other people: getting really serious about self-improvement.

In February, Mustafa Suleyman, Microsoft’s AI chief, DeepMind co-founder, and the person now tasked with making sure Microsoft’s entire product line does not depend on a company it cannot control, told the Financial Times that Microsoft is building its own frontier-grade foundation models. He called it “true AI self-sufficiency.” Not “complementary capabilities.” Not “strategic optionality.” Self-sufficiency. The language was chosen carefully. It means what it sounds like it means.

Here is what Microsoft is building. The company is developing its MAI model family, starting with MAI-1, a roughly 500-billion-parameter model trained on a 15,000 H100 GPU cluster, with plans for much larger training runs on dedicated infrastructure. It is constructing gigawatt-scale training clusters and building the Maia 200, its second-generation AI accelerator chip, specifically targeting inference economics (the cost of generating each token at scale). It is assembling what it describes as “some of the very best AI training teams in the world.” And it is planning to ship frontier-class models, competitive with the best from OpenAI, Anthropic, and Google, sometime this year.

Simultaneously, Microsoft has been quietly diversifying its model portfolio on Azure. It now hosts models from xAI, Meta, Mistral, Black Forest Labs, and Anthropic. Reports surfaced that Microsoft was using Anthropic’s Claude inside Microsoft 365 Copilot for certain Office tasks after internal testing showed it outperformed OpenAI’s models in those specific contexts. Let that sink in: Microsoft was paying AWS for access to a competitor’s model to power its own flagship productivity software, because the competitor’s model worked better for those tasks. When you are running A/B tests between your partner’s product and their rival’s product inside your own software, “partnership” is doing a lot of semantic heavy lifting.

The groundwork for this was laid in October 2025, when OpenAI restructured into a public benefit corporation. That restructuring was the quiet earthquake. Microsoft came out of the negotiation with a 27% stake (valued at roughly $135 billion), IP rights to OpenAI’s models through 2032 (including post-AGI models, with AGI determinations now made by an independent panel rather than OpenAI’s board, a point Microsoft fought hard for because the original deal let OpenAI unilaterally declare AGI and cut Microsoft off), and a $250 billion Azure services commitment from OpenAI. What Microsoft gave up was cloud exclusivity, the concession that made the AWS deal possible two months later.

What the restructuring actually did was convert a marriage into a joint venture with clearly defined exit ramps. Microsoft still benefits enormously if OpenAI succeeds. It owns 27% of the company and collects revenue share on everything OpenAI earns, including revenue generated through the AWS partnership. But it is no longer architecturally dependent on that success. It has models of its own in development. It has alternative model providers hosted on Azure. It has custom silicon for inference. If OpenAI stumbled tomorrow, Microsoft could keep Copilot running.

And there are real reasons to build that contingency. OpenAI’s annual burn rate is approaching $90 billion. It has over a trillion dollars in aggregate compute commitments it may or may not be able to honor. It is facing high-profile copyright litigation from the New York Times and others. It removed the word “safely” from its mission statement during the restructuring. And it is building a product, an agentic coding platform evolving out of Codex, that competes directly with GitHub, Microsoft’s $7.5 billion developer platform that was, until very recently, powered by OpenAI’s own models. When your technology partner starts building products that compete with your products, using the technology you are paying them to provide you, the incentive to develop alternatives becomes more than theoretical.

The rational move is to hedge. Microsoft is hedging. Aggressively, politely, and with a press strategy that involves saying “OpenAI has a huge role for us” in one sentence and “we are building frontier models for specific things” in the next. Both statements are true. Neither tells the whole story. The whole story is that Microsoft looked at the dependency graph, did the math on what single-provider risk looks like when the provider has never turned a profit and is simultaneously trying to raise more money than some countries’ GDP, and decided to build a parallel path. That is not a betrayal. That is procurement.

OpenAI, for its part, is not being left. It is leaving first, or at least leaving simultaneously. The AWS deal, the Cerebras partnership, the SoftBank billions, the Broadcom custom ASIC program, the AMD commitment: OpenAI is locking in compute from every direction because it has to. The models keep getting bigger, the agents keep getting hungrier, the reasoning chains keep getting longer, and no single cloud can feed the machine alone. OpenAI’s multi-cloud strategy is not philosophical. It is caloric. The organism needs more energy than any single host can provide.


What This Actually Means (For People Who Build Things)

Pull the camera back and these three stories are one story. The AI infrastructure layer is fracturing, not chaotically but strategically, along lines that will determine who builds what for the next decade. Speed is becoming a differentiator you can measure in tokens per second and feel in developer productivity. Model access is going multi-cloud by default, driven by necessity rather than preference. And the partnerships that looked permanent eighteen months ago are being renegotiated in real time by companies that have each concluded, independently, that depending on one partner for the foundation of your AI strategy is an unacceptable risk.

For the people actually making decisions, building products, choosing platforms, setting AI strategy inside organizations that have to ship things and answer to customers, four things are now true that were not true a year ago:

Speed is the bottleneck that determines everything downstream. If your agentic workflows are producing 15x the tokens of standard chat and your infrastructure cannot keep pace, your agents are only as fast as your slowest chip. This shows up in developer wait times, in customer-facing latency, in the number of agentic iterations you can run per hour, in the cost per task. The Cerebras + AWS disaggregated architecture is the first serious attempt to solve this at cloud scale: purpose-built hardware for each phase of inference, connected by high-speed networking, available through the same Bedrock console enterprises already use. Pay attention to what happens to cycle times and cost-per-token in organizations that adopt it versus those still running everything on general-purpose GPUs. The gap will not be marginal.
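
Some back-of-the-envelope arithmetic makes the point. The throughput and price numbers below are invented for illustration; only the 15x multiplier comes from above. At agentic token volumes, decode speed is the difference between a loop that feels interactive and one that feels like a queue.

```python
# Back-of-the-envelope only, with made-up numbers: what a 15x jump in output
# tokens does to wait time and cost per step at two different decode speeds.

chat_tokens    = 800                 # assumed output tokens for a simple chat reply
agentic_tokens = chat_tokens * 15    # the multiplier cited for agentic coding workflows
price_per_1k   = 0.01                # assumed $ per 1k output tokens

for label, tokens_per_sec in [("general-purpose GPU serving", 60),
                              ("high-bandwidth decode path", 600)]:
    wait_s = agentic_tokens / tokens_per_sec
    cost   = agentic_tokens / 1000 * price_per_1k
    print(f"{label:32s}  wait per step: {wait_s:5.1f}s   cost per step: ${cost:.3f}")

# Cost per step is identical; what changes is how many agentic iterations you
# can run per hour, and whether the loop feels interactive or like a queue.
```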

Stateful beats stateless for anything that actually matters in the enterprise. A chatbot can be stateless. An agent cannot. If your AI systems need to remember what they did yesterday, coordinate across tools, maintain identity, and build on prior context (and if you are building anything beyond a simple Q&A interface, they do), then you need a stateful runtime. The AWS-OpenAI Stateful Runtime Environment and Azure’s stateless API model represent different philosophies of how AI fits into an organization, not just different products. The platform decision you make now constrains what you can build for years. Choose accordingly.

Model lock-in is already dead. Act like it. Microsoft is building its own frontier models. OpenAI is running on both Azure and AWS. Anthropic trains on Trainium and is hosted on Azure. Google has Gemini on its own TPUs. Meta gives its models away for free. The idea that you pick one model provider and build your stack around it, “we are a GPT shop” or “we are a Claude shop,” is a strategy that will age like milk. The organizations that will win are the ones treating models as swappable components and investing in the orchestration layer above them: the routing, the evaluation, the fallback logic, the abstraction that lets you swap a model without rewriting your application. The model is the commodity. The system around it is the asset.
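
A minimal sketch of what that orchestration layer can look like at its thinnest, with placeholder providers rather than real SDK calls: the application talks to a router, and swapping or demoting a model is a one-line change to a routing table instead of a rewrite.

```python
from typing import Callable

# A thin routing layer: the application calls generate(), never a provider SDK
# directly. The providers here are placeholders, not real client libraries.

def provider_a(prompt: str) -> str: return f"[A] {prompt[:30]}..."
def provider_b(prompt: str) -> str: return f"[B] {prompt[:30]}..."

ROUTES: dict[str, list[Callable[[str], str]]] = {
    # task type -> ordered list of models to try (primary first, then fallbacks)
    "code":    [provider_a, provider_b],
    "summary": [provider_b, provider_a],
}

def generate(task_type: str, prompt: str) -> str:
    for model in ROUTES.get(task_type, ROUTES["summary"]):
        try:
            return model(prompt)
        except Exception:
            continue                  # fall through to the next provider
    raise RuntimeError(f"no provider available for task type {task_type!r}")

print(generate("code", "Write a function that diffs two configs"))
# Swapping models is an edit to ROUTES, not a rewrite of the application.
```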

Follow the silicon, because silicon follows strategy. When both Anthropic and OpenAI commit gigawatts to Trainium, and AWS pairs it with Cerebras for decode, and Microsoft builds Maia for inference economics, the custom silicon story has moved from speculative to structural. NVIDIA is not going anywhere. CUDA’s ecosystem lock-in is real and deep. But the supply-constrained world we live in, where demand for AI compute outstrips supply by multiples and a single company controlling the bottleneck is a risk every hyperscaler is now actively mitigating, means the chip landscape in 2027 will look nothing like it did in 2024. If your AI roadmap assumes NVIDIA GPUs are the only option, your contingency plan has a gap in it that your competitors are already filling.

The era of cozy, exclusive AI partnerships is ending. What is replacing it is messier, more competitive, and, for the people actually building on top of all this, ultimately better. The infrastructure is fracturing so the products can consolidate. The plumbing is getting complicated so the experience can get simple. The giants are hedging against each other so the builders can hedge against the giants.

Intelligence is powerful. Speed is strategic. And the companies that understand both are already setting the pace for the ones that do not.

What is your organization optimizing for: model quality, inference speed, or deployment flexibility? I would argue the answer has to be all three now. The window where you could pick one and defer the others is closed.


TLDR

AWS and Cerebras are splitting inference into two specialized stages (prefill on Trainium, decode on CS-3) to push token output speed up by an order of magnitude, which matters because agentic workloads generate 15x more tokens than standard chat. Amazon invested $50 billion in OpenAI and secured exclusive rights to distribute OpenAI’s enterprise agent platform (Frontier) and co-build a Stateful Runtime Environment through Bedrock, while Azure retains stateless API exclusivity, effectively carving the AI cloud stack in two along architectural lines. Meanwhile, Microsoft is building its own frontier-grade models (MAI), its own inference chip (Maia 200), and quietly running Anthropic models inside Copilot where they outperform OpenAI, because when your partner is burning $90 billion a year, competing with your own GitHub, and signing mega-deals with your biggest cloud rival, you stop calling it a partnership and start calling it a hedging strategy. The takeaway for enterprise leaders: speed is now a strategic bottleneck, stateful architecture is where agents actually live, model lock-in is dead, and custom silicon has gone from speculative to structural. Pick your platforms accordingly, because the days of choosing one model, one cloud, or one chip and calling it a strategy are over.


← BACK TO ALL POSTS