Inside the Tooling Choices of Twelve Frontier AI Teams

We try to publish at least one survey piece per issue, and this is the one for Q2. For each of twelve teams building agentic AI products in production, we sat down with the team, read its public engineering writeups, or audited a small sample of its code. Below we walk through the patterns we found, the divergences that surprised us, and what the choices reveal about how teams are actually thinking in mid-2026.

We are anonymizing the teams. Specific company names in a tooling survey produce noise of two kinds: vendor relationships that color the writeup, and reader assumptions about teams the writeup is not really about. The patterns are more honest than the names. What we are publishing is the aggregate picture, with specific divergences called out where they teach something.

We are going to skip the percentages, as is our habit, because we did not measure across enough teams to claim them. “Eight of twelve teams” means eight of twelve in our cohort, not eight of twelve in the market. Treat the numbers as illustrative.

The model layer: two-tier, with a smaller third tier on hot paths

Twelve of twelve teams are running a multi-model setup. The shape that is now nearly universal is a frontier model from a top lab for reasoning-heavy work, a smaller and cheaper model from the same or a different lab for high-volume classification and routing, and — increasingly — a small self-hosted or low-latency model for the absolute hottest paths.

The interesting divergence is in the second and third tiers. The cheaper model in tier two is, in most cases, from the same lab as the frontier model. Teams have decided the operational simplicity of one vendor relationship outweighs the marginal cost savings of mixing. The third tier, where it exists, is more often self-hosted because the use case demands either privacy or sub-100ms latency. Several teams told us they reached for a self-hosted model to handle a single high-volume routing call and would not bother for anything less hot.
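
To make the tiering concrete, here is a minimal sketch of a routing policy in Python. It is not any surveyed team's code. The tier names, the request fields, and the request kinds are assumptions made for illustration; the 100 ms cutoff simply mirrors the sub-100ms figure above.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"        # reasoning-heavy work
    SMALL_HOSTED = "small"       # high-volume classification and routing
    SELF_HOSTED = "self_hosted"  # hottest paths: privacy or tight latency


@dataclass(frozen=True)
class Request:
    kind: str                    # hypothetical labels: "reason", "classify", "route"
    latency_budget_ms: int       # how long the caller is willing to wait
    contains_private_data: bool = False


def pick_tier(req: Request) -> Tier:
    """Choose a model tier for one request (illustrative policy only)."""
    # Privacy or a sub-100 ms budget pushes the call to the self-hosted tier.
    if req.contains_private_data or req.latency_budget_ms < 100:
        return Tier.SELF_HOSTED
    # Cheap, high-volume work goes to the smaller hosted model.
    if req.kind in {"classify", "route"}:
        return Tier.SMALL_HOSTED
    # Everything else is treated as reasoning-heavy.
    return Tier.FRONTIER
```

Keeping the policy in one small function means the decision about which calls are hot enough for tier three is written down in one place and can be tested like any other code.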

What the choice reveals: the model layer has commoditized for most use cases. The interesting work has moved up the stack.

The orchestration layer: opinionated rather than off-the-shelf

Ten of twelve teams have written their own orchestration on top of an open-source library or a platform. Two of twelve are running directly on what their chosen platform provides without significant customization.

The pattern in the ten-of-twelve case is consistent. The team picks an open-source orchestration library (one of the major three or four), writes a thin layer that captures the team’s vocabulary for agents and handoffs, and treats the library as a primitive rather than an architecture. The reasoning we heard repeatedly: “the library moves faster than our product, so we have to be able to upgrade independently.”
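
A minimal sketch of what such a thin layer might look like, assuming a hypothetical library whose only surface the product touches is a run(agent_name, payload) call. LibraryRunner, Handoff, Orchestrator, and FakeRunner are all names invented for this example and do not correspond to any specific library.

```python
from dataclasses import dataclass
from typing import Protocol


class LibraryRunner(Protocol):
    """The only part of the third-party library the product depends on (hypothetical)."""

    def run(self, agent_name: str, payload: dict) -> dict: ...


@dataclass(frozen=True)
class Handoff:
    """Team vocabulary: one agent passing work to another."""

    from_agent: str
    to_agent: str
    payload: dict


class Orchestrator:
    """Thin layer: product code talks to this class, never to the library directly."""

    def __init__(self, runner: LibraryRunner) -> None:
        self._runner = runner

    def hand_off(self, handoff: Handoff) -> dict:
        # Translate the team's concept into the library's primitive call.
        result = self._runner.run(handoff.to_agent, handoff.payload)
        return {"agent": handoff.to_agent, "result": result}


class FakeRunner:
    """Stand-in for the real library, used here so the sketch runs on its own."""

    def run(self, agent_name: str, payload: dict) -> dict:
        return {"handled_by": agent_name, "echo": payload}
```

Under this shape, a library upgrade touches only the adapter that satisfies LibraryRunner; the product keeps calling hand_off unchanged.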

The two-of-twelve case is more interesting. Both teams are running on packaged agentic workforce platforms. Both teams told us their reasoning was that the platform’s opinions are good enough that customizing them was not worth the engineering. Both teams ship more product per engineer than the ten-of-twelve average, on the same kind of work. The sample is small but suggestive.

What the choice reveals: orchestration is where the platform-vs-stitched argument is being decided right now. Teams that pick a strong platform and accept its opinions move faster. Teams that pick primitives and customize get more control over the details. Both are defensible.

State and memory: relational by default, vector for RAG only

Eleven of twelve teams have a Postgres or equivalent relational store as the durable memory of their system. The twelfth team is on a managed key-value store, which they characterized as “the wrong choice we have not yet corrected.”

The pattern around vector databases has clarified meaningfully. Vector stores are used for retrieval-augmented generation over unstructured corpora. They are not used as the system’s primary memory of what the user said or what the engagement is about. Several teams told us they had moved their “memory” out of vectors and into Postgres in the last twelve months, and that the move had cut their bug rate in this area.
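
As an illustration of typed durable memory in a relational store, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Postgres so that it runs anywhere; the engagement_fact table and the Fact type are assumptions for the example, not a schema we saw at any team.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    engagement_id: str
    key: str
    value: str


SCHEMA = """
CREATE TABLE IF NOT EXISTS engagement_fact (
    engagement_id TEXT NOT NULL,
    key           TEXT NOT NULL,
    value         TEXT NOT NULL,
    PRIMARY KEY (engagement_id, key)
)
"""


def open_store() -> sqlite3.Connection:
    # In-memory database for the sketch; a real system would connect to Postgres.
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def remember(conn: sqlite3.Connection, fact: Fact) -> None:
    # INSERT OR REPLACE is SQLite syntax; Postgres would use
    # INSERT ... ON CONFLICT (...) DO UPDATE for the same upsert.
    conn.execute(
        "INSERT OR REPLACE INTO engagement_fact (engagement_id, key, value) "
        "VALUES (?, ?, ?)",
        (fact.engagement_id, fact.key, fact.value),
    )
    conn.commit()


def recall(conn: sqlite3.Connection, engagement_id: str) -> list:
    # Exact lookup by key: no similarity search, no ranking, no ambiguity.
    rows = conn.execute(
        "SELECT engagement_id, key, value FROM engagement_fact "
        "WHERE engagement_id = ? ORDER BY key",
        (engagement_id,),
    )
    return [Fact(*row) for row in rows]
```

The point of the sketch is the contrast with similarity search: a fact is either present under its key or absent, and an update overwrites the old value rather than sitting beside it.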

The ephemeral memory layer — scratch state during a single task — is varied. Redis is the most common. A few teams are using the orchestration library’s built-in session object. Two are using a custom store backed by an object store, which we would not recommend in general but seems to work for their specific case.

What the choice reveals: the early-2024 vector-DB hype has fully cooled in production. The discipline of typed durable state in a relational store has won.

The tool layer: MCP, with holdouts

Seven of twelve teams are using MCP for at least one tool integration. Of those seven, four are using MCP for substantially all of their integrations. Three are using it for one or two integrations and rolling their own for the rest. Five teams are not using MCP at all.

The non-MCP teams split into two camps. The first camp ships against a single external integration (their own customer’s API) and considers MCP overkill for one integration. The second camp evaluated MCP, decided the indirection cost was too high for their architecture, and rolled a custom JSON-RPC layer. Both are defensible choices, though we would want to look carefully at the second camp’s architecture before agreeing that the choice was right.
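
For readers who have not seen one, a custom JSON-RPC tool layer can be quite small. The sketch below is our own illustration of a JSON-RPC 2.0 dispatcher, not the second camp's code. The tool registry and the "add" tool are hypothetical, and the sketch deliberately ignores batches and notifications.

```python
import json
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}  # hypothetical in-process tool registry


def tool(name: str):
    """Register a function as a callable tool under the given method name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("add")
def add(a: int, b: int) -> int:
    return a + b


def _error(req_id, code: int, message: str) -> str:
    body = {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}
    return json.dumps(body)


def handle(raw: str) -> str:
    """Handle one JSON-RPC 2.0 request string and return the response string."""
    try:
        req = json.loads(raw)
    except json.JSONDecodeError:
        return _error(None, -32700, "Parse error")
    if not isinstance(req, dict) or req.get("jsonrpc") != "2.0" or "method" not in req:
        return _error(None, -32600, "Invalid Request")
    req_id = req.get("id")
    fn = TOOLS.get(req["method"])
    if fn is None:
        return _error(req_id, -32601, "Method not found")
    params = req.get("params", {})
    try:
        if isinstance(params, dict):
            result = fn(**params)   # named parameters
        elif isinstance(params, list):
            result = fn(*params)    # positional parameters
        else:
            return _error(req_id, -32602, "Invalid params")
    except TypeError:
        # Simplification: a TypeError raised inside the tool is also reported here.
        return _error(req_id, -32602, "Invalid params")
    return json.dumps({"jsonrpc": "2.0", "id": req_id, "result": result})
```

The error codes are the standard ones from the JSON-RPC 2.0 specification. What a layer like this leaves out (discovery, schemas, transport, auth) is roughly the indirection MCP adds, which is the trade the second camp decided against.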

The MCP-using teams’ biggest open question is operational: how to deploy and monitor MCP servers in production. We have written about this in our MCP-in-anger piece and will not repeat the lessons here.

What the choice reveals: the protocol layer is consolidating around MCP, but it is not universal yet. The operational tooling for MCP servers is the gap.

Observability: under-invested, but improving

Six of twelve teams have an observability layer that captures every LLM call with full prompt-and-response detail. The other six have partial capture, usually high-stakes calls only or a sampled subset of all calls. All twelve have at least some structured logging of LLM calls; the era of “we just log to stdout” appears to be over.

The teams with full capture are predominantly the ones in regulated industries or with regulated-industry customers. The teams with partial capture are the ones who have weighed the cost of storage against the value of the data and decided the data was not worth keeping at full fidelity. Both choices are reasonable. The teams who do not know which choice they are making are the ones who will be surprised by a production incident.
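
One way to make sure the choice is deliberate is to write the capture policy down as code. The following is a minimal sketch under our own assumptions: the CapturePolicy name, its fields, and the 10 percent default sample rate are illustrative and not taken from any team.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CapturePolicy:
    full_capture: bool = False   # keep every call at full fidelity
    sample_rate: float = 0.1     # fraction of ordinary calls kept otherwise (assumed default)

    def should_capture(self, high_stakes: bool, rng: random.Random) -> bool:
        """Decide whether to store the full prompt and response for one call."""
        # Full-capture teams and high-stakes calls always keep the record.
        if self.full_capture or high_stakes:
            return True
        # Otherwise keep a random sample of ordinary calls.
        return rng.random() < self.sample_rate
```

A team with regulated customers would set full_capture to True; a team that has priced out storage would tune sample_rate. Either way the decision is visible in review rather than implied by whatever the logging library happens to do.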

The tools used here are varied. Several teams are using purpose-built LLM observability platforms. Several are using general-purpose observability tools with custom integrations. A few are running their own capture pipeline into a data warehouse, which one team’s data engineer characterized as “the worst part of my job.”

What the choice reveals: observability for LLM calls is mature enough that no team has to build it from scratch, but immature enough that no team is fully happy with what they have.

Eval harnesses: present, but underused

Eight of twelve teams have an eval harness that runs on every PR. The other four run evals on a slower cadence: nightly, pre-release, or manually before deploys. All twelve run evals in some form.

The depth of the eval suites varies more than the presence of the harness. The teams whose evals catch real regressions have suites that include both deterministic checks (did the system call this tool, did the output validate against this schema) and judgement-based checks (does the output match this rubric, does a judge model approve it). The teams whose evals do not catch real regressions have suites that are dominated by judgement-based checks alone, which tend to be noisy.

The discipline that distinguishes teams that get value from evals is unglamorous: they add an eval for every production bug they catch. The eval suite grows with the team’s accumulated bug history. The teams that do not add evals when bugs happen end up with eval suites that do not reflect the actual failure modes of the system.
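
The mix of check types, and the habit of tying each case to a bug, can be sketched in a few lines. Everything named here (Transcript, EvalCase, the check helpers, the source_bug field) is hypothetical, and the judge is passed in as a plain function so that the sketch does not assume any particular judge-model API.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Transcript:
    tool_calls: List[str]   # names of tools the agent called, in order
    output: str             # final text the agent produced


Check = Callable[[Transcript], bool]


def called_tool(name: str) -> Check:
    """Deterministic check: did the system call this tool?"""
    return lambda t: name in t.tool_calls


def output_is_json_with(keys: List[str]) -> Check:
    """Deterministic check: is the output a JSON object with the required keys?"""
    def check(t: Transcript) -> bool:
        try:
            data = json.loads(t.output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and all(k in data for k in keys)
    return check


def judge_approves(judge: Callable[[str], float], threshold: float) -> Check:
    """Judgement-based check: a judge function scores the output from 0.0 to 1.0."""
    return lambda t: judge(t.output) >= threshold


@dataclass(frozen=True)
class EvalCase:
    name: str
    checks: List[Check]
    source_bug: Optional[str] = None   # identifier of the production bug that motivated the case


def run_case(case: EvalCase, transcript: Transcript) -> bool:
    # A case passes only if every check passes.
    return all(check(transcript) for check in case.checks)
```

In a suite built this way, the deterministic checks carry most of the regression-catching weight, and the source_bug field makes it easy to see whether the suite is actually growing with the bug history.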

What the choice reveals: the eval ecosystem has matured. The discipline of using it well is still uneven across teams.

Human-in-the-loop primitives: where the platform path wins

Five of twelve teams have a card-based human surface for reviewing agent outputs. Four have chat-based surfaces. Three have a hybrid where some surfaces are card-based and others are chat-based.

The five with card-based surfaces uniformly told us they wished they had built it from day one. The four with chat-based surfaces uniformly told us they were planning a migration to a card-based surface. The three with hybrid surfaces told us the chat-based parts of their UI were the parts they got the most complaints about from the human operators.

This is the slot where the agentic workforce platforms are most clearly ahead of the stitched-stack teams. The teams running on platforms with card surfaces as a primitive ship faster on this dimension; the teams writing their own surface end up rebuilding the same primitives badly. We have written about this elsewhere; the survey confirms it.

What the choice reveals: the human surface is the slot where the largest amount of unappreciated work happens. The teams who have invested in it are happy; the teams who have not are slowly realizing they will need to.

Deployment: boring, on purpose

The deployment stories are remarkably consistent. Eleven of twelve teams deploy out of GitHub. Eight of twelve use Railway or Render or Fly for the long-running services. The rest are on a mix of cloud-provider PaaS offerings or on Kubernetes (the two teams on Kubernetes are both on it because their company-wide policy requires it, not because they chose it for the agentic stack).

The common pattern is: monorepo in GitHub, CI on GitHub Actions, deploys to a PaaS for the services, a managed Postgres for the durable state, a managed Redis for the ephemeral state, and a managed object store for files. Almost nothing exotic. Almost no shared infrastructure across teams beyond the obvious managed services.

What the choice reveals: deployment is a solved problem for agentic stacks in 2026, as long as the team has the discipline to not over-engineer it.

The pattern under the patterns

If you read this survey as a single story, the story is that the agentic stack is consolidating around a small set of conventional choices, and the teams that ship are the teams that have made those choices early and stopped relitigating them. The interesting work is happening up the stack: in how teams configure the orchestration, structure their evals, design their human surfaces, and organize their human team to operate the system. The infrastructure is increasingly boring. The org chart is increasingly the differentiator.

The teams in our cohort who are most clearly ahead of the curve are the ones whose engineering investment is most concentrated on the platform layer rather than on the integration layer. The teams who are still gluing every layer of their stack from primitives are doing the work the consolidating ecosystem will let them stop doing. Whether they realize it yet is a separate question.

The fully converged version of this pattern, a team that ships its agentic platform as a product and also runs an agency on top of it, has emerged at a handful of shops over the last year. It is one of several patterns converging on the platform-as-default model.

We will keep doing this survey. If your team would like to be included in the next round, anonymously or otherwise, the contributors page has the contact details. The patterns are more interesting when more teams contribute to the picture.

— The Editorial Team