Vibe Coding for Teams — From Karpathy's Tweet to Production

I want to start by saying what Andrej Karpathy actually said, because eighteen months of industry coverage have turned the original tweet into a Rorschach test. In February 2025 he wrote a short post coining “vibe coding” — letting an AI agent write most of the code while the human stayed at the wheel by intention rather than by line. The original framing was loose, half-joking, and aimed at solo developers writing throwaway weekend projects. The framing now in mid-2026 — entry-level postings asking for “vibe coding developer skills,” industry magazines like vibecoding.app and blog.mean.ceo running weekly issues, hiring titles like “Vibe Growth Marketing Manager” at Ramp — is a long way from the tweet.

The distance has not been kind to the discipline. Karpathy himself has spent much of 2026 publicly worrying that the result is “slop”: confidently wrong code shipped by people who could not have written it themselves and cannot now read it. He is right about a subset of the output and wrong about the category. The teams that do vibe coding well in 2026 produce shippable, maintainable software at a pace that would have been impossible two years ago. The teams that do it badly produce slop. The difference is workflow, not tooling.

This piece is about that workflow: what vibe coding looks like inside an engineering team that ships, written from inside two teams that have been running the discipline for over a year.

What “vibe coding” means in 2026, in practice

The working definition is narrower than the magazines imply and wider than the original tweet. Vibe coding is the discipline of driving an agent-shaped tool by intent and review, not by line-by-line editing. The engineer specifies what should happen. The agent produces the code. The engineer reviews, redirects, and ships. There are three load-bearing parts of that definition: the intent is precise, the agent is the one writing, and the review is rigorous.

Compare three workflow shapes you will see in the wild.

The traditional shape is line-by-line editing with autocomplete. The engineer writes the code; the autocomplete suggests the next token; the engineer accepts, rejects, or types over. This is the Copilot-circa-2023 shape. The agent is a junior copy-editor.

The slop shape is intent-and-accept. The engineer says “build me a thing”; the agent produces five hundred lines; the engineer runs it, sees it works on the happy path, and ships. This is the shape Karpathy is right to worry about. The agent is the senior engineer; the human is rubber-stamping.

The disciplined vibe coding shape is intent-then-direct-then-review. The engineer specifies the intent in enough detail that they could have written it themselves; the agent drafts the code; the engineer reads every line, redirects on architectural decisions, accepts the mechanical work, and runs the tests they themselves wrote. The agent is doing the typing. The human is doing the engineering.

That third shape is the one the working teams use. It is also the one the magazines undersell, because “intent-then-direct-then-review” does not produce a viral tweet. It does produce shipped software.

The job-posting question — is this a real skill or a meme?

There are now real job postings that list “vibe coding” as a required skill, often at the entry level. The cynical read is that the requirement is a meme. The disciplined read is that hiring managers have noticed something genuinely shifted in the productivity distribution of new engineers and are trying to write down what they noticed. The trouble is that they are writing it down in the language of the most viral version of the phenomenon, which dilutes it.

What the working hiring managers actually want, when they post a “vibe coding skills required” requisition:

The engineer can decompose a feature into agent-sized tasks. This is the most undertaught skill in the discipline. A task is agent-sized when the intent is precise enough to specify in two paragraphs, the surface area is bounded enough that the agent can read the relevant code in a single context window, and the success criterion is something a test or a human can check. Engineers who can do this naturally are roughly four times as productive with agents as engineers who cannot.

The engineer can read code they did not write. This is the rate-limiter on the whole discipline. An engineer who can read three hundred lines of agent-produced code and tell you, within five minutes, where the bugs are can drive an agent at full speed. An engineer who cannot will produce slop at full speed.

The engineer knows when to override the agent. The disciplined version of vibe coding involves rejecting the agent’s first suggestion frequently. The slop version accepts it. Hiring managers are looking for the former — engineers whose review instinct fires reliably on the wrong-shaped solution.

The engineer maintains a discipline around tests, evals, and source control. Vibe coding without tests is slop. Vibe coding without evals is slop at the system level. Vibe coding without source-control discipline is slop you cannot back out of. The teams that produce maintainable agent-driven code are the teams that have not let the agent’s velocity erode the surrounding hygiene.

A reasonable job posting in 2026 would list those four bullets and avoid the phrase “vibe coding” entirely. The fact that the phrase is showing up in entry-level requisitions is partly fashion and partly a real signal that the discipline has consolidated into something hireable.

A working workflow, written down

I am going to write down the workflow two teams I have spent time with actually use. Both teams are five to fifteen engineers. Both ship to production weekly. Both have been running this discipline for over a year and have a sense of what works and what breaks.

The workflow starts with a task brief. The brief is the engineer’s intent, written down in two or three paragraphs, in a markdown file or a ticket. The brief specifies: what the change should accomplish, where in the codebase it should sit, what existing patterns to follow, what to explicitly not do, and the test or eval that will tell us the work is done. The brief is the load-bearing artifact. A vague brief produces vague code. A precise brief produces precise code. Teams that skip this step or try to specify intent inside the agent conversation produce slop at twice the rate of teams that write the brief first.
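One way to keep briefs honest is to give them a fixed shape and check that no part is empty before the agent is launched. The following is a minimal sketch, not a format either team uses: the `TaskBrief` class, its field names, and the example values are all hypothetical and simply mirror the five things the brief is supposed to specify.

```python
from dataclasses import dataclass, field


@dataclass
class TaskBrief:
    """Hypothetical structure mirroring the five parts of a task brief."""
    goal: str                       # what the change should accomplish
    location: str                   # where in the codebase it should sit
    patterns_to_follow: list[str] = field(default_factory=list)
    do_not: list[str] = field(default_factory=list)   # explicit non-goals
    done_when: str = ""             # the test or eval that says the work is done

    def missing_parts(self) -> list[str]:
        """Return the names of the parts that are still empty."""
        parts = {
            "goal": self.goal.strip(),
            "location": self.location.strip(),
            "patterns_to_follow": self.patterns_to_follow,
            "do_not": self.do_not,
            "done_when": self.done_when.strip(),
        }
        return [name for name, value in parts.items() if not value]

    def render(self) -> str:
        """Format the brief as plain text to hand to the agent."""
        lines = [
            f"Goal: {self.goal}",
            f"Location: {self.location}",
            "Follow: " + "; ".join(self.patterns_to_follow),
            "Do not: " + "; ".join(self.do_not),
            f"Done when: {self.done_when}",
        ]
        return "\n".join(lines)


# Illustrative values only; a real brief would name real files and tests.
draft = TaskBrief(
    goal="Add a billing cap per organization",
    location="Organization model",
)
print(draft.missing_parts())  # the parts still to be written
```

A brief that still reports missing parts is a signal to keep writing, not to start the session.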

The brief then becomes the agent prompt. The engineer hands the brief to a coding agent — usually Claude Code in the teams I am writing about, occasionally Cursor’s agent mode for IDE-native work — and tells the agent to produce a plan. The plan step is non-negotiable. The agent produces a plan; the engineer reads the plan; the engineer either accepts it, redirects it (“no, do not refactor the auth module, just add the field”), or rejects it entirely and rewrites the brief. The plan step is where the engineer’s architectural instinct earns its keep.

# A typical Claude Code session start, lightly edited
$ claude
> read briefs/2026-05-22-add-org-billing-cap.md
> read the relevant files and produce a plan, do not edit anything yet

# Agent produces a plan as a numbered list of file changes.
# Engineer reads, redirects.

> step 3 is wrong — the cap belongs on the Organization model not on the Billing model.
> redo the plan with that change.

# Engineer reviews revised plan, approves.

> proceed. run the test suite after each step.

The agent then executes. This is the part that looks the most like the slop shape from the outside and is the least like it inside. The engineer is not idle. They are reading the diff as it appears, watching the test output, and stopping the agent when something looks wrong. The agent is doing the typing; the engineer is doing the steering. A good session has the engineer interrupting the agent three or four times in the course of a one-hour task. A slop session has the engineer interrupting it zero times.

When the agent finishes, the engineer reviews the entire diff by hand. Every line. This is the step that distinguishes the disciplined version of the workflow from the slop version. If you skipped this step on hand-written code, you would be a junior. If you skip this step on agent-written code, you are producing slop at higher velocity. The review is non-negotiable.

The engineer runs the tests — the existing test suite, the new tests the brief specified, and any evals the team maintains for the relevant subsystem. If anything fails, the engineer either fixes it themselves or hands it back to the agent with a precise correction. If everything passes, the engineer commits. The commit message is written by the human or written by the agent and edited by the human. Either way the human is the one putting the commit on the wire.
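The "everything passes, then commit" rule can be made mechanical with a small gate script. This is a sketch under assumptions: the function name `run_gates` and the commands listed in `GATES` are placeholders, and each team would substitute its own test and eval commands.

```python
import subprocess
import sys


def run_gates(commands: list[list[str]]) -> bool:
    """Run each command in order and stop at the first non-zero exit code."""
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"gate failed: {' '.join(cmd)}", file=sys.stderr)
            return False
    return True


# Hypothetical commands: the existing suite, the new tests named in the
# brief, and the subsystem evals. Replace with whatever the team runs.
GATES = [
    ["pytest", "-q"],
    ["pytest", "-q", "tests/test_billing_cap.py"],
    [sys.executable, "evals/run_billing_evals.py"],
]

# Typical use from a pre-commit step:
#     if not run_gates(GATES):
#         sys.exit(1)
```

Stopping at the first failure keeps the feedback precise enough to hand straight back to the agent as a correction.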

Finally, the engineer opens a PR and the PR goes through normal code review. The reviewer does not need to know whether the code was agent-written or human-written. The reviewer reviews the code on its merits. If the code looks rushed, sloppy, or wrong, the reviewer kicks it back. The agent-written shape is sometimes detectable, but a well-driven agent produces code that reads like a competent engineer wrote it. The detection question is also the wrong question. The right question is whether the code is good.

That is the workflow. It is not complicated. It is also not the slop shape, and the difference is mostly that the engineer never stops being the engineer.

The infrastructure that makes the workflow work

There is an infrastructure layer that the workflow above assumes. Teams that have it ship at the velocity the agent-tool marketing implies. Teams that do not have it produce slop and blame the tool.

A test suite that actually runs and actually catches regressions. This is the load-bearing piece. The agent’s incentive is to produce code that runs; the test suite’s job is to assert that the code does the right thing. A team whose tests run in two minutes, cover the load-bearing paths, and fail loudly when something regresses can let the agent run at full speed. A team whose tests are slow, flaky, or missing on the parts of the system that matter cannot. Before adopting vibe coding as a team discipline, audit the test suite. Fix what is broken.

An eval harness for the parts of the system that have non-deterministic outputs. If your system has LLM calls in it — and in 2026, most systems do — the test suite cannot tell you whether the LLM is producing the right kind of output. That is what an eval harness is for. Without it, the agent can introduce subtle regressions in your system’s behavior that nobody will catch until a user complains. This is doubly true of code that itself wraps an LLM.
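At its smallest, an eval harness samples the non-deterministic component several times per case and reports a pass rate rather than a single pass or fail. The sketch below assumes a `generate` callable standing in for the real LLM call. The case format, the `run_eval` name, and the stand-in generator are illustrative and not drawn from either team's harness.

```python
from typing import Callable

# A case pairs a prompt with a predicate over the model's output.
Case = tuple[str, Callable[[str], bool]]


def run_eval(generate: Callable[[str], str], cases: list[Case], samples: int = 5) -> float:
    """Return the fraction of sampled outputs that satisfy their case's check."""
    passed = 0
    total = 0
    for prompt, check in cases:
        for _ in range(samples):
            total += 1
            if check(generate(prompt)):
                passed += 1
    return passed / total if total else 0.0


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "REFUND: approved" if "refund" in prompt else "I am not sure."


cases: list[Case] = [
    ("customer asks for a refund", lambda out: out.startswith("REFUND:")),
    ("customer asks about the weather", lambda out: out.startswith("REFUND:")),
]
score = run_eval(fake_generate, cases, samples=3)
print(f"pass rate: {score:.2f}")
```

A team would then pick a threshold for the pass rate and treat a drop below it the same way it treats a failing test.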

Source-control hygiene. Every agent session ends with a commit. Every commit is reviewable. Every PR is small enough to fit in a reviewer’s head. Teams that let the agent produce one-thousand-line PRs are producing slop by definition. The discipline of small, reviewable commits is the discipline that lets you back out of a bad agent session without losing the day.
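The "small enough to fit in a reviewer's head" rule can be approximated with a line-count check on the output of `git diff --numstat`, which prints added lines, deleted lines, and path separated by tabs, with `-` in the count columns for binary files. The 400-line threshold and the function names below are assumptions for illustration; the right limit is a team decision.

```python
def changed_lines(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of counts
        total += int(added) + int(deleted)
    return total


def pr_is_reviewable(numstat: str, max_lines: int = 400) -> bool:
    """True when the change is at or under the assumed size threshold."""
    return changed_lines(numstat) <= max_lines


# Example numstat output with hypothetical paths.
sample = "10\t2\tapp/models.py\n-\t-\tassets/logo.png\n5\t0\ttests/test_cap.py\n"
print(changed_lines(sample), pr_is_reviewable(sample))
```

Line count is a crude proxy, but it is cheap to enforce and it catches the thousand-line PR before a reviewer has to.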

A house style that the agent can read. The agents are good at following a CLAUDE.md or an AGENTS.md or a .cursorrules file. If you tell them what your conventions are, they will mostly follow them. If you do not, they will mostly produce code in whatever style the model was trained on, which is approximately “the average of GitHub.” Most teams have a house style that is sharper than the GitHub average and should write it down for the agent. This pays back in less review time per agent-written PR.

An MCP server or two for the integrations the agent will need. When the agent needs to look up an issue in Linear, post a message to Slack, or read a runtime metric from Datadog, the path from the agent to the integration matters. MCP is the protocol that lets the agent do this without a brittle wrapper. The teams I am writing about have invested in MCP servers for their internal tools and treat them as a piece of developer infrastructure. The investment pays back, in the same way that good internal tooling has always paid back.

Where the slop comes from

A diagnostic. If your team is producing agent-written code and the code is slop, the cause is almost always one of these.

The brief was vague. The engineer told the agent “fix the billing bug” and the agent did its best. Its best was wrong. The fix is to write down the actual intent (what changes, where, and why) before launching the agent. Two paragraphs of brief produce twenty minutes of useful work. No brief produces an hour of slop.

The engineer did not read the diff. The agent produced two hundred and fifty lines; the engineer scanned the first twenty, saw they looked plausible, and merged. The fix is to read the entire diff. Every line. If the diff is too long to read, the task was too big. Break it up.

The tests do not actually cover the regression. The agent produced code that passes the tests; the code is also wrong; the tests did not assert the thing that matters. The fix is to add the test that would have caught the regression. This is the practice that compounds. Every time the agent ships a regression that the tests missed, write the test that would have caught it. After six months the test suite is a force-multiplier on the agent’s accuracy.
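As an illustration of "write the test that would have caught it", suppose a billing-cap helper once allowed a charge to push spending past the cap. The function `apply_cap`, its signature, and the scenario are invented for this sketch. The point is only the shape of the test, which pins the boundary that the earlier tests left unasserted.

```python
def apply_cap(charge_cents: int, spent_cents: int, cap_cents: int) -> int:
    """Return the portion of a charge that fits under the remaining cap."""
    remaining = max(cap_cents - spent_cents, 0)
    return min(charge_cents, remaining)


def test_charge_never_exceeds_cap():
    # Partially fits: only the remaining 200 cents may be billed.
    assert apply_cap(500, 9_800, 10_000) == 200
    # Exactly at the cap: nothing more may be billed.
    assert apply_cap(500, 10_000, 10_000) == 0
    # Already over the cap: the result must not go negative.
    assert apply_cap(500, 12_000, 10_000) == 0
```

Each such test is small, but it encodes one specific way the agent has been wrong before, which is what lets the suite compound.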

The agent was given too much freedom. The engineer asked for a feature and got an architectural refactor as a bonus. The fix is to constrain the agent’s scope explicitly. “Only edit these files. Do not touch the auth module. Do not refactor the API surface.” Agents that are constrained produce constrained changes. Agents that are not produce sprawling ones.
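Scope constraints stated in the prompt can also be checked after the fact against the list of files the session touched. A minimal sketch, assuming glob-style patterns and hypothetical paths; the function name `out_of_scope` is not from any tool. Note that `fnmatch` lets `*` match across `/`, so `app/billing/*` covers nested files too.

```python
from fnmatch import fnmatch


def out_of_scope(changed: list[str], allowed: list[str], forbidden: list[str]) -> list[str]:
    """Return changed paths that are forbidden or not covered by any allowed pattern."""
    violations = []
    for path in changed:
        is_forbidden = any(fnmatch(path, pattern) for pattern in forbidden)
        is_allowed = any(fnmatch(path, pattern) for pattern in allowed)
        if is_forbidden or not is_allowed:
            violations.append(path)
    return violations


# Hypothetical session: the brief allowed billing code and tests only.
changed = ["app/billing/cap.py", "app/auth/session.py", "README.md"]
violations = out_of_scope(
    changed,
    allowed=["app/billing/*", "tests/*"],
    forbidden=["app/auth/*"],
)
print(violations)
```

A non-empty result is a prompt for the engineer to look closer, either to revert the stray edits or to decide that the brief's scope was drawn too tightly.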

The system has no orchestration layer above the agent. The agent does its work and there is no specialist agent reviewing the output, no automation kicking off the eval, no checkpoint between draft and merge. This is the slot where teams that have invested in their own orchestration — the agency-platform shape, where the orchestration is the platform and the agent is one of several specialists — get an obvious second-order benefit. The agent’s draft is not the deployable artifact. The draft passes through review specialists, eval gates, and human checkpoints before anything ships. Slop is a property of the system, not just of the agent in it.
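The idea that the draft passes through a sequence of checkpoints before it ships can be sketched as an ordered list of named gates. Everything here is an assumption made for illustration: the `Draft` fields, the gate names, and the 0.9 eval threshold are invented, and a real orchestration layer would call review agents and CI rather than read booleans.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    """Hypothetical record of an agent's draft and its check results."""
    diff: str
    tests_passed: bool
    eval_score: float
    human_approved: bool = False


Gate = Callable[[Draft], bool]

# Ordered checkpoints between draft and merge; names and threshold are assumed.
GATES: list[tuple[str, Gate]] = [
    ("tests", lambda d: d.tests_passed),
    ("evals", lambda d: d.eval_score >= 0.9),
    ("human checkpoint", lambda d: d.human_approved),
]


def first_failed_gate(draft: Draft, gates: list[tuple[str, Gate]]) -> Optional[str]:
    """Return the name of the first gate the draft fails, or None if it may ship."""
    for name, gate in gates:
        if not gate(draft):
            return name
    return None


draft = Draft(diff="(diff text)", tests_passed=True, eval_score=0.95)
print(first_failed_gate(draft, GATES))  # blocked until a human approves
```

Keeping the human checkpoint as the last gate reflects the workflow above: automation narrows what the reviewer has to look at, but does not replace the review.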

What we tell new engineers

Two teams have asked me for the version of this piece they can hand to a new engineer on day one. The compressed version, for that audience:

You are going to be working with coding agents most days. Your job is not to type. Your job is to specify, direct, and review. Write the brief before you launch the agent. Read the plan before you let the agent execute. Read every line of the diff before you commit. Run the tests. If you cannot read the code, do not ship it. If the agent produces something you would not have written yourself, ask why — the answer is usually that the agent is right and your habit is wrong, or that the agent is wrong and you would have caught it if you had been writing the code. Either answer is useful.

The agents are good. They are not magic. They are leverage. Leverage works in both directions; a careless engineer with an agent ships bugs faster than a careless engineer without one. Be the engineer the leverage helps.

That is what we tell new engineers, and it is what eighteen months of vibe coding has taught us is the actual core skill. The tools will keep changing. The discipline does not.

— Ginger Wolfe-Suarez