Vibe coding, Software 2.0, and the relocation of engineering difficulty

Summary

AI coding assistants compress the easy parts of software engineering (boilerplate, CRUD, standard patterns) from hours to minutes. The hard parts (system design, debugging code you didn't write, production reliability) get harder, not easier. Vibe coding is accelerated Software 1.0 with a natural language interface, not Karpathy's Software 2.0 where logic itself is learned. Understanding that distinction determines whether you ship faster or just accumulate technical debt faster.

Background & context

Three things happened in parallel over the last two years, and most people are conflating them.

First, LLM-based coding assistants went from novelty to infrastructure. GitHub Copilot launched in 2021 as a clever autocomplete. By 2025, Cursor, Claude Code, Aider, and a dozen other tools made it possible to generate entire features from a natural language description. The term "vibe coding" emerged to describe this workflow: you describe what you want in plain English, iterate on the output, and ship the result.

Second, Andrej Karpathy's "Software 2.0" thesis from 2017 got renewed attention. His argument: neural networks that learn behavior from data are replacing hand-written logic in an increasing number of domains. Computer vision, speech recognition, game playing, and increasingly recommendation systems and code generation itself operate this way. The "code" is the weights, not the source files.

Third, developers started reporting a paradox. The tools made them faster, but the work felt harder, not easier. Side projects shipped in days instead of weeks. Production incidents took longer to diagnose. Code reviews got longer because reviewers had to understand code written by a model, not a human colleague.

These three threads are related but distinct. The confusion between them is causing real problems in how teams adopt AI tooling.

Technical deep dive

The 3D printing analogy

The most useful framing for vibe coding comes from the 3D printing world. You don't 3D print a house. You 3D print your tools.

3D printing excels at rapid prototyping, custom jigs, one-off fixtures, and iteration. It's terrible at mass production, structural load-bearing construction, and anything requiring material properties that filament can't provide. The people who get the most value from 3D printers use them to make the tools that then make the final product.

Vibe coding works the same way. It's excellent for generating API clients from OpenAPI specs, scaffolding project structures (Next.js apps, Django projects, database migrations), writing test fixtures and mock data, producing first drafts of CRUD endpoints, and creating one-off scripts for data transformation.

It's dangerous for core business logic with domain-specific invariants, concurrent or distributed systems code where ordering matters, security-critical authentication and authorization flows, performance-sensitive database queries, and any code where the failure mode is "silently wrong" instead of "obviously broken."

The difference is about where the constraints live. Boilerplate has well-known constraints. The model has seen ten thousand Express routers and can generate a correct one. Your business logic has constraints that exist in your team's heads, your domain's regulations, and your production database's current state. The model has never seen those.

Why the hard parts get harder

When you write code yourself, you carry a mental model of what the code does and why. This model is incomplete, but it exists. When a model generates code, you have two choices: read every line carefully and build that mental model, or skip the reading and hope for the best.

Most developers, under time pressure, do something in between. They skim. They check that the code runs. They look for obvious errors. This works for boilerplate. It fails for subtle bugs.

Consider a concrete example. An LLM generates a database migration that adds a column with a default value. The migration runs fine locally with 100 rows. In production with 50 million rows, it locks the table for six minutes because the database rewrites every row to set the default. Whether that rewrite happens depends on the engine and version: PostgreSQL before version 11 rewrites the table, while later versions skip the rewrite when the default is non-volatile. The code is "correct" in the sense that it produces the right schema. It's catastrophically wrong in the sense that it takes down your service.
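One common way to avoid a long lock is to split the change into steps: add the column as nullable, then backfill it in small batches so that no single transaction touches every row. The sketch below illustrates only the batching idea. It uses SQLite from the Python standard library as a stand-in, and the users table, the plan column, and the backfill_in_batches helper are hypothetical names chosen for illustration. Locking behavior on a real production engine will differ and should be checked against that engine's documentation.

```python
import sqlite3


def backfill_in_batches(conn, table, column, value, batch_size=1000):
    """Fill NULLs in `column` with `value`, committing after each small batch.

    `table` and `column` are interpolated into the SQL, so they must come
    from trusted code, never from user input.
    """
    total = 0
    while True:
        cur = conn.execute(
            f"UPDATE {table} SET {column} = ? WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE {column} IS NULL LIMIT ?)",
            (value, batch_size),
        )
        conn.commit()  # end the transaction so each batch holds locks briefly
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Demo on an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (name) VALUES (?)",
    [(f"user{i}",) for i in range(2500)],
)
conn.commit()

# Step 1: add the column without a default, so existing rows stay NULL.
conn.execute("ALTER TABLE users ADD COLUMN plan TEXT")

# Step 2: backfill in batches of 1,000 rows.
updated = backfill_in_batches(conn, "users", "plan", "free", batch_size=1000)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE plan IS NULL"
).fetchone()[0]
```

A reviewer who knows the production row count would ask for this shape of migration. A model working only from the prompt has no way to know the table is large.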

Or take the N+1 query pattern. An LLM generates a React component that fetches user data, then maps over the results and fetches each user's profile in a separate request. Locally with 20 users, it's fine. In production with 2,000 users on the page, you've just sent 2,001 HTTP requests where 1 would have sufficed. The code looks clean. The variable names are good. The pattern is common in tutorials. It's a performance disaster.
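The request arithmetic is easy to reproduce without a browser. The sketch below is written in Python instead of React, and FakeApi is a hypothetical in-memory stand-in for a backend that only counts how many requests it receives. The combined list_users_with_profiles endpoint is an assumption: it represents whatever join or batch endpoint the real backend could offer.

```python
class FakeApi:
    """Hypothetical in-memory backend that counts incoming requests."""

    def __init__(self, n_users):
        self.request_count = 0
        self._profiles = {i: {"bio": f"bio {i}"} for i in range(n_users)}

    def list_users(self):
        self.request_count += 1
        return [{"id": i} for i in self._profiles]

    def get_profile(self, user_id):
        self.request_count += 1
        return self._profiles[user_id]

    def list_users_with_profiles(self):
        # Assumed combined endpoint: one request returns users and profiles.
        self.request_count += 1
        return [{"id": i, "profile": p} for i, p in self._profiles.items()]


def load_page_n_plus_one(api):
    """One request for the list, then one more request per user."""
    users = api.list_users()
    return [{"id": u["id"], "profile": api.get_profile(u["id"])} for u in users]


def load_page_single_request(api):
    """Same data, fetched in a single round trip."""
    return api.list_users_with_profiles()


slow_api = FakeApi(2000)
slow_rows = load_page_n_plus_one(slow_api)

fast_api = FakeApi(2000)
fast_rows = load_page_single_request(fast_api)

print(slow_api.request_count)  # 2001
print(fast_api.request_count)  # 1
```

Both functions return identical data, which is why the problem is invisible in a quick read or a local test with 20 users.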

This class of problem existed before AI. But AI amplifies it in three specific ways.

Velocity increases surface area. If you ship five times as much code per week, you have five times as much code that can break in production. The ratio of "code you understand deeply" to "code you've skimmed" shifts toward the latter.

Generated code hides bugs behind readability. LLMs produce code that looks like what a competent developer would write. Good variable names. Reasonable structure. Common patterns. This makes it easy to read and hard to scrutinize. Your brain's pattern-matching says "this looks fine" before your analytical reasoning catches the edge case.

Debugging becomes archaeology. When a bug appears in code you wrote, you can reconstruct your reasoning. When it appears in code a model generated, you're doing forensic analysis on someone else's logic. That "someone else" isn't available for questions. You have to read the code as if it were written by a stranger, because it was.

Software 2.0 vs. accelerated Software 1.0

Karpathy's Software 2.0 thesis identifies a real shift in how software gets built. In Software 1.0, you write explicit instructions: "if the user is authenticated and has role X, allow access to resource Y." In Software 2.0, you define an objective, collect training data, and optimize weights until the system behaves correctly. The "code" lives in a .pt or .safetensors file, not a .py file.

Vibe coding is not Software 2.0. It is Software 1.0 with a natural language compiler.

When you type "create a REST endpoint that returns paginated users with their last login date" and the model generates Express code, you still get explicit, deterministic logic. The route handler checks the page parameter, queries the database, transforms the result, and returns JSON. Every step is inspectable. The logic is fixed at generation time. If the requirements change, you regenerate or manually edit.
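To make "every step is inspectable" concrete, here is a minimal sketch of the logic such a handler contains. It is written as a plain Python function instead of Express code, with the database query replaced by an in-memory list. The function name, the field names, and the default page size are all assumptions made for illustration.

```python
def paginated_users(users, page=1, page_size=20):
    """Return one page of users, exposing only name and last login date."""
    # Step 1: check the page parameters.
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive integers")

    # Step 2: select the rows for this page (stands in for the database query).
    start = (page - 1) * page_size
    rows = users[start:start + page_size]

    # Step 3: transform the result, dropping fields the client should not see.
    items = [{"name": u["name"], "last_login": u["last_login"]} for u in rows]

    # Step 4: build the response body that would be returned as JSON.
    total_pages = -(-len(users) // page_size)  # ceiling division
    return {"page": page, "total_pages": total_pages, "users": items}


# Hypothetical data: 45 users, each with a field that must not leak.
all_users = [
    {"name": f"user{i}", "last_login": "2024-01-01", "password_hash": "x"}
    for i in range(45)
]
result = paginated_users(all_users, page=3, page_size=20)
```

Nothing here is learned from data. Each branch can be read, tested, and changed by hand, which is what makes it Software 1.0 no matter how it was typed.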

This matters because the debugging strategies are different. Software 1.0 bugs are logic bugs: you read the code, find the wrong branch, fix it. Software 2.0 bugs are behavior bugs: the model does the wrong thing on certain inputs, and you fix it by adding training data or adjusting the loss function. If you treat vibe-coded Software 1.0 as if it were Software 2.0 (just regenerate and hope it's better), you lose the ability to systematically fix problems. If you treat it as Software 1.0 (which it is), you need to read and understand every line before it ships.

The Open Vibe approach to shipping SaaS with AI captures this distinction well in practice. The workflow uses AI to generate the 80% of a SaaS app that's standard (auth, payments, dashboard, database schema) and then expects the developer to understand, customize, and own that generated code. The AI is a faster keyboard, not a replacement for engineering judgment.

Comparison & analysis

To see why the distinction between accelerated Software 1.0 and actual Software 2.0 matters, compare two approaches to the same problem: building a content moderation system for a platform.

Approach A uses vibe-coded Software 1.0. You prompt an LLM to generate a moderation service. It produces a Python service with rules: check against a blocklist, scan for regex patterns, apply heuristic scores. You iterate on the prompts, refine the rules, and deploy. When it misclassifies something, you add a rule or adjust a threshold.
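A minimal sketch of what Approach A tends to look like follows. The blocklist entries, the two regex heuristics, the weights, and the threshold are all hypothetical values chosen for illustration, not recommendations.

```python
import re

# Hypothetical rule set; a real service would load these from configuration.
BLOCKLIST = {"idiot", "loser"}
PATTERNS = [
    re.compile(r"(.)\1{4,}"),      # the same character repeated 5+ times
    re.compile(r"https?://\S+"),   # any link
]
THRESHOLD = 1.0


def moderation_score(text):
    """Sum explicit, human-readable rule scores for a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 1.0 * sum(1 for w in words if w in BLOCKLIST)
    score += 0.5 * sum(1 for p in PATTERNS if p.search(text))
    return score


def is_flagged(text):
    return moderation_score(text) >= THRESHOLD


print(is_flagged("you are an idiot"))   # True: blocklist hit
print(is_flagged("you are an 1d10t"))   # False: trivial obfuscation slips past
```

The second call shows the brittleness under adversarial input: the rule is easy to read and equally easy to evade, and the fix is yet another rule.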

Approach B uses Software 2.0. You define a classification objective, collect labeled examples of toxic and non-toxic content, and fine-tune a model. The model learns patterns in the data that no human would write as explicit rules. When it misclassifies something, you add it to the training set and retrain.
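The same workflow can be shown at toy scale. The sketch below swaps the fine-tuned model for a bag-of-words perceptron, a much simpler technique chosen only because it fits in a few lines with no dependencies. The six labeled sentences are made-up training data, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy labeled data: +1 means toxic, -1 means acceptable.
TRAINING_DATA = [
    ("you are an idiot", 1),
    ("you are kind", -1),
    ("what an idiot move", 1),
    ("what a great move", -1),
    ("shut up loser", 1),
    ("thanks for the help", -1),
]


def features(text):
    return set(text.lower().split())


def train(examples, epochs=20):
    """Perceptron training: nudge the weights after every misclassification."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for text, label in examples:
            score = bias + sum(weights[w] for w in features(text))
            predicted = 1 if score > 0 else -1
            if predicted != label:
                for w in features(text):
                    weights[w] += label
                bias += label
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias


def is_toxic(text, weights, bias):
    score = bias + sum(weights.get(w, 0.0) for w in features(text))
    return score > 0


weights, bias = train(TRAINING_DATA)
print(is_toxic("what an idiot", weights, bias))          # True
print(is_toxic("thanks for being kind", weights, bias))  # False
print(is_toxic("an apple", weights, bias))               # True: a data gap

# The fix is more data, not a new rule: add a counterexample and retrain.
weights2, bias2 = train(TRAINING_DATA + [("an apple a day", -1)])
print(is_toxic("an apple", weights2, bias2))             # False
```

The word "an" appears only in toxic examples in the toy data, so the model learns to treat it as a toxic signal. No line of source code says that. The behavior lives in the weights, and it changes only when the training set does.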

Both approaches use AI. They have fundamentally different failure modes, iteration speeds, and scaling properties.

Dimension	Vibe-coded Software 1.0	Software 2.0
Iteration speed	Fast to first version, slow to improve beyond obvious rules	Slow to first version (data collection), fast to improve with more data
Failure mode	False negatives from missing rules; brittle under adversarial input	False negatives from training data gaps; unpredictable on edge cases
Debugging method	Read the rules, find the gap	Analyze the data distribution, check the loss
Scaling behavior	Degrades as rules conflict and interact	Improves with more data and compute
Explainability	Each rule is human-readable	Requires interpretability tooling

The vibe-coded approach works well when the rules are simple and the domain is well-understood. The Software 2.0 approach works well when the patterns are complex and the data is abundant. Most real systems need both: a Software 2.0 classifier backed by Software 1.0 guardrails and fallbacks.
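One possible shape for that combination is sketched below, under stated assumptions: classify is any callable that returns a probability of toxicity between 0 and 1, the blocklist is a set of lowercase words, and the 0.3 and 0.7 cutoffs are placeholder values. None of these names come from a specific library.

```python
def moderate(text, classify, blocklist, low=0.3, high=0.7):
    """Return "block", "allow", or "review" for a piece of text."""
    # Guardrail (Software 1.0): an explicit rule always wins over the model.
    if set(text.lower().split()) & blocklist:
        return "block"

    # Learned component (Software 2.0), with a fallback if it is unavailable.
    try:
        probability = classify(text)
    except Exception:
        return "review"

    if probability >= high:
        return "block"
    if probability <= low:
        return "allow"
    return "review"  # the model is unsure, so a human decides


def broken_classifier(text):
    raise RuntimeError("model server unavailable")


blocklist = {"idiot"}
print(moderate("you idiot", lambda t: 0.0, blocklist))      # block
print(moderate("nice work", lambda t: 0.1, blocklist))      # allow
print(moderate("hmm, really", lambda t: 0.5, blocklist))    # review
print(moderate("nice work", broken_classifier, blocklist))  # review
```

The explicit branches stay readable and testable, while the classifier handles the cases nobody wrote a rule for.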

The mistake is using vibe coding when you need Software 2.0, or treating Software 2.0 output as if it were inspectable Software 1.0 logic. The first gives you a brittle rule engine that can't handle the complexity. The second gives you a black box you can't debug.

Compare this with traditional pair programming. In a pair, two humans share a mental model of the code. When one writes, the other reviews in real time. The reviewer can ask "why did you do it this way?" and get an answer. With vibe coding, the "pair partner" is a model that can't explain its specific reasoning for a particular line of generated code. It can give you a plausible post-hoc explanation, but that's different from the actual reasoning path that produced the output. Code review for generated code needs to be more thorough, not less, than review for human-written code.

Practical implications

For senior engineers and engineering leads, the practical takeaway is about where you direct AI assistance and where you resist it.

Use AI to print your tools. Generate the scaffolding, the scripts, the test fixtures, the migration templates. These are the jigs and fixtures of software development. They need to be correct, but they don't carry the core business logic. If a generated fixture or one-off script has a bug, you usually catch it in testing. The exceptions are artifacts that only misbehave at production scale, such as a migration against a very large table, and those deserve closer review. If your generated auth flow has a subtle security flaw, you might not catch it until someone exploits it.

Read every line of generated code before it ships. This is the discipline that separates productive AI use from dangerous AI use. The model is not a colleague who will catch their own mistakes in review. It's a very fast typist who sometimes types plausible nonsense. You are the reviewer. Every time.

Distinguish between generation and understanding. When you generate code with AI, you get the output without the understanding that writing it yourself would have forced. You need to deliberately build that understanding after the fact. This takes time. Sometimes it takes as long as writing the code yourself. If it does, that's fine. The point of AI assistance is not speed at all costs. It's speed with maintained quality.

Recognize when you need Software 2.0, not faster Software 1.0. If your problem involves pattern recognition over complex data (content moderation, recommendation, anomaly detection, natural language understanding), vibe-coded rules will hit a ceiling. Invest in data collection and model training instead. The upfront cost is higher, but the ceiling is much higher too.

Set team norms around generated code. At a minimum: generated code should be marked in review, reviewed by someone other than the person who generated it, and tested with the same rigor as hand-written code. Some teams require a comment on each generated file indicating it was AI-assisted. This isn't about blame. It's about calibrating review attention.
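Norms like these are easier to keep when a tool checks them. The sketch below shows one way the three minimum rules could be expressed as a check, assuming the review tooling can supply an author, a reviewer list, an AI-assisted flag, and a tests flag for each change. The Change record and the norm_violations function are hypothetical and not tied to any real code review system.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """Hypothetical summary of a change under review."""
    author: str
    reviewers: list = field(default_factory=list)
    ai_assisted: bool = False
    has_tests: bool = False


def norm_violations(change):
    """List which team norms a change breaks. An empty list means none."""
    problems = []
    # Tests are required for all code, generated or hand-written.
    if not change.has_tests:
        problems.append("missing tests")
    # Generated code needs a reviewer other than the person who generated it.
    if change.ai_assisted:
        independent = [r for r in change.reviewers if r != change.author]
        if not independent:
            problems.append("AI-assisted change needs an independent reviewer")
    return problems


ok_change = Change("ana", reviewers=["ben"], ai_assisted=True, has_tests=True)
self_reviewed = Change("ana", reviewers=["ana"], ai_assisted=True, has_tests=False)

print(norm_violations(ok_change))      # []
print(norm_violations(self_reviewed))
```

The ai_assisted flag covers the "marked in review" rule. Whether it comes from a file comment, a commit trailer, or a checkbox in the review tool is a team decision.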

The hardware and deployment story is straightforward. LLM coding assistants run as cloud services (Copilot, Cursor) or locally (open-weight models served through a runner such as Ollama, given sufficient VRAM). For most developers, the cloud services are fast enough and good enough. The local option matters for air-gapped environments or codebases with strict data residency requirements. Either way, the generated code runs on your infrastructure with your dependencies and your security posture. The model doesn't change your deployment pipeline. It changes what flows into it.

The engineers who will struggle most in this shift are the ones who treat vibe coding as a complexity reducer. It isn't. It's a complexity relocator. The code generation gets easier. The code ownership gets harder. The faster you generate, the more disciplined your review and testing practices need to be. That's the trade, and there's no way around it.

References

"You don't 3D print a house. You print your tools." - https://dev.to/aws-heroes/you-dont-3d-print-a-house-you-print-your-tools-2h00
"AI Didn't Make Software Engineering Easier. It Made the Hard Parts Harder." - https://dev.to/iampraveen/ai-didnt-make-software-engineering-easier-it-made-the-hard-parts-harder-39n4
"Open Vibe: Ship your SaaS with AI. Without getting stuck." - https://dev.to/wasp/open-vibe-ship-your-saas-with-ai-without-getting-stuck-e2h
"Andrej Karpathy: Software in the era of AI" - https://www.youtube.com/watch?v=LCEmiRjPEtQ
