Constraint-Driven Vibe Coding

Measuring AI-generated code beyond speed through a Laravel importer benchmark.

Laravel Artisan · CSV importer · 930 benchmark rows · VCPRI scorecard

Abstract

Vibe Coding is often sold as a speed story: how quickly an AI tool can turn a prompt into code that runs. That is the least interesting measurement. Fast code can still be wrong, expensive, fragile, or unsafe.

This study treats Vibe Coding as an engineering problem. It asks what happens when generated code is judged by production constraints instead of first-run success. The case study is a Laravel product CSV importer tested against a fixed benchmark dataset derived from real Amazon product data.

The Claim

The danger in AI-generated code is not that it can be bad. Developers could write bad code long before AI entered the room. The new danger is that generated code can look complete before it has proved anything.

A demo importer may read a CSV, insert products, and print a friendly summary. Production asks colder questions: Did it reject broken rows? Did it avoid loading the whole dataset into memory? Did it turn one import into thousands of queries? Did it leave enough evidence to debug supplier data next week?

Most public discussion around AI coding is too confident for the evidence it shows. One side says AI is already better than most programmers. Another side says AI code is disposable junk. Both arguments are usually built from impressions: a screen recording, a surprising completion, a broken function, or a personal workflow story. Those examples may be useful, but they are not measurement.

This benchmark asks a smaller question: does a generated importer behave better when the prompt contains the constraints a production engineer would normally carry in their head?

Prior Evidence and Public Debate

Karpathy's original vibe coding post matters less as a rigorous claim than as a cultural label. It named a mode of work many developers had already felt: describe intent, accept large generated changes, and steer the result by conversation. That is useful as a description of practice, but it is not by itself a standard of quality.

The optimistic case is real. GitHub's Copilot research reported that, in a controlled JavaScript HTTP-server task with 95 professional developers, the Copilot group finished substantially faster than the control group. That result is worth taking seriously because it used random assignment and a scored task. It is also narrower than many public retellings of it: the task was bounded, the success criteria were known, and the study measured completion of that task rather than production durability.

The skeptical case is also real. METR's 2025 study tested experienced open-source developers on real issues in repositories they already knew, and found that allowing AI increased completion time in that setting. METR now labels those results historical because newer 2026 results are available, so the point here is not that AI always slows developers down. The point is methodological: realistic work can overturn what feels obvious in smaller demonstrations.

Trust surveys point in the same direction. The 2025 Stack Overflow Developer Survey reported that more developers distrusted the accuracy of AI-tool output than trusted it. That does not mean developers reject AI. It means verification has become part of the workflow, especially for people accountable for what ships.

The strongest framing for this article comes from DORA's 2025 AI-assisted software development report: AI behaves like an amplifier of an organization's existing strengths and weaknesses. A strong prompt, a known contract, tests, logging, and review give the model a system to amplify. A vague request gives it mostly surface shape to optimize.

Security research makes the same warning sharper. Broken by Default, a 2026 formal-verification study of AI-generated security-critical code, reported vulnerable artifacts across all tested models. This importer benchmark is not a security benchmark, but it borrows the same discipline: do not ask whether the output looks plausible; ask which contract it can survive.

Why Speed Is a Weak Signal

The fastest implementation is often the easiest one to overvalue. A command exists, a table receives rows, and the terminal prints a summary. An application that was empty a moment ago now shows motion, and motion can look like progress.

For importers, that feeling is especially misleading. Bad data is still data. A weak importer may store invalid statuses, quietly coerce broken quantities to zero, accept malformed variants, or update an existing product with the wrong identity. The damage appears later in filters, inventory numbers, support tickets, and hard-to-explain catalog drift.

That is why this benchmark gives every strategy the same dirty input and a known expected outcome. The test is not whether the command runs. The test is whether it leaves the database in the right state and tells the truth about what it rejected.

Benchmark Shape

The raw source is an Amazon product dataset. It contains real product names, prices, categories, and variant noise. It is not used directly as the benchmark because public data alone does not provide a known expected outcome.

This distinction is important. Real data gives the benchmark texture: long product names, inconsistent descriptions, empty supplier fields, prices formatted as strings, and variant data that arrives in awkward shapes. Controlled data gives the benchmark judgment: before a command runs, the expected inserted, updated, and rejected counts are already known.

The benchmark therefore keeps the texture of real data, then wraps it in a measurable contract: 930 input rows, 600 expected inserts, 250 expected updates, and 80 expected rejects.

The benchmark flow is:

Amazon Product Data → Benchmark Builder → Laravel Artisan Runs → Measured Results → VCPRI Score

The operational setup has three separate parts: controlled import input, preloaded existing products, and a ground-truth table used only after execution. The importer sees the input rows and the preloaded products. It never sees the expected answer while it runs; that answer is used only for scoring.
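The post-run scoring step can be sketched as a plain comparison between what a command reported and what the ground truth expects. This is a minimal illustration, not the study's scoring code: `compareToGroundTruth` and the array keys are hypothetical names. The first call uses counts that match the contract; the second uses the raw strategy's counts reported under Measured Results.

```php
<?php
// Hypothetical helper: compares the counts a command reported with the
// ground-truth counts, which are consulted only after the run has finished.
function compareToGroundTruth(array $expected, array $measured): array
{
    $diff = [];
    foreach ($expected as $metric => $want) {
        $got = $measured[$metric] ?? 0;
        if ($got !== $want) {
            $diff[$metric] = ['expected' => $want, 'measured' => $got];
        }
    }
    return $diff; // an empty array means the run matched the contract
}

// Expected outcome defined by the benchmark contract.
$expected = ['inserted' => 600, 'updated' => 250, 'rejected' => 80];

// The three classes must account for every input row.
$totalRows = array_sum($expected);

$clean = compareToGroundTruth($expected, ['inserted' => 600, 'updated' => 250, 'rejected' => 80]);
$drift = compareToGroundTruth($expected, ['inserted' => 640, 'updated' => 258, 'rejected' => 32]);
```

Keeping the expected counts out of the importer's reach, and comparing only afterwards, is what lets a run that "finished" still be marked as wrong.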

Importer Contract

The benchmark does not model Amazon's full catalog. It defines a modest internal product contract that is enough to test importer behavior.

Internal field	CSV field	Type	Rule
`source_id`	Uniq Id	string	Original source row identifier.
`sku`	Sku	string	Primary product key when present.
`asin`	Asin	string	Fallback identifier when `Sku` is missing.
`name`	Product Name	string	Required and non-empty.
`brand`	Brand Name	string	Optional.
`category`	Category	string	Optional.
`price`	Selling Price	decimal	Required, numeric, and greater than zero.
`list_price`	List Price	decimal	Optional; numeric when present.
`quantity`	Quantity	integer	Required in the benchmark; zero or greater.
`status`	Stock	string	Must be one of the allowed benchmark statuses.
`variants`	Variants	json	Optional; valid JSON when present.
`description`	Product Description	text	Optional; length-limited.

Product identity follows a fixed fallback chain: Sku, then Asin, then Uniq Id. A row without any usable key is rejected. This keeps the realism of the source data while giving every implementation a stable contract.
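The fallback chain can be sketched as a small function over the CSV column names from the contract table. The function name `resolveProductKey` is hypothetical, and treating a whitespace-only value as unusable is an assumption; the source does not define "usable" more precisely.

```php
<?php
// Hypothetical helper: returns the first usable identifier in the
// Sku -> Asin -> Uniq Id order, or null when the row has no usable key.
// Assumption: a value is "usable" when it is non-empty after trimming.
function resolveProductKey(array $row): ?string
{
    foreach (['Sku', 'Asin', 'Uniq Id'] as $column) {
        $value = trim((string) ($row[$column] ?? ''));
        if ($value !== '') {
            return $value;
        }
    }
    return null; // the caller should reject this row
}

$withSku  = resolveProductKey(['Sku' => 'SKU-1', 'Asin' => 'B000TEST01', 'Uniq Id' => 'abc']);
$asinOnly = resolveProductKey(['Sku' => '  ', 'Asin' => 'B000TEST01', 'Uniq Id' => 'abc']);
$noKey    = resolveProductKey(['Sku' => '', 'Asin' => '', 'Uniq Id' => '']);
```

Returning `null` instead of an empty string keeps "no key" distinct from any real identifier, so the reject path cannot be skipped by accident.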

The contract is intentionally modest. It does not try to model images, shipping weight, dimensions, or every marketplace detail. That restraint is part of the design: the benchmark isolates the behaviors that usually break first when generated code moves from demo data to supplier data.

Scenario Protocol

Each strategy is implemented as a Laravel Artisan command in the same application. All commands read the same controlled import data and the same preloaded product state. Each reports its inserted, updated, rejected, and failed counts, along with runtime, peak memory, and query count.

The protocol matters because otherwise the comparison becomes theater. If the raw version is repaired after seeing the results, it is no longer raw. If the baseline is quietly optimized, it no longer represents ordinary Laravel code. If the constraint-driven version receives a different benchmark, it no longer proves anything about constraints.

The benchmark compares three strategies:

Mid-level Laravel baseline: Practical Laravel code using Laravel validation with Eloquent models, row-by-row reading, basic reject logging, and updateOrCreate.
Raw vibe coding: A broad AI prompt with only small integration fixes. It is not sabotaged; it is simply under-specified.
Constraint-driven vibe coding: AI generation with explicit limits for memory, validation, query pressure, reject logs, and testability.

The raw strategy is intentionally under-specified rather than sabotaged: many generated features fail not because the prompt asks for bad code, but because it never names the pressure the code must survive. The mid-level baseline is not a straw man either: a Laravel command that validates each row and calls updateOrCreate is a recognizable first production pass. The constraint-driven version is allowed to be more deliberate because its prompt explicitly names memory, validation, query, logging, and testability budgets.

What the Constraint Changes

The difference starts before code exists. A raw prompt names the feature. A constraint-driven prompt names the feature and the conditions under which the feature is allowed to count as done.

Raw prompt:

Build a Laravel CSV importer for products. It should read the CSV,
insert new products, update existing products, reject invalid rows,
and return a summary.

Constraint-driven prompt:

Build a Laravel product importer for the benchmark CSV. Do not load
the full dataset into memory. Use sku, then asin, then source_id as the
product-key fallback. Validate price, quantity, status, variants, and
description length. Reject duplicate product keys inside the import
input. Preload existing keys to reduce database queries. Log every
rejected row with row number and reason. Return measured counts,
runtime, memory, and query count.

The important part is not that the second prompt is longer. It makes failure visible. A row is not merely skipped; it is classified. That classification is what allows the benchmark to compare behavior instead of relying on a general feeling that one implementation is cleaner.
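The first constraint in that prompt, not loading the full dataset into memory, can be sketched with a generator that yields one row at a time. This is an illustration under assumptions, not the benchmarked implementation: `streamCsvRows` is a hypothetical name, the file is assumed to have a header row, and padding short rows with empty strings is a choice made here so that later validation can reject them with a reason.

```php
<?php
// Hypothetical helper: yields rowNumber => associative row, one record at a
// time, so memory use does not grow with the number of rows in the file.
function streamCsvRows(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open CSV: {$path}");
    }
    try {
        $header = fgetcsv($handle, null, ',', '"', '');
        if ($header === false) {
            return; // empty file: nothing to yield
        }
        $width = count($header);
        $rowNumber = 1; // the header counts as row 1
        while (($fields = fgetcsv($handle, null, ',', '"', '')) !== false) {
            $rowNumber++;
            if ($fields === [null]) {
                continue; // fgetcsv returns [null] for a blank line
            }
            // Assumption: pad or trim to the header width instead of throwing.
            $fields = array_slice(array_pad($fields, $width, ''), 0, $width);
            yield $rowNumber => array_combine($header, $fields);
        }
    } finally {
        fclose($handle);
    }
}

// Usage with a tiny temporary file.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "Sku,Product Name\nSKU-1,Lamp\nSKU-2,\"Desk, oak\"\n");

$names = [];
foreach (streamCsvRows($path) as $rowNumber => $row) {
    $names[$rowNumber] = $row['Product Name'];
}
unlink($path);
```

Carrying the row number alongside each row is what later allows a reject log to point at the exact record that failed.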

Validation contrast:

// Raw-style validation: enough for a demo, not enough for the contract.
if ($productKey === '' || trim($row['name']) === '' || $row['price'] <= 0) {
    reject($rowNumber);
}

// Constraint-driven validation: every controlled failure has a rule.
if ($productKey === '') return reject($rowNumber, 'missing_product_key');
if (trim($row['name']) === '') return reject($rowNumber, 'empty_name');
if (! is_numeric($row['price']) || $row['price'] <= 0) return reject($rowNumber, 'invalid_price');
if (! preg_match('/^-?\d+$/', $row['quantity']) || $row['quantity'] < 0) return reject($rowNumber, 'invalid_quantity');
if (! in_array($row['status'], $allowedStatuses, true)) return reject($rowNumber, 'invalid_status');
if ($row['variants'] !== '' && json_decode($row['variants']) === null && json_last_error() !== JSON_ERROR_NONE) {
    return reject($rowNumber, 'invalid_variants_json');
}
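The `reject($rowNumber, $reason)` calls above imply some structure that records each rejection. One possible shape is sketched below; the `RejectLog` class, its method names, and the JSON-lines output format are assumptions for illustration and are not taken from the study.

```php
<?php
// Hypothetical reject log: one JSON object per rejected row, plus a
// per-reason tally that can be compared with the expected reject count.
final class RejectLog
{
    /** @var array<string,int> reason => number of rejected rows */
    private array $counts = [];

    /** @var string[] one JSON-encoded entry per rejected row */
    private array $lines = [];

    public function reject(int $rowNumber, string $reason): void
    {
        $this->counts[$reason] = ($this->counts[$reason] ?? 0) + 1;
        $this->lines[] = json_encode(['row' => $rowNumber, 'reason' => $reason]);
    }

    public function counts(): array
    {
        return $this->counts;
    }

    public function total(): int
    {
        return array_sum($this->counts);
    }

    public function toJsonLines(): string
    {
        return implode("\n", $this->lines);
    }
}

$log = new RejectLog();
$log->reject(14, 'invalid_price');
$log->reject(27, 'invalid_status');
$log->reject(31, 'invalid_price');
```

A per-reason tally makes it possible to check not only that 80 rows were rejected, but that they were rejected for the right reasons.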

Scoring

The Vibe Coding Production Readiness Index, or VCPRI, combines correctness, resource use, security, failure behavior, and maintainability.

VCPRI = 0.25C + 0.20M + 0.15R + 0.15Q + 0.10S + 0.10F + 0.05T

Here, C is correctness, M memory efficiency, R runtime, Q query efficiency, S security, F failure handling, and T testability and maintainability.

Checklist Score = (Passed Checks / Total Checks) × 100
Memory Score = (Lowest Peak Memory / Current Peak Memory) × 100
Runtime Score = (Fastest Runtime / Current Runtime) × 100
Query Score = (Lowest Query Count / Current Query Count) × 100

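The weights are local to this importer. Correctness receives the highest weight because wrong data is not a performance problem; it is a trust problem. Memory, runtime, and query count receive strong weights because import jobs often fail under scale before they fail in small examples. Because the memory, runtime, and query scores are normalized against the best run in the comparison, a score of 100 on those metrics means best among the three strategies, not a theoretical optimum.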

Metric	Weight	Reason
Correctness	0.25	The importer must match expected inserts, updates, and rejects.
Memory	0.20	CSV imports grow by input size, and memory exhaustion is a hard failure.
Runtime	0.15	Time matters, but a correct slow importer is better than a fast corrupt one.
Query count	0.15	Query pressure is often the hidden cost of row-by-row import logic.
Security	0.10	Imports touch external data and must avoid unsafe parsing and silent trust.
Failure handling	0.10	Rejected rows must be visible, classified, and debuggable.
Testability and maintainability	0.05	The code must be understandable enough to change safely.

Measured Results

The benchmark was run through real Laravel Artisan commands using SQLite for repeatable local measurement. Each strategy was executed three times, and the reported values are averages.

Strategy	Inserted	Updated	Rejected	Memory MB	Runtime s	Queries
Mid-level Laravel baseline	600	250	80	28.0	12.0625	1700
Raw vibe coding	640	258	32	30.0	13.2861	1796
Constraint-driven vibe coding	600	250	80	26.0	3.1975	851
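As a worked check, the relative memory, runtime, and query scores can be recomputed from the measured values in this table using the formulas from the Scoring section. The function name `relativeScore` and the array layout are illustrative choices; the numbers are the ones reported above.

```php
<?php
// Relative score: best observed value / current value x 100, rounded to 2 places.
function relativeScore(float $best, float $current): float
{
    return round($best / $current * 100, 2);
}

// Measured averages from the results table.
$runs = [
    'baseline'   => ['memory' => 28.0, 'runtime' => 12.0625, 'queries' => 1700],
    'raw'        => ['memory' => 30.0, 'runtime' => 13.2861, 'queries' => 1796],
    'constraint' => ['memory' => 26.0, 'runtime' => 3.1975,  'queries' => 851],
];

$scores = [];
foreach (['memory', 'runtime', 'queries'] as $metric) {
    // Lower is better for all three metrics, so the best value is the minimum.
    $best = min(array_column($runs, $metric));
    foreach ($runs as $name => $run) {
        $scores[$name][$metric] = relativeScore($best, $run[$metric]);
    }
}
```

Under these formulas the baseline lands at roughly 92.86 for memory, 26.51 for runtime, and 50.06 for queries, and the constraint-driven run scores 100 on all three because it holds the best value for each.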

The summarized VCPRI scores were:

Raw vibe coding: 43.30 VCPRI
Laravel baseline: 78.06 VCPRI
Constraint-driven: 100.00 VCPRI

Strategy	C	M	R	Q	S	F	T	VCPRI
Mid-level Laravel baseline	100.00	92.86	26.51	50.06	100.00	100.00	60.00	78.06
Raw vibe coding	25.00	86.67	24.07	47.38	40.00	40.00	20.00	43.30
Constraint-driven vibe coding	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
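The VCPRI column can be reproduced from the weights in the Scoring section. The sketch below is a verification aid, not the study's tooling: `vcpri` and `VCPRI_WEIGHTS` are hypothetical names, and the M, R, and Q inputs are recomputed from the measured values as unrounded ratios so that rounding happens only once, at the end.

```php
<?php
// Weights from the VCPRI formula.
const VCPRI_WEIGHTS = [
    'C' => 0.25, 'M' => 0.20, 'R' => 0.15, 'Q' => 0.15,
    'S' => 0.10, 'F' => 0.10, 'T' => 0.05,
];

// Weighted sum of the seven component scores, rounded to two decimals.
function vcpri(array $scores): float
{
    $total = 0.0;
    foreach (VCPRI_WEIGHTS as $metric => $weight) {
        $total += $weight * $scores[$metric];
    }
    return round($total, 2);
}

// Checklist scores (C, S, F, T) are taken from the table; M, R, Q are
// best / current x 100 using the measured memory, runtime, and query values.
$baseline = vcpri([
    'C' => 100.0, 'S' => 100.0, 'F' => 100.0, 'T' => 60.0,
    'M' => 26.0 / 28.0 * 100,
    'R' => 3.1975 / 12.0625 * 100,
    'Q' => 851 / 1700 * 100,
]);

$raw = vcpri([
    'C' => 25.0, 'S' => 40.0, 'F' => 40.0, 'T' => 20.0,
    'M' => 26.0 / 30.0 * 100,
    'R' => 3.1975 / 13.2861 * 100,
    'Q' => 851 / 1796 * 100,
]);
```

With these inputs the baseline comes out at about 78.06 and the raw strategy at about 43.30, matching the reported scores.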

Reading the Results

The raw implementation finished, but it did not obey the import contract. It inserted 640 rows instead of 600, updated 258 instead of 250, and rejected only 32 of the 80 invalid rows. That is the most dangerous kind of importer failure: the command completes, the database fills, and the mistake hides inside accepted data.

The Laravel baseline did the right thing functionally. It inserted, updated, and rejected exactly as expected. Its weakness was shape, not correctness. Row-by-row Eloquent plus updateOrCreate produced 1700 counted queries and the slowest correct run.

The constraint-driven version changed the design before the run: stream rows, validate explicitly, preload existing product keys, reject duplicate keys inside the input, and write structured reject logs. It reached the same correct outcome with 851 counted queries and much lower runtime.
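The preload and duplicate-key steps can be sketched without a database: load the existing keys once, then classify each incoming row in memory. Everything here is illustrative. `planImport`, the `key` field, and the reason string are hypothetical, and the sketch assumes each row has already passed validation and key resolution.

```php
<?php
// Hypothetical planner: classifies rows against a key set loaded once up
// front, instead of asking the database about every row.
function planImport(iterable $rows, array $existingKeys): array
{
    $existing = array_flip($existingKeys); // key => index, for cheap isset() lookups
    $seen = [];                            // keys already used by an earlier input row
    $plan = ['insert' => [], 'update' => [], 'reject' => []];

    foreach ($rows as $rowNumber => $row) {
        $key = $row['key'];
        if (isset($seen[$key])) {
            $plan['reject'][$rowNumber] = 'duplicate_product_key';
            continue;
        }
        $seen[$key] = true;
        $action = isset($existing[$key]) ? 'update' : 'insert';
        $plan[$action][$rowNumber] = $key;
    }
    return $plan;
}

$plan = planImport(
    [2 => ['key' => 'SKU-1'], 3 => ['key' => 'SKU-9'], 4 => ['key' => 'SKU-9']],
    ['SKU-1', 'SKU-2'] // keys of the preloaded products
);
```

The source does not break down its query counts. The arithmetic is at least consistent with this design, though: 850 accepted rows and 851 queries would fit one preload query plus one write per accepted row, while 1700 would fit two queries per accepted row in the baseline. That reading is an inference, not a reported result.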

This is the practical difference between a feature request and an engineering request. “Build an importer” describes a visible outcome. “Build an importer that streams input, validates a fixed contract, avoids per-row existence checks, rejects duplicate keys, and logs every failure reason” describes the conditions under which the outcome is allowed to count as done.

The raw result is useful precisely because it is plausible. It did not crash. It did not produce an empty table. It produced numbers that could pass a casual review if nobody checked the ground truth. That is why benchmarks matter in AI-assisted coding: they replace the feeling of completion with an external account of what actually happened.

What This Shows About AI Coding

The study does not prove that AI is better than a developer, or that constraint-driven prompting always wins. It shows something more actionable: AI output responds strongly to the shape of the request. When the prompt is vague, the output tends to optimize for visible completion. When the prompt contains measurable pressure, the output can move toward engineering behavior.

That should make developers more active, not less. The engineer's job shifts from typing every line to defining the boundaries that make generated code measurable: what must be rejected, what must be logged, what must not be loaded, and what outcome counts as correct.

Limits

This is a measured case study, not a universal law. The benchmark uses SQLite, so MySQL or PostgreSQL may change absolute runtime. The workload has 930 rows, enough to expose validation and query-shape differences, but not enough to claim large-import behavior at 100,000 rows.

Runtime is measured as full Laravel Artisan command time. That includes framework boot cost: Composer autoloading, service providers, configuration, command resolution, and database setup. This cost is acceptable for the comparison because every strategy pays it in the same Laravel application under the same environment. It may affect the absolute runtime numbers, but it does not give one scenario a special advantage inside this benchmark.

Still, the study is useful because the failure modes are ordinary. Supplier data is messy. Product keys disappear. JSON breaks. “Working” importers accept rows they should reject. These are exactly the cases where AI-generated code needs constraints before it needs applause.

A next version should run the same scenarios against a larger workload, a networked database, and a stricter test suite around reject reasons. It should also separate cold-start framework boot from the import loop itself.
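Separating the import loop from framework boot could be as simple as timing only the loop itself. The sketch below is one way to do that, assuming PHP 7.3 or later for `hrtime`; the `measure` function and the report keys are hypothetical, and the closure stands in for a real import loop.

```php
<?php
// Hypothetical wrapper: times only the callable it is given, so any boot
// work that happened before the call is excluded from the reported runtime.
function measure(callable $work): array
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    $result = $work();
    $elapsedNs = hrtime(true) - $start;

    return [
        'result' => $result,
        'seconds' => $elapsedNs / 1e9,
        // Process-wide peak: unlike the timer, this still includes memory
        // allocated before $work() started.
        'peak_memory_mb' => memory_get_peak_usage(true) / 1048576,
    ];
}

// Stand-in for the import loop: counts 930 placeholder rows.
$report = measure(function (): int {
    $accepted = 0;
    foreach (range(1, 930) as $row) {
        $accepted++;
    }
    return $accepted;
});
```

Reporting both the full command time and the loop-only time would show how much of each strategy's runtime is shared overhead and how much is import logic.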

Conclusion

The useful lesson is not that AI code is good or bad. The useful lesson is that unconstrained generation is too easy to trust. Raw vibe coding produced an importer-shaped result and failed the contract. The baseline was correct but heavy. The constraint-driven version was correct and cheaper because the prompt described the pressure the code had to survive.

For production work, the prompt should not stop at “build the feature.” It should define data size, validation rules, query budget, memory behavior, failure logs, security boundaries, and testable structure. At that point, Vibe Coding begins to look less like a shortcut and more like engineering.

For anyone who wants to dive deeper, I’ve attached the full study as a PDF here with all its details.
