Specification as Source Code: An Engineering View of AI-Assisted Development

When AI generation stops scaling

Ad hoc AI usage works for a while. A developer describes a function, pastes the result, fixes what looks wrong, and moves on. For a single file this is faster than typing. For a feature spanning a data layer, a domain layer, a presentation layer, routing, dependency injection, and localization across more than a dozen languages, it stops working. The output drifts. One feature uses one state management pattern, the next uses another. Naming diverges. Error handling becomes inconsistent. Two developers prompting the same model produce code that does not look like it belongs in the same repository.

The failure is architectural, not a failure of the model. An AI generator is a function of its input. Vague, conversational input yields plausible code that no two runs agree on. The engineering question is therefore how to make that input precise, repeatable, and reviewable. Solve that, and AI generation becomes a predictable stage in a delivery pipeline rather than a gamble.

A second decision matters as much as the input format: the target architecture itself. The workflow described here generates code against Clean Architecture, a layering model that separates a feature into a data layer for sources and repository implementations, a domain layer for entities and business rules, and a presentation layer for screens and state. The layers depend inward, so the domain knows nothing of the database or the framework. This is not an incidental choice. A layered architecture with strict dependency rules gives the generator a fixed skeleton to fill, and it gives the reviewer fixed boundaries to check. Generation needs a target shape, and Clean Architecture supplies one that is both rigid enough to enforce and testable by construction. AI generation improves when the target architecture has explicit boundaries, predictable file placement, stable dependency rules, and repeatable feature shapes.
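The inward dependency rule is the kind of boundary that can be checked mechanically. The following is a minimal sketch, assuming a hypothetical layout in which every file sits under a directory named after its layer and the imports of each file have already been extracted into a list. The `ALLOWED` table, the helper names, and the example paths are illustrative assumptions, not the project's actual tooling.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed rule set: a layer may import from itself and from the layers listed.
# The domain layer is innermost, so it may import only from itself.
ALLOWED = {
    "domain": {"domain"},
    "data": {"data", "domain"},
    "presentation": {"presentation", "domain"},
}


def layer_of(path: str) -> Optional[str]:
    """Return the first path segment that names a known layer, if any."""
    for part in PurePosixPath(path).parts:
        if part in ALLOWED:
            return part
    return None


def violations(imports: dict[str, list[str]]) -> list[tuple[str, str]]:
    """List (source, target) pairs where a file imports from a disallowed layer."""
    found = []
    for source, targets in sorted(imports.items()):
        src_layer = layer_of(source)
        if src_layer is None:
            continue  # file is outside the layered tree; nothing to check
        for target in targets:
            dst_layer = layer_of(target)
            if dst_layer is not None and dst_layer not in ALLOWED[src_layer]:
                found.append((source, target))
    return found
```

Given a domain file that imports from the data layer, `violations` returns that pair and nothing else, and the same input always produces the same answer.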

Spec-driven generation, defined

Spec-driven generation is an approach in which the unit of work is not a prompt but a specification document: a structured, machine-readable description of a feature that an AI generator consumes to produce the implementation. The specification names the interface contracts, the screen and component hierarchy, the state model, the behavioral rules, and the success criteria. It is detailed enough that two different generation runs converge on substantially the same architecture.
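One way to picture a specification as structured data is the sketch below. It assumes a hypothetical `FeatureSpec` type whose fields mirror the five parts named above; the field names and the fingerprint helper are illustrative, not the format used on the project.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical container for the parts a specification names."""
    name: str
    interface_contracts: tuple[str, ...]
    component_hierarchy: tuple[str, ...]
    state_model: tuple[str, ...]
    behavioral_rules: tuple[str, ...]
    success_criteria: tuple[str, ...]

    def fingerprint(self) -> str:
        """Short stable hash of the content, usable as a version label."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Two specifications with identical content share a fingerprint, and any edit changes it, so a generated feature can be traced back to the exact specification that produced it.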

The shift is subtle but consequential. A prompt is a conversation. A specification is an artifact. Artifacts can be versioned, reviewed, templated, and reused. Once the specification becomes the real source of truth, the generated code becomes a build output, and the team’s effort moves to where it now belongs: writing and reviewing the specification.

The architecture of a spec-driven workflow

A workflow that produces consistent code at volume is not one document and one model. It is a small system of components, each with a defined job. Four pillars carry the weight.

Documentation as machine-readable context

In a conventional project, documentation explains why decisions were made, in prose, for humans, usually written after the code. In a spec-driven workflow, documentation is structured input written before the code. It favors imperative constraints, decision tables, and copy-ready interface contracts over narrative. The same file that a developer reads as reference is the literal context the AI generator consumes. This dual role removes a familiar failure mode: documentation and code diverging over time. When the document is the input that produces the code, the two are consistent by construction at the moment of generation.

A project constitution that encodes constraints

Above the per-feature specifications sits a single governing document, loaded into every generation session. It carries the project’s non-negotiable architectural rules: the Clean Architecture layering, the chosen state management pattern, naming conventions, the patterns that are forbidden. This is the mechanism that enforces consistency across features and across developers. Without it, each generation run re-decides settled questions. With it, the architecture is a constant the model is not free to renegotiate.

Custom review agents

Generated code needs a gate. A custom review agent is an AI reviewer configured with the project’s specific standards rather than generic advice. It checks layer compliance, naming, state management structure, error handling, and security in defined passes, and returns a classified verdict rather than loose commentary. Its value is leverage. It applies project-specific scrutiny to every file at a speed no human reviewer matches, which frees human attention for the judgment calls that genuinely need it. As a later section shows, this pillar is also where the workflow’s hardest engineering problem appeared.

Domain-specific skills

A skill is a focused capability that handles one recurring, rule-heavy task well: generating screens from a template library, or producing localization entries across many languages with correct plural handling per language. Skills encode the project’s accumulated knowledge about a narrow problem so the generator does not improvise it each time. They are the difference between an output that follows the house pattern and one that merely resembles it.
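Plural handling shows why this kind of rule-heavy task is better encoded than improvised. The sketch below assumes localization entries are held as a mapping from locale to plural category to text. The `REQUIRED_PLURALS` table is a small hand-copied subset of the Unicode CLDR cardinal plural categories, and the function name is hypothetical.

```python
# Cardinal plural categories required per locale (subset of Unicode CLDR).
REQUIRED_PLURALS = {
    "en": {"one", "other"},
    "ru": {"one", "few", "many", "other"},
    "ar": {"zero", "one", "two", "few", "many", "other"},
    "ja": {"other"},
}


def missing_plural_forms(entries: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each locale to the plural categories its entry fails to provide."""
    gaps = {}
    for locale, required in REQUIRED_PLURALS.items():
        provided = set(entries.get(locale, {}))
        missing = sorted(required - provided)
        if missing:
            gaps[locale] = missing
    return gaps
```

An entry that supplies only `one` and `other` is complete for English but incomplete for Russian, which is exactly the mistake a generator makes when it copies the English shape into every language.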

How the pipeline runs

Put the pillars together and feature delivery becomes a multi-stage pipeline, each stage with a clear input and a clear output.

It opens with design and a specification draft. A developer designs the screens, then writes the instruction document: interface contracts, component hierarchies, the state model, behavioral rules, success criteria. This is the most demanding human stage, and deliberately so. Effort spent here is effort the model does not have to guess at later.

The specification is then reviewed before it generates anything. The review agent validates the document itself, catching contradictions in contracts and gaps in layer definitions while they are still cheap to fix. Reviewing the blueprint is far cheaper than reviewing the building.
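The two checks named here, gaps in layer definitions and contradictions in contracts, can be illustrated with a small sketch. It assumes a hypothetical specification held as a dictionary with a `layers` mapping and a `contracts` list; the real review agent and document format are not shown in this article.

```python
REQUIRED_LAYERS = ("data", "domain", "presentation")


def review_spec(spec: dict) -> list[str]:
    """Return findings for empty layers and contracts declared two different ways."""
    findings = []

    layers = spec.get("layers", {})
    for layer in REQUIRED_LAYERS:
        if not layers.get(layer):
            findings.append(f"gap: layer '{layer}' has no definitions")

    seen = {}  # contract name -> first signature declared for it
    for contract in spec.get("contracts", []):
        name, signature = contract["name"], contract["signature"]
        if name in seen and seen[name] != signature:
            findings.append(
                f"contradiction: '{name}' declared as '{seen[name]}' and '{signature}'"
            )
        seen.setdefault(name, signature)

    return findings
```

An empty result means the document passed these two checks only; it says nothing about whether the specification is complete in other respects.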

Generation follows. The model consumes the constitution, the feature specification, and the relevant technical references, and produces the full feature across every Clean Architecture layer at once.

Generation is followed by iterative human review. The developer reviews each screen, lists specific issues, returns them to the model, and verifies the fixes, usually across several rounds. A closing pass runs the review agent for a final structured verdict before the feature reaches a pull request.

What the measurements show

A workflow is only as credible as the numbers it can produce, and the honest numbers are more interesting than the optimistic ones. The figures here come from LiteBreeze’s own work: a Flutter rewrite of an asset and receipt management application, built using the spec-driven workflow described above. Across three completed sprints, with effort measured against estimate, the pattern is sharp and uneven.

Development work came in roughly 39% under estimate. Non-development work, the discussion, planning, and review, ran roughly 9% over estimate. Blended together, the total saving across completed sprints was about 23%. AI assistance did not lift the whole delivery lifecycle by a uniform amount. It compressed the coding hard and left the surrounding work heavier than planned.
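The three percentages are linked by simple arithmetic. If the blended figure is an estimate-weighted average of the two lines, which is an assumption about how it was computed, the rounded figures imply how the estimate was split between development and non-development work. The sketch below works that out; the function names are illustrative.

```python
def blended_saving(dev_share: float, dev_saving: float, non_dev_saving: float) -> float:
    """Estimate-weighted average saving; an overrun is a negative saving."""
    return dev_share * dev_saving + (1 - dev_share) * non_dev_saving


def implied_dev_share(blended: float, dev_saving: float, non_dev_saving: float) -> float:
    """Solve the weighted average for the development share of the estimate."""
    return (blended - non_dev_saving) / (dev_saving - non_dev_saving)


# Rounded figures from the text: 39% under, 9% over, 23% blended.
share = implied_dev_share(blended=0.23, dev_saving=0.39, non_dev_saving=-0.09)
print(f"implied development share of estimate: {share:.2f}")
```

Under that assumption, development accounts for about two-thirds of the estimated effort. Because the inputs are rounded, the result is an approximation, not a reported figure.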

The non-development overrun should not be read as a failure. It is the cost of building the workflow itself. A spec-driven pipeline does not exist for free; it is constructed in a one-time groundwork phase that defines the prompt and context system, writes the project constitution, configures the review agent, and builds the skills. That groundwork is non-development effort, and it landed inside the same sprints being measured. Separate the one-time build from the recurring per-sprint cost and the overrun reads as investment with a known payback, not as waste.

The development trend is the strongest evidence that the investment compounds. Across the three sprints, the development saving rose every time: roughly 30%, then 33%, then 50%. The workflow got better at its job as the infrastructure matured, and the curve points the same direction each sprint rather than wandering.

One discipline underwrites all of these figures: incomplete work is excluded. A fourth sprint was in progress when the data was cut, and including its partial numbers would have pushed the headline saving from 23% up toward 34%. The higher number would have been the more flattering one to publish. It would also have been wrong, because a sprint measured a third of the way through reports a saving it has not yet earned. The figures above are effort against estimate over completed work only, and they are not adjusted for code quality, which is measured separately.
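The effect of counting a partial sprint can be shown with a small sketch. The hours below are hypothetical, chosen only to make the arithmetic visible; they are not the project's data, and the `Sprint` type is an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sprint:
    estimate_hours: float
    actual_hours: float
    complete: bool


def saving(sprints: list[Sprint], completed_only: bool = True) -> float:
    """Fraction of estimated effort saved, over the sprints that qualify."""
    counted = [s for s in sprints if s.complete or not completed_only]
    estimate = sum(s.estimate_hours for s in counted)
    actual = sum(s.actual_hours for s in counted)
    return 1 - actual / estimate


# Hypothetical figures: the third sprint is in progress, so most of its
# estimated effort has simply not been spent yet.
sprints = [
    Sprint(100, 80, complete=True),
    Sprint(100, 70, complete=True),
    Sprint(100, 20, complete=False),
]
```

With these numbers `saving(sprints)` is 0.25, while `saving(sprints, completed_only=False)` rises to about 0.43, because unspent effort in the open sprint is counted as if it had been saved.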

The hardest problem was the reviewer

The review agent is the pillar that caused the most unplanned engineering, and the story is worth telling plainly because it generalizes.

The agent was built to score generated code and issue a verdict. In practice it did something a reviewer must never do: it gave different answers to the same question. Run it twice on identical code and it would surface different findings and produce a different score. The cause is not a bug in the usual sense. A language model samples its output probabilistically, so a task framed as open-ended judgment will vary between runs. Several smaller faults fed the same problem. The model’s attention drifted on large inputs, catching a violation on one run and missing it on the next. The boundary between a warning and a mere suggestion was treated as a matter of degree rather than a fixed rule. Worst of all, the numeric score was synthesized from a general impression rather than computed, so identical violations could yield different numbers.

The fix was not a better prompt. It was a reallocation of work between the model and ordinary tooling. Deterministic checks were moved off the model entirely: pattern-based triggers decide, in binary fashion, whether a rule is violated, and the score is now a mechanical formula over passed and failed checks rather than a figure the model invents. Each rule was given a stable coded identifier so the same violation is named the same way across every review. Both review passes are now always run to completion before any verdict is computed, which closed a gap where a critical security finding in the second pass could not block a merge. The agent’s qualitative judgment, the part where variation is acceptable, is all that remains the model’s job.
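The shape of that reallocation can be sketched in a few lines. The rule identifiers, patterns, pass numbers, and scoring formula below are hypothetical stand-ins, since the article does not publish the real ones. What the sketch shows is the structure: binary pattern triggers, stable identifiers, every rule evaluated before the verdict, and a score that is computed, not estimated.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str          # stable identifier reused across every review
    review_pass: int      # which review pass the rule belongs to
    path_pattern: str     # regex selecting the files the rule applies to
    content_pattern: str  # regex whose match means the rule is violated
    blocking: bool        # a failed blocking rule prevents the merge


# Hypothetical rules for illustration only.
RULES = (
    Rule("ARCH-001", 1, r"/domain/", r"import\s+['\"][^'\"]*/data/", True),
    Rule("STYLE-001", 1, r".", r"\bprint\(", False),
    Rule("SEC-001", 2, r".", r"(?i)(api_key|password)\s*=\s*['\"]", True),
)


def review(files: dict[str, str]) -> dict:
    """Evaluate every rule in every pass, then compute score and verdict."""
    failed = []
    for rule in RULES:  # no early exit: later passes always run
        violated = any(
            re.search(rule.path_pattern, path)
            and re.search(rule.content_pattern, text)
            for path, text in files.items()
        )
        if violated:
            failed.append(rule)
    score = round(100 * (len(RULES) - len(failed)) / len(RULES))
    verdict = "block" if any(rule.blocking for rule in failed) else "pass"
    return {"score": score, "failed": [r.rule_id for r in failed], "verdict": verdict}
```

Because nothing here is sampled, running `review` twice on the same files returns the same findings, the same score, and the same verdict.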

The lesson is an architectural one and it outlives this project. A language model is the wrong tool for a deterministic check. Asking it to decide whether a file imports a forbidden package, or to compute a score, invites variance into a place that must not have any. Those checks belong in linters, static analysis, and pre-commit hooks, where the same input always produces the same output. The model should be reserved for the judgment that genuinely needs judgment. A team that learns this early spends far less on remediation than one that discovers it after trusting a verdict that quietly changed between runs.

The honest trade-offs

Spec-driven generation buys consistency and a real, measured saving on development effort. It does not buy them for free, and an engineering team should weigh the costs plainly.

The infrastructure is a genuine front-loaded cost. The non-development overrun is not a rounding error; it is the visible price of building the prompt system, the constitution, the review agent, and the skills. A team that adopts this model should expect its first sprints to look worse on the blended number than steady-state sprints will, and should measure one-time and recurring cost separately or it will misread its own return.

Human review remains a bottleneck. Reviewing dozens of generated files per feature invites fatigue, and a tired reviewer misses real defects. One of the feature PRs on this project, for instance, had over 100 file changes. The volume that makes the approach attractive is the same volume that strains the gate meant to protect it.

Specification quality sets a hard ceiling on output quality. The model cannot exceed the precision of its input. Recurring review issues are usually not model failures but specification gaps, which means a team must treat its specification format as a product that improves over time.

Speed is not yet quality. The saving figures measure effort against estimate, not defect rates. Faster delivery of code that has not been independently verified for correctness is a narrower claim than it first appears, and a team should resist reading a development saving as a quality result.

Shared files become contention points. Centralized dependency injection registries, route tables, and localization files are touched by every feature, so parallel generation produces merge conflicts that offset part of the speed. This is a structural cost, addressable only by partitioning those files per feature.
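Partitioning can be as simple as letting each feature own a small fragment and deriving the central table from the fragments. The sketch below assumes hypothetical route fragments keyed by feature name; the function name and the example routes are illustrative.

```python
def assemble_routes(fragments: dict[str, dict[str, str]]) -> dict[str, str]:
    """Merge per-feature route fragments into one table, rejecting duplicates."""
    table = {}
    owners = {}  # route path -> feature that declared it
    for feature in sorted(fragments):  # sorted for a stable, repeatable result
        for path, screen in fragments[feature].items():
            if path in table:
                raise ValueError(
                    f"route {path!r} declared by both {owners[path]} and {feature}"
                )
            table[path] = screen
            owners[path] = feature
    return table
```

Two features generated in parallel each add their own fragment and never edit the same lines. A genuine collision, such as two features claiming the same route, surfaces as an explicit error instead of a merge conflict.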

The gap the workflow has not yet closed

The quality gate described so far has a blind spot, and the team that built this workflow names it openly. Both review passes inspect the code without running it. Pass one checks structure and rules; pass two applies qualitative judgment. Neither executes the feature. The first time generated code actually boots is when a human opens an emulator and tries it.

This leaves one class of defect invisible until manual review. AI-generated code can be architecturally clean, correctly layered, and free of lint violations while still being wired up wrong: a dependency never registered, a route never added, a repository whose every path returns the failure branch. Static review cannot see any of this, because nothing in a static read of correct-looking code reveals that it does not run.

The proposed remedy is a smoke test, a fast, shallow automated check that confirms a feature boots, renders, and executes its primary action. It is not full integration coverage. It is the minimum mechanical proof that the wiring holds. A smoke test belongs in the pipeline between the agent review fix step and the manual feature review, with one gating rule: a feature that fails smoke does not consume human review time. It returns to the fix step. The expensive human stage sits behind a cheap machine stage that runs first.
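The gating rule is small enough to state as code. The sketch below assumes a hypothetical `SmokeResult` with one flag per check and uses illustrative stage names. It shows only the routing decision and the pass rate that follows from it, not how a feature is actually booted or driven.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmokeResult:
    feature: str
    boots: bool
    renders: bool
    primary_action_ok: bool

    @property
    def passed(self) -> bool:
        """A feature passes only if all three checks succeed."""
        return self.boots and self.renders and self.primary_action_ok


def next_stage(result: SmokeResult) -> str:
    """Route a feature: human review only after a passing smoke test."""
    return "manual_feature_review" if result.passed else "fix_step"


def smoke_pass_rate(results: list[SmokeResult]) -> float:
    """Share of features that passed smoke; 0.0 when there are no results."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)
```

A feature that boots and renders but fails its primary action is routed back to the fix step, so no reviewer time is spent on it.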

The value of this addition fits the same principle that shaped the rest of the workflow. It is mechanical, binary, and needs no human, the same category as moving deterministic checks out of the review agent. It produces the first objective quality signal the workflow has, a per-feature smoke pass rate, which begins to answer the open question of whether faster delivery is also sound delivery. And it is the smallest viable form of a larger capability already on the roadmap: a device-testing agent that drives real emulators and simulators. A workflow this disciplined about specification and review still has its first execution gate left to build, and naming that gap honestly is itself part of the discipline.

Practices that hold the workflow together

Treat the specification format as a versioned product. Maintain feature-type templates for the common patterns, and improve the templates whenever a review surfaces a repeating issue.
Keep deterministic checks out of the model. Pattern matching, import rules, and scoring belong in linters, static analysis, and pre-commit hooks; reserve the AI reviewer for qualitative judgment.
Measure one-time and recurring cost separately. Groundwork is built once; folding it into per-sprint figures hides the true steady-state return.
Exclude incomplete work from headline metrics. A partially finished sprint reports a saving it has not earned, and publishing it overstates the result.
Generate tests in the same session as the implementation. A Clean Architecture codebase is testable by construction; if the specification states test expectations, the model can produce verification alongside the feature.
Partition shared files by feature. Per-feature registration and routing fragments let parallel generated features merge without colliding in the same central file.
Gate human review behind a smoke test. A fast automated check that the feature boots, renders, and runs its primary action catches wiring defects that static review cannot see, and keeps reviewer time off code that does not run.

The blueprint, revisited

A blueprint does not pour concrete. It decides, in advance, what the concrete will become, and a good one makes the construction phase almost mechanical. Spec-driven AI development moves the same decision forward. The model pours fast, and the specification, the constitution, the review agent, and the skills are the drawing that determines whether the result is sound.

The numbers bear this out. On the Flutter project that LiteBreeze delivered, development effort came in nearly 40% under estimate and the saving grew every sprint, the surrounding work carried the one-time cost of building the drawing, and the workflow’s hardest problem was solved not by a cleverer prompt but by moving deterministic work to deterministic tools. The instinct to measure AI assistance by how fast the model writes code is the instinct of someone watching the concrete pour and ignoring the drawing. The question worth asking of any AI-assisted workflow is not how fast it generates. It is how precisely it can describe what it wants before generation begins.
