Zero-Hallucination Architecture: How Evidence-Native AI Actually Works

Michael Read
Apr 17

"Zero-hallucination" is not a marketing claim. It is a property you engineer for. If you're using an LLM to generate architecture documents, diagrams, or compliance reports and your answer to "how do you prevent hallucination?" is "we tell it not to hallucinate" or "we use RAG" — you're going to publish things that are wrong and call them architecture.

This post is about what it actually takes. Not vibes. Pipeline.

Why LLMs hallucinate on architecture tasks

Three reasons, in descending order of severity:

Architecture is compositional. "The payments service talks to the user service over gRPC" is four facts stacked: the services exist, they are connected, the protocol is gRPC, the direction is payments → user. An LLM can get any one of the four wrong while producing fluent prose. Fluency is not truth.
Training data is stale and generic. The model has seen a million diagrams of generic three-tier apps. When you ask about your system, it will pattern-match and fabricate what a "typical" system would look like.
No native notion of evidence. An LLM doesn't know the difference between "I read this in your OpenAPI spec" and "I inferred this from the naming convention." To the model, it's all tokens.

You cannot prompt-engineer these away. You have to architect around them.
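To make the compositional point concrete, here is a minimal sketch that splits the gRPC sentence above into its four stacked facts, so each can be checked on its own. The `AtomicFact` type and `decompose_call` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AtomicFact:
    kind: str                     # "exists", "connected", "protocol", or "direction"
    subject: Tuple[str, ...]      # the services the fact is about
    value: Optional[str] = None   # payload for facts that carry one


def decompose_call(caller: str, callee: str, protocol: str) -> list:
    """Split 'caller talks to callee over protocol' into four checkable facts."""
    pair = (caller, callee)
    return [
        AtomicFact("exists", pair),               # both services exist
        AtomicFact("connected", pair),            # there is a link between them
        AtomicFact("protocol", pair, protocol),   # the link uses this protocol
        AtomicFact("direction", pair, f"{caller} -> {callee}"),
    ]


facts = decompose_call("payments", "user", "gRPC")
for fact in facts:
    print(fact.kind, fact.subject, fact.value)
```

Any one of these four can be wrong while the sentence still reads fine, which is why each needs its own evidence.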

The evidence-native pipeline

An evidence-native system treats every architectural claim as a triple: (assertion, source, state). The source is non-negotiable. The state is one of four — evidenced, inferred, missing, conflicting. The pipeline that produces the triples looks like this:

Ingest → Deterministic extract → LLM extract (schema-constrained) → Merge → Critic → Human review → Publish

Each stage has a specific job, and the order matters.
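The triple itself is small enough to sketch. This is one possible shape, not a reference implementation: the `Claim` and `State` names are assumptions, and so is the choice to reject an empty source at construction time (for a `missing` claim, the source could point at the artifact that was searched).

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    EVIDENCED = "evidenced"
    INFERRED = "inferred"
    MISSING = "missing"
    CONFLICTING = "conflicting"


@dataclass(frozen=True)
class Claim:
    assertion: str
    source: str
    state: State

    def __post_init__(self):
        # "The source is non-negotiable": refuse to build a claim without one.
        if not self.source.strip():
            raise ValueError(f"claim has no source: {self.assertion!r}")


claim = Claim(
    assertion="payments calls user over gRPC",
    source="api/payments.proto#L3",   # hypothetical path, for illustration
    state=State.EVIDENCED,
)
print(claim.state.value)
```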

Stage 1 — Ingest

Documents (Word, PDF, Confluence, Markdown), code (Python, Java, TypeScript), schemas (DDL), APIs (OpenAPI, gRPC, AsyncAPI), IaC (Terraform, CloudFormation, Kubernetes), cloud inventory (AWS Config, Azure Resource Graph, GCP Asset Inventory), and runtime observations (tracing, logs, OpenTelemetry). Each ingested artifact is tagged with an evidence kind — we don't treat a PDF the same as a DDL statement.
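Tagging can start as something very plain. The sketch below assigns an evidence kind from the file extension; the `EvidenceKind` names and the suffix table are assumptions for illustration. Extensions alone are not enough in practice (a `.yaml` file could be OpenAPI or Kubernetes), so ambiguous files fall through to `UNKNOWN` here and would need content inspection.

```python
from enum import Enum
from pathlib import PurePosixPath


class EvidenceKind(Enum):
    DOCUMENT = "document"
    CODE = "code"
    SCHEMA = "schema"
    API = "api"
    IAC = "iac"
    UNKNOWN = "unknown"


# Only unambiguous suffixes are listed; everything else needs a closer look.
_SUFFIX_TO_KIND = {
    ".pdf": EvidenceKind.DOCUMENT,
    ".docx": EvidenceKind.DOCUMENT,
    ".md": EvidenceKind.DOCUMENT,
    ".py": EvidenceKind.CODE,
    ".java": EvidenceKind.CODE,
    ".ts": EvidenceKind.CODE,
    ".sql": EvidenceKind.SCHEMA,
    ".proto": EvidenceKind.API,
    ".tf": EvidenceKind.IAC,
}


def tag_artifact(path: str) -> EvidenceKind:
    """Return the evidence kind for a path, or UNKNOWN if the suffix is ambiguous."""
    suffix = PurePosixPath(path).suffix.lower()
    return _SUFFIX_TO_KIND.get(suffix, EvidenceKind.UNKNOWN)


for path in ["docs/HLD.PDF", "db/schema.sql", "deploy/service.yaml"]:
    print(path, tag_artifact(path).value)
```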

Stage 2 — Deterministic extraction

Before an LLM touches anything, rules-based extractors run. DDL parsing produces table definitions. OpenAPI parsing produces endpoint contracts. Terraform parsing produces resources. Python AST walks produce module graphs. These extractors cannot hallucinate: they are deterministic parsers reading formally specified grammars, not statistical models. Their output is structured and cited by construction.

This is the part most "AI architecture" tools skip. It is the part that matters most.
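As an illustration of "cited by construction", here is a small sketch of one such extractor using Python's standard `ast` module. It emits one claim per import statement, each carrying the file and line it came from. The claim dictionary layout and the `extract_imports` name are assumptions; the file path in the usage example is hypothetical.

```python
import ast


def extract_imports(source: str, filename: str) -> list:
    """Return one cited claim per absolute import found in `source`."""
    claims = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]   # bare relative imports (module=None) are skipped
        else:
            continue
        for target in targets:
            claims.append({
                "assertion": f"{filename} imports {target}",
                "source_ref": f"{filename}#L{node.lineno}",  # citation comes from the parser
                "state": "evidenced",
            })
    return claims


sample = "import os\nfrom billing.client import charge\n"
for claim in extract_imports(sample, "payments/api.py"):
    print(claim["source_ref"], "->", claim["assertion"])
```

The citation is not something the extractor decides to add. It falls out of the parse, which is the whole point.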

Stage 3 — LLM extraction, schema-constrained

Now the LLM runs. Two rules:

It is not extracting facts from nothing. It is extracting from the deterministic pre-extraction plus the source document. The context is anchored in real evidence.
Output is schema-constrained. Not free-form prose. Structured output with required fields for assertion, source_ref, confidence, and state. If the LLM can't cite, it can't emit.

Schema-constrained output is the single biggest anti-hallucination lever available. It turns the LLM from an unreliable narrator into a structured data emitter. If you're using function-calling or structured output modes (OpenAI's Structured Outputs with a JSON Schema, Anthropic's tool use), you're halfway there. The other half is making the schema require evidence fields.
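Here is roughly what "the schema requires evidence fields" could look like. The four field names come from the rules above; the exact schema, the 0 to 1 confidence range, and the `schema_errors` helper are assumptions. The helper checks only the handful of JSON Schema keywords used here so the sketch stays dependency-free; a real pipeline would more likely pass the same schema to the provider's structured-output mode and to a full JSON Schema validator.

```python
CLAIM_SCHEMA = {
    "type": "object",
    "required": ["assertion", "source_ref", "confidence", "state"],
    "additionalProperties": False,
    "properties": {
        "assertion": {"type": "string", "minLength": 1},
        "source_ref": {"type": "string", "minLength": 1},  # no citation, no claim
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "state": {"enum": ["evidenced", "inferred", "missing", "conflicting"]},
    },
}


def schema_errors(payload: dict, schema: dict = CLAIM_SCHEMA) -> list:
    """Validate `payload` against the small schema subset used above."""
    errors = [f"missing required field: {name}"
              for name in schema["required"] if name not in payload]
    for name, value in payload.items():
        rule = schema["properties"].get(name)
        if rule is None:
            errors.append(f"unexpected field: {name}")
        elif rule.get("type") == "string":
            if not isinstance(value, str) or len(value) < rule.get("minLength", 0):
                errors.append(f"invalid string: {name}")
        elif rule.get("type") == "number":
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not rule["minimum"] <= value <= rule["maximum"]:
                errors.append(f"invalid number: {name}")
        elif "enum" in rule and value not in rule["enum"]:
            errors.append(f"invalid value: {name}")
    return errors


uncited = {"assertion": "payments uses Kafka", "confidence": 0.8, "state": "evidenced"}
print(schema_errors(uncited))
```

A payload that fails this check never becomes a claim, however fluent the sentence inside it.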

Stage 4 — Dual-path merge

Now you have two bodies of claims: deterministic and LLM. They're merged into a single model. Agreement between paths raises confidence. Disagreement surfaces as a conflicting state — explicitly, not silently averaged away.

This is where most AI tools lose. They collapse conflict into a single confident-sounding sentence. An AIP preserves it. A compliance team would rather see "two sources disagree" than "a confident sentence that happens to be wrong."
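A minimal sketch of the merge rule, under stated assumptions: claims from both paths share a `key` that says what they are about, agreement adds a fixed boost to confidence, and disagreement keeps both candidates. The `PathClaim` type, the key format, and the 0.2 boost are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathClaim:
    key: str            # what the claim is about, e.g. "payments->user.protocol"
    value: str
    source_ref: str
    confidence: float
    state: str = "evidenced"


def merge_paths(deterministic, llm, agreement_boost=0.2):
    """Merge two claim lists by key; disagreement becomes an explicit conflict."""
    det = {c.key: c for c in deterministic}
    gen = {c.key: c for c in llm}
    merged = {}
    for key in sorted(det.keys() | gen.keys()):
        d, g = det.get(key), gen.get(key)
        if d is not None and g is not None and d.value != g.value:
            # Keep both sides; never pick a winner silently.
            merged[key] = {"state": "conflicting", "candidates": [d, g]}
        elif d is not None and g is not None:
            boosted = min(1.0, max(d.confidence, g.confidence) + agreement_boost)
            merged[key] = {"state": d.state, "value": d.value, "confidence": boosted,
                           "sources": [d.source_ref, g.source_ref]}
        else:
            only = d if d is not None else g
            merged[key] = {"state": only.state, "value": only.value,
                           "confidence": only.confidence, "sources": [only.source_ref]}
    return merged


det = [PathClaim("payments->user.protocol", "gRPC", "api/payments.proto#L3", 0.9)]
llm = [PathClaim("payments->user.protocol", "REST", "docs/hld.md#L40", 0.6)]
print(merge_paths(det, llm)["payments->user.protocol"]["state"])
```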

Stage 5 — The critic

After merge, a critic pass runs. In Vaelith the critic is currently observe-only; later phases make it blocking. The critic checks:

Every claim has a source reference.
Every source reference resolves to a real, still-existing artifact.
No claim asserts a relationship between entities that don't both exist in the model.
Inferred claims have explicit reasoning captured.
Conflicting claims are surfaced, not merged.

Once the critic is blocking, an artifact that fails it does not publish. This is invariant #6, the human gate, enforced in software before the artifact reaches a human.
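The five checks translate almost line for line into code. This is a sketch, not Vaelith's critic: the claim dictionary fields (`key`, `value`, `entities`, `reasoning`), the `run_critic` and `may_proceed` names, and the convention that a source reference is `path#Lline` are all assumptions.

```python
from collections import defaultdict


def run_critic(claims, known_entities, live_artifacts):
    """Return a list of violations; an empty list means the pass succeeded."""
    violations = []
    values_by_key = defaultdict(set)
    for i, claim in enumerate(claims):
        ref = claim.get("source_ref", "")
        if not ref:
            violations.append(f"claim {i}: no source reference")
        elif ref.split("#", 1)[0] not in live_artifacts:
            violations.append(f"claim {i}: source {ref} does not resolve")
        for entity in claim.get("entities", []):
            if entity not in known_entities:
                violations.append(f"claim {i}: unknown entity {entity}")
        if claim.get("state") == "inferred" and not claim.get("reasoning"):
            violations.append(f"claim {i}: inferred without reasoning")
        if claim.get("key") is not None:
            values_by_key[claim["key"]].add(claim.get("value"))
    # Two values for the same key must both be marked conflicting.
    for i, claim in enumerate(claims):
        key = claim.get("key")
        if key is not None and len(values_by_key[key]) > 1:
            if claim.get("state") != "conflicting":
                violations.append(f"claim {i}: disagreement not marked conflicting")
    return violations


def may_proceed(violations, blocking):
    """Observe-only mode records violations but lets the artifact through."""
    return not (blocking and violations)


claims = [{"key": "ledger.exists", "value": "true", "source_ref": "",
           "entities": ["ledger"], "state": "evidenced"}]
found = run_critic(claims, {"payments", "user"}, {"api/payments.proto"})
print(found, may_proceed(found, blocking=True))
```

The `blocking` flag is what separates the observe-only phase from the later blocking one: same checks, different consequence.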

Stage 6 — Human review

Schema-valid, critic-passing artifacts enter a review workflow with quorum and threaded comments. Humans see exactly what changed, what's evidenced, what's inferred, and what's conflicting. They approve, reject, or request revision. The review audit chain is content-hashed; tampering is detectable.

Only then does the artifact publish.
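One common way to get a tamper-evident audit trail is a hash chain, where each review event is hashed together with the hash of the event before it. The sketch below assumes SHA-256 and a JSON event payload; the post does not say which scheme the review chain actually uses, and the function names and event fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, event: dict) -> str:
    """Hash an event together with the hash of the entry before it."""
    payload = json.dumps({"prev": prev_hash, "event": event},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(chain: list, event: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev": prev, "hash": entry_hash(prev, event)})


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


chain = []
append_event(chain, {"reviewer": "alice", "decision": "request_revision"})
append_event(chain, {"reviewer": "bob", "decision": "approve"})
print(verify_chain(chain))

chain[0]["event"]["decision"] = "approve"   # rewrite history
print(verify_chain(chain))
```

Changing an earlier decision invalidates that entry's hash and every hash after it, which is what makes "what did they see?" answerable later.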

What this buys you

Four properties that are impossible to achieve with an LLM-only approach:

Verifiability. Every sentence in a generated HLD can be clicked back to the source line that justifies it.
Auditability. Every change is in an immutable review chain. You can answer "why did this claim change, who approved it, and what did they see?"
Graceful ignorance. When the sources don't say something, the model says "missing" instead of making it up. This sounds obvious. In practice, almost no AI tool does it.
Bounded LLM risk. If the LLM hallucinates, the critic catches it before a human sees it. If the critic misses, the human sees a citation that doesn't resolve and rejects it. Hallucinations don't reach production.

What this does not buy you

Honesty: this architecture is not free. It is more expensive to build than "just wrap an LLM around your docs." It produces fewer words per dollar. The diagrams take longer to regenerate. You have to maintain schemas and extractors.

In exchange, you get artifacts that can withstand an auditor, a regulator, a new CTO, or a post-incident review. For enterprise architecture, that's the only kind of artifact worth generating.

The anti-pattern to avoid

If anyone selling you an "AI architecture platform" can't answer these four questions, they are selling you a hallucination generator with a blue gradient:

What happens when the document doesn't say something? (Correct answer: it's marked missing.)
What happens when two documents disagree? (Correct answer: both are preserved and flagged as conflicting.)
How does a reader trace a claim back to its source? (Correct answer: a citation that resolves to a specific line.)
What prevents the LLM from emitting a claim with no evidence? (Correct answer: the output schema rejects it.)

No hand-wavy answers. No "we use prompt engineering." Those four questions separate the serious tools from the demos.
