
Fixing AI quality without changing models: our sovereign stack rebuild

There is a widespread assumption in legal AI that quality is primarily a function of model selection: that the path to better answers runs through larger, more expensive language models. We spent a sprint proving that assumption wrong. In a single focused rebuild, PrivateNode lifted usable answer quality from 43% to 63% and keyword accuracy from 89% to 96%, without changing the underlying LLM.

The quality problem we were hiding from ourselves

Our original evaluation framework relied heavily on keyword accuracy: does the response mention the correct statute, the right section number, the relevant legal term? By that measure, we were at 89% and feeling confident. Then we introduced qualitative evaluation—an LLM-as-judge layer that rates each answer as GOOD, ACCEPTABLE, POOR, or FAIL based on whether a professional could actually use it. The result was sobering: only 43% of answers were genuinely usable.

The gap between 89% and 43% is the gap between mentioning the right terms and actually providing a useful answer. A response that says "Section 188 of TULRCA 1992 covers collective redundancy consultation" scores as a keyword pass. But if it fails to specify the 45-day and 30-day thresholds, or omits the penalties for non-compliance, it is not an answer a solicitor can use. We needed to close that gap.
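
As a rough sketch of what that dual evaluation looks like, the snippet below runs both checks side by side. The rubric wording, grade handling, and the call_llm helper are illustrative placeholders rather than our production judge; the usable-quality figure is simply the share of answers graded GOOD or ACCEPTABLE.

```python
from dataclasses import dataclass

GRADES = ("GOOD", "ACCEPTABLE", "POOR", "FAIL")

JUDGE_PROMPT = """You are grading a legal AI answer for professional usability.

Question: {question}
Answer: {answer}

Grade the answer GOOD, ACCEPTABLE, POOR, or FAIL. GOOD or ACCEPTABLE means
a solicitor could act on it: it cites the right provision AND states the
operative detail (thresholds, time limits, penalties), not just the statute
name. Reply with the grade only."""

@dataclass
class EvalResult:
    keyword_pass: bool  # mentions the expected statute / section / terms
    grade: str          # LLM-as-judge usability grade

def evaluate(question: str, answer: str, expected_keywords: list[str], call_llm) -> EvalResult:
    """Run both checks; call_llm is an injected text-in, text-out client."""
    keyword_pass = all(k.lower() in answer.lower() for k in expected_keywords)
    grade = call_llm(JUDGE_PROMPT.format(question=question, answer=answer)).strip().upper()
    return EvalResult(keyword_pass, grade if grade in GRADES else "FAIL")
```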

Six fixes that moved the needle

1. System prompt architecture. We rebuilt every specialist prompt using XML structure (<role>, <instructions>, <clarification_policy>) and replaced over 20 negative guardrails ("NEVER do X") with a clear affirmative policy: answer first, caveat later. Partial information is still useful. The old prompts were steering the model toward refusal rather than toward attempting an answer.
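
A simplified skeleton of that structure is below. The specialist, tag contents, and wording are illustrative stand-ins, not a verbatim production prompt.

```python
# Simplified skeleton of the rebuilt prompt structure. The tag names match
# the ones described above; the specialist and wording are illustrative.
EMPLOYMENT_SPECIALIST_PROMPT = """\
<role>
You are an employment law specialist answering questions for UK legal professionals.
</role>

<instructions>
Answer first, caveat later. Lead with the operative detail: section numbers,
time limits, thresholds, penalties. Partial information is still useful;
state what you know and flag what is missing.
</instructions>

<clarification_policy>
Ask a clarifying question only when the answer would change materially
depending on the missing fact. Otherwise answer for the common case and
note the assumption.
</clarification_policy>
"""
```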

2. A silent data bug. Our legislation filter queries contained a regex compatibility issue that silently returned zero results for all family law, probate, and company law queries. Every question in those domains was searching the entire corpus instead of the relevant statutes. A single query fix resolved the issue overnight — a reminder that the most damaging bugs are the ones that fail silently.
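
The snippet below is not the actual query. It is an illustrative sketch of the failure class, together with the kind of smoke test that turns this sort of silent failure into a loud one.

```python
import re

# Illustration of the failure class, not our actual query: the filter
# pattern assumes hyphenated domain labels ("family-law"), but the index
# keys use underscores ("family_law"), so the regex never matches and
# every query quietly falls back to searching the whole corpus.
LEGISLATION_INDEX = {
    "family_law": ["Children Act 1989", "Matrimonial Causes Act 1973"],
    "company_law": ["Companies Act 2006"],
}

def filter_statutes(domain: str) -> list[str]:
    pattern = re.compile(rf"^{domain}-law$")  # wrong separator, never matches
    return [statute
            for key, statutes in LEGISLATION_INDEX.items()
            if pattern.match(key)
            for statute in statutes]

# The guard that makes this failure loud instead of silent: known-good
# domains must return at least one statute.
for d in ("family", "company"):
    assert filter_statutes(d), f"legislation filter returned nothing for {d!r}"
```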

3. Tool consolidation. We reduced from 56 to 36 tools by replacing 20 individual calculator functions with a single unified dispatcher. A smaller tool set means fewer routing errors and faster inference.
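
In rough outline, the change looks like this; the calculation names and formulas are illustrative placeholders rather than the real calculators.

```python
from typing import Callable

# Sketch of the dispatcher pattern; calculation names and formulas are
# illustrative stand-ins for the 20 calculators it replaced.
CALCULATORS: dict[str, Callable[..., float]] = {
    "holiday_pay": lambda weekly_pay, weeks: weekly_pay * weeks,
    "interest_simple": lambda principal, rate, years: principal * rate * years,
}

def calculate(calculation: str, **params) -> float:
    """The single tool exposed to the model: one schema, one entry point,
    dispatching internally instead of 20 separate tool definitions."""
    try:
        return CALCULATORS[calculation](**params)
    except KeyError:
        raise ValueError(
            f"Unknown calculation {calculation!r}; available: {sorted(CALCULATORS)}"
        )
```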

4. Contextual retrieval, done correctly. Our first attempt at contextual retrieval prepended enriched context directly to each document chunk. Quality dropped 11 percentage points because the AI was confused by metadata mixed into the source text. Version 2 separated the two concerns: enriched text is used for search and retrieval, while the AI sees the original clean source when generating answers. That change alone added 3 percentage points.
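
A minimal sketch of the v2 separation is below, with embed and vector_store standing in for whatever embedding model and vector database are in use rather than any specific library's API.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    original_text: str  # clean source text, returned to the LLM
    enriched_text: str  # original text plus situating context, used only for retrieval

def index_chunk(chunk: Chunk, embed, vector_store) -> None:
    # Embed the enriched text so the added context improves recall...
    vector_store.add(
        id=chunk.chunk_id,
        vector=embed(chunk.enriched_text),
        payload={"original_text": chunk.original_text},
    )

def retrieve_for_generation(query: str, embed, vector_store, k: int = 5) -> list[str]:
    # ...but hand only the clean original text to the model, so generation
    # prompts are never polluted with retrieval metadata.
    hits = vector_store.search(vector=embed(query), limit=k)
    return [hit.payload["original_text"] for hit in hits]
```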

5. Cross-panel routing. Multi-domain queries ("What are the tax implications of selling a leasehold property?") now route across specialist panels automatically, combining responses from Tax, Property, and Legal specialists instead of forcing a single specialist to answer outside its domain.
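
A minimal sketch of the fan-out is below; classify_domains, ask_panel, and synthesise are hypothetical stand-ins for the real router, specialist panels, and answer-combination step.

```python
import asyncio

# Sketch of the cross-panel fan-out; the injected callables are
# hypothetical stand-ins for the real routing and synthesis logic.
async def answer(query: str, classify_domains, ask_panel, synthesise) -> str:
    # 1. Decide which specialist panels the query touches, e.g. ["tax", "property"].
    domains = list(classify_domains(query))
    # 2. Ask every relevant panel in parallel instead of forcing one
    #    specialist to answer outside its domain.
    partial_answers = await asyncio.gather(*(ask_panel(d, query) for d in domains))
    # 3. Combine the partial answers into a single response.
    return synthesise(query, dict(zip(domains, partial_answers)))
```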

6. Measuring false refusals. We introduced a new metric: AQRR (Answerable Question Refusal Rate)—the rate at which the system refuses to answer questions it demonstrably has the data to answer. If a user asks about redundancy consultation periods and the system has TULRCA 1992 s.188 indexed, a refusal is a quality failure, not a safety success.
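
Computed roughly like this, with illustrative field names and a toy sample rather than real evaluation data:

```python
# Sketch of the AQRR computation; a question counts as answerable only if
# the corpus demonstrably contains the answer (e.g. TULRCA 1992 s.188 is indexed).
def aqrr(results: list[dict]) -> float:
    """Answerable Question Refusal Rate: refusals among questions the
    system has the indexed data to answer."""
    answerable = [r for r in results if r["source_indexed"]]
    if not answerable:
        return 0.0
    return sum(1 for r in answerable if r["refused"]) / len(answerable)

# Toy example: two of three answerable questions refused -> AQRR of 67%.
sample = [
    {"question": "Redundancy consultation periods?", "source_indexed": True, "refused": True},
    {"question": "Unfair dismissal time limit?",     "source_indexed": True, "refused": False},
    {"question": "Collective consultation penalty?", "source_indexed": True, "refused": True},
]
print(f"AQRR: {aqrr(sample):.0%}")  # AQRR: 67%
```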

Before and after

Metric (before → after)
Qualitative usable (GOOD + ACCEPTABLE): 43% → 63%
Keyword accuracy: 89% → 96%
Tools in context: 56 → 36
False refusal rate (AQRR): 67% → improved

The lesson

Model selection accounts for roughly 30% of answer quality in domain-specific AI. The other 70%—retrieval architecture, prompt engineering, tool design, data pipeline correctness, and evaluation infrastructure—is where most quality problems live and where most quality improvements come from. We proved this by holding the model constant and improving everything around it.

This is why we invest in evaluation infrastructure, not just model upgrades. A 20 percentage point quality improvement with no change to the underlying model is a stronger foundation than chasing the next model release.

Want to see how our evaluation framework works in practice?

Get in touch