How we measure quality in legal AI (and why most vendors don’t)
Most legal AI vendors report accuracy metrics based on keyword overlap. The test is simple: does the response mention the right statute name, the correct section number, the relevant legal term? If the answer mentions "Employment Rights Act 1996," it scores as a pass. The problem is that a response saying "I cannot provide specific information about the Employment Rights Act 1996" also scores as a pass. The keyword is present; the answer is useless.
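To make the failure concrete, here is a minimal sketch of keyword-overlap scoring (the function and example strings are ours, for illustration). Note which response passes:

```python
def keyword_pass(response: str, expected_keywords: list[str]) -> bool:
    """Pass if every expected keyword appears anywhere in the response."""
    text = response.lower()
    return all(kw.lower() in text for kw in expected_keywords)

expected = ["Employment Rights Act 1996"]
useful = ("Under s.86 of the Employment Rights Act 1996, the statutory "
          "minimum notice is one week per complete year of service, "
          "capped at twelve weeks.")
refusal = ("I cannot provide specific information about the "
           "Employment Rights Act 1996.")

print(keyword_pass(useful, expected))   # True
print(keyword_pass(refusal, expected))  # True: the useless refusal passes too
```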
This is not a hypothetical failure mode. When we first evaluated our own system using keyword accuracy alone, we scored 89%. When we then assessed the same responses qualitatively, asking whether a practitioner could actually use the answer, the usable rate was 43%. Nearly half of all responses mentioned the right terms but either refused to elaborate, provided incomplete information, or hedged so aggressively that the substance was lost. That is the 46-point gap between the two numbers.
A 5-layer evaluation framework
We now evaluate PrivateNode across five distinct layers, each measuring a different aspect of system quality.
Layer 1: Routing evaluation. Does the question reach the right specialist? A query about collective redundancy consultation should route to the Employment Law specialist, not the General Assistant. Routing errors mean the downstream answer is built on the wrong knowledge base, regardless of how good the model is.
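A layer-1 check can be as simple as comparing the router's choice against a labelled query set. A minimal sketch, assuming a `route(query) -> str` function; the specialist names are illustrative, not our production identifiers:

```python
# Labelled routing cases: query -> expected specialist (names illustrative).
ROUTING_CASES = [
    ("What is the consultation period for 30 proposed redundancies?",
     "employment_law"),
    ("What SDLT applies to a second residential property?", "tax"),
]

def routing_accuracy(route, cases=ROUTING_CASES) -> float:
    """Fraction of queries that reach the expected specialist."""
    hits = sum(1 for query, expected in cases if route(query) == expected)
    return hits / len(cases)
```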
Layer 2: Keyword accuracy. Does the answer contain the expected terms, statute references, and section numbers? This is necessary but not sufficient. It catches gross retrieval failures but does not measure answer quality.
Layer 3: Qualitative evaluation. An LLM-as-judge rates each answer as GOOD, ACCEPTABLE, POOR, or FAIL based on completeness, accuracy, and practical utility. This is where the real quality signal lives. A GOOD answer provides specific, actionable information that a professional could rely on. An ACCEPTABLE answer is directionally correct but incomplete. POOR and FAIL answers are not usable.
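A sketch of the judge step. The rubric mirrors the rating scale above; `complete` stands in for whatever judge-model client is used (the wiring is not shown, and the prompt wording is illustrative):

```python
RATINGS = ("GOOD", "ACCEPTABLE", "POOR", "FAIL")

JUDGE_PROMPT = """You are reviewing an answer from a legal AI assistant.
Rate it on completeness, accuracy, and practical utility:
- GOOD: specific, actionable; a professional could rely on it.
- ACCEPTABLE: directionally correct but incomplete.
- POOR: too vague or hedged to act on.
- FAIL: refusal, off-topic, or materially wrong.
Reply with the rating word only.

Question: {question}
Answer: {answer}"""

def judge(complete, question: str, answer: str) -> str:
    """`complete` is any callable that sends a prompt to the judge model
    and returns its text reply."""
    reply = complete(JUDGE_PROMPT.format(question=question, answer=answer))
    words = reply.strip().upper().split()
    return words[0] if words and words[0] in RATINGS else "FAIL"
```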
Layer 4: Expert panel evaluation. Do multi-specialist queries route correctly across panels? A question about the tax treatment of a property transaction should draw on Tax, Property, and potentially Legal specialists. This layer tests the orchestration logic, not just individual specialist quality.
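The layer-4 check compares the set of specialists consulted against the set the scenario requires; a case passes only if no required specialist was missed. A sketch, with `route_panel` and the specialist names assumed for illustration:

```python
# Multi-specialist cases: query -> specialists that must be consulted.
PANEL_CASES = [
    ("What is the tax treatment of transferring a property into a company?",
     {"tax", "property"}),
]

def panel_accuracy(route_panel, cases=PANEL_CASES) -> float:
    """route_panel(query) returns the set of specialists consulted;
    a case passes only if it covers every required specialist."""
    hits = sum(1 for query, required in cases
               if required <= route_panel(query))
    return hits / len(cases)
```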
Layer 5: End-to-end cookbook workflows. Seventy real-world scenarios tested across all domains, from "Calculate SDLT on a £450,000 residential purchase by a first-time buyer" to "Summarise the directors and filing history of a specific company." These are not synthetic benchmarks—they are the actual queries professionals ask.
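Structurally, each cookbook scenario carries its own expectations for the downstream layers; a sketch of the shape, with field names that are illustrative rather than our actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class CookbookCase:
    """One end-to-end scenario, scored at layer 2 (keywords)
    and layer 3 (LLM judge)."""
    domain: str
    query: str
    expected_keywords: list[str] = field(default_factory=list)

CASES = [
    CookbookCase("tax",
                 "Calculate SDLT on a £450,000 residential purchase "
                 "by a first-time buyer",
                 ["SDLT", "first-time buyer"]),
]
```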
The false refusal problem
The most insidious quality failure in domain-specific AI is the false refusal: the system has the data to answer a question but responds with "I don't have enough information" or "I cannot provide specific legal advice." We track this with a dedicated metric called AQRR—Answerable Question Refusal Rate. If a user asks about redundancy consultation periods and the system has TULRCA 1992 s.188 indexed with the full text of the provision, a refusal is a quality failure, not a safety success.
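A sketch of the metric, assuming refusals are detected by surface markers (in practice the layer-3 judge also flags them; the marker list here is illustrative):

```python
REFUSAL_MARKERS = (
    "i don't have enough information",
    "i cannot provide specific legal advice",
    "i am unable to answer",
)

def is_refusal(answer: str) -> bool:
    """Crude surface check; the layer-3 judge also flags refusals."""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def aqrr(results: list[tuple[str, bool]]) -> float:
    """results: (answer, answerable) pairs, where answerable means the
    indexed sources contain what is needed to answer the question.
    AQRR = refusals on answerable questions / answerable questions."""
    answerable = [answer for answer, ok in results if ok]
    if not answerable:
        return 0.0
    return sum(is_refusal(a) for a in answerable) / len(answerable)
```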
False refusals are particularly common in legal AI because the standard approach to safety—adding guardrails like "NEVER provide legal advice" and "ALWAYS recommend consulting a solicitor"—trains the model to refuse rather than attempt. The result is a system that is technically safe but practically useless. Our approach is different: answer first, caveat later. Partial information is still useful. A system that provides the relevant statutory provision and notes that professional advice should be sought is more valuable than one that refuses to engage.
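The difference shows up directly in the system prompt. A sketch of the two framings; the wording is illustrative, not our production prompts:

```python
# Guardrail-first framing, which trains the model to refuse:
GUARDRAIL_PROMPT = (
    "NEVER provide legal advice. "
    "ALWAYS recommend consulting a solicitor."
)

# Answer-first framing: engage with the substance, then caveat.
ANSWER_FIRST_PROMPT = (
    "Answer with the specific provisions and practical detail available "
    "in the indexed sources. If the sources are incomplete, give the "
    "partial answer you can support. Close by noting that the answer is "
    "general information and that professional advice should be sought."
)
```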
Current results
As of our most recent evaluation cycle: 96% keyword accuracy across the 70 cookbook workflows, and a 63% qualitative usable rate (GOOD + ACCEPTABLE). The 33-percentage-point gap between those two numbers is exactly the gap between "mentioned the right terms" and "actually gave a useful answer." That gap is what we are closing, sprint by sprint, through retrieval improvements, prompt refinement, and better evaluation, not by swapping models.
When evaluating legal AI for your firm, ask the vendor for their qualitative evaluation results, not just keyword accuracy. Ask for their false refusal rate. If they cannot show you those numbers, they are not measuring what matters.
Want to see our evaluation results in detail?
Get in touch