
Knowledge Base Evaluation Protocol

Last updated: 2026-05-09

Knowledge-base and retrieval-augmented generation systems need validation beyond generic answer quality. For safety, operations, and engineering research use, the system must retrieve the right sources, answer only from supported evidence, preserve uncertainty, and fail safely when the knowledge base lacks an answer.

This protocol is intended for evaluating internal knowledge-base assistants, RAG pipelines, and source-backed research workflows.


Safety Claim

For the validated knowledge domain and document corpus, the knowledge-base system returns answers grounded in retrievable authoritative sources, correctly represents uncertainty and temporal scope, cites the evidence it used, and refuses or escalates when the corpus does not support an answer.

The claim is bounded by corpus freshness, source authority, retrieval coverage, tool availability, model version, prompts, evaluation set, and task type. It does not claim open-web truth unless web retrieval is explicitly part of the validated system.


Hazards And Failures

Each failure mode is listed with its cause, its consequence, and the evidence required to rule it out.

Unsupported answer. Cause: the generator fills gaps without retrieved evidence. Consequence: hallucinated operational or engineering guidance. Required evidence: faithfulness and citation-support evaluation.

Wrong source retrieved. Cause: the retriever misses the authoritative document or ranks stale content higher. Consequence: plausible but incorrect answer. Required evidence: context recall, source authority checks, and retrieval audits.

Stale answer. Cause: the corpus or index is outdated. Consequence: incorrect current policy, standard, product, or procedure. Required evidence: freshness metadata and temporal test cases.

Over-refusal. Cause: the system says it cannot answer despite sufficient evidence. Consequence: lost utility and operator workaround risk. Required evidence: answerability and missing-answer analysis.

Under-refusal. Cause: the system answers when evidence is absent or contradictory. Consequence: false confidence. Required evidence: unanswerable and conflicting-source tests.

Citation laundering. Cause: a citation is present but does not support the claim. Consequence: hard-to-detect hallucination. Required evidence: claim-to-source entailment checks (see the entailment-check sketch below).

Long-tail entity failure. Cause: the retriever or generator fails on rare entities. Consequence: hidden accuracy drop. Required evidence: popularity and rarity stratification.

Multi-hop reasoning failure. Cause: the answer requires combining documents or tables. Consequence: incomplete or wrong synthesis. Required evidence: multi-hop evaluation set and step-level trace review.

Tool/API mismatch. Cause: search, KG, or document tools return partial data. Consequence: incorrect final answer from tool gaps. Required evidence: tool-call logs and API-contract tests.
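
Citation laundering is the hardest of these failures to catch by eye, because the answer looks well-sourced. The sketch below shows one way to run a claim-level entailment check; `judge_entailment` is a placeholder for whatever NLI model or LLM judge the evaluation stack provides, so its name and signature are assumptions rather than a specific library API.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str         # atomic factual statement extracted from the answer
    cited_chunk: str  # text of the source chunk the answer cites for it

def judge_entailment(claim: str, source: str) -> bool:
    """Placeholder for an NLI model or LLM judge that returns True only when
    the source text supports the exact claim, not merely the topic."""
    raise NotImplementedError

def citation_support(claims: list[Claim]) -> dict:
    """Claim-level citation precision: the share of cited claims whose cited
    source actually entails them. Unsupported claims are returned for review."""
    supported, unsupported = [], []
    for c in claims:
        (supported if judge_entailment(c.text, c.cited_chunk) else unsupported).append(c)
    total = len(claims) or 1
    return {
        "citation_precision": len(supported) / total,
        "unsupported_claim_rate": len(unsupported) / total,
        "unsupported_claims": [c.text for c in unsupported],
    }
```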

Evidence Required

Each evidence type and the minimum content it must contain:

Corpus manifest: document IDs, versions, owners, source URLs, ingestion time, effective dates, and authority level.

Retrieval test set: queries with gold documents, acceptable alternate documents, stale distractors, and no-answer cases.

Answer test set: question, reference answer, required citations, unacceptable claims, temporal scope, and grading rubric.

Long-tail coverage: rare entities, old documents, newly updated documents, acronyms, aliases, and domain-specific terminology.

Multi-hop cases: questions requiring cross-document synthesis, table lookup, or policy-plus-exception logic.

Adversarial cases: ambiguous questions, misleading phrasing, conflicting documents, outdated source traps, and unsupported-premise questions.

Human review set: expert-labeled examples for calibration and periodic evaluation of automatic graders.

Runtime traces: query, rewritten query, retrieved chunks, ranks, scores, prompts, model output, citations, tool calls, and refusal path.
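
The runtime trace is the artifact everything else hangs off, so it helps to fix its shape early. A minimal sketch of one trace record as a dataclass follows; the field names mirror the list above but are illustrative, not a required schema.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedChunk:
    doc_id: str      # document ID from the corpus manifest
    chunk_id: str
    rank: int
    score: float
    text: str

@dataclass
class TraceRecord:
    query_id: str
    query: str
    rewritten_query: str | None = None
    retrieved: list[RetrievedChunk] = field(default_factory=list)
    prompt: str = ""
    model_output: str = ""
    citations: list[str] = field(default_factory=list)    # doc/chunk IDs cited in the answer
    tool_calls: list[dict] = field(default_factory=list)  # name, arguments, result status
    refused: bool = False                                  # True when the system declined to answer
    versions: dict = field(default_factory=dict)           # model, prompt, index, corpus versions
```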

Metrics

Metrics are grouped by layer:

Retrieval: recall@k, precision@k, MRR, nDCG, gold-source coverage, authority-weighted recall, and stale-source rate.

Context quality: Ragas context precision, context recall, context entities recall, and noise sensitivity.

Generation: faithfulness, answer relevancy, factual correctness, exact match where appropriate, and rubric score.

Citation support: claim-level citation precision/recall, unsupported-claim rate, and citation span correctness.

Answerability: correct refusal rate, incorrect refusal rate, missing-answer rate, and unsupported-answer rate.

Temporal robustness: current-answer accuracy, stale-answer rate, effective-date handling, and time-sensitive query accuracy.

CRAG-style QA: perfect, acceptable, missing, and incorrect response categories; correct/missing/incorrect scoring where applicable.

Statistical confidence: confidence intervals over key metrics, especially when using ARES-style prediction-powered inference.
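
The rank-based retrieval metrics are simple enough to compute directly from the trace. A sketch follows, under the assumption that each test query carries a set of gold document IDs and the retriever returns an ordered list of document IDs.

```python
import math

def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Share of gold documents that appear in the top-k retrieved results."""
    hits = gold_ids & set(retrieved_ids[:k])
    return len(hits) / len(gold_ids) if gold_ids else 0.0

def mrr(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    """Reciprocal rank of the first gold document, 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Binary-relevance nDCG: gold documents count as relevance 1, everything else 0."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved_ids[:k], start=1)
              if doc_id in gold_ids)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(gold_ids), k) + 1))
    return dcg / ideal if ideal else 0.0
```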

Automatic LLM judges are useful but not sufficient. Keep a calibrated human-labeled set and periodically measure judge agreement, drift, and failure modes.
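
One way to keep an automatic judge honest is to re-score the human-labeled calibration set whenever the judge changes and track agreement over time. A small sketch that computes raw agreement and Cohen's kappa from paired labels; the label values in the docstring are examples, not a fixed taxonomy.

```python
from collections import Counter

def judge_agreement(human: list[str], judge: list[str]) -> dict:
    """Raw agreement and Cohen's kappa between human and judge labels over the
    same examples (e.g. 'supported' / 'unsupported' / 'refusal')."""
    assert human and len(human) == len(judge)
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if the two raters labelled independently at their own base rates.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(h_counts) | set(j_counts)
    expected = sum((h_counts[l] / n) * (j_counts[l] / n) for l in labels)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return {"raw_agreement": observed, "cohen_kappa": kappa}
```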


Acceptance Rules

Each rule is paired with its rationale:

Every factual answer must cite supporting sources. Rationale: enables audit and catches unsupported synthesis.

Citations must support the exact claim, not merely the topic. Rationale: prevents citation laundering.

No-answer cases are first-class tests. Rationale: refusal behavior is part of safe KB operation.

Freshness metadata must be visible to evaluation. Rationale: many KB failures are temporal, not semantic.

Retrieval and generation are scored separately. Rationale: a good answer can hide bad retrieval, and good retrieval can be ruined by generation.

Stale or lower-authority sources cannot override current authoritative sources. Rationale: source governance is part of correctness.

Thresholds are domain-specific and versioned. Rationale: legal, safety, engineering, and general research use have different risk tolerances.

Any prompt, model, chunking, embedding, reranker, or corpus change triggers regression evaluation. Rationale: RAG behavior can change without application code changes.

For safety-relevant domains, require zero known critical unsupported claims in the locked acceptance set. Non-critical quality thresholds can be statistical, but critical hallucinations require root-cause analysis before release.
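
That rule is easy to encode as a hard gate in the release pipeline. The sketch below assumes each graded example carries a risk class, an unsupported-claim flag, and answerability/refusal labels; the field names and thresholds are illustrative, not prescribed values.

```python
def release_gate(results: list[dict],
                 max_unsupported_rate: float = 0.02,
                 min_correct_refusal_rate: float = 0.95) -> tuple[bool, list[str]]:
    """Block release on any critical unsupported claim; apply statistical
    thresholds to the non-critical quality metrics."""
    if not results:
        return (False, ["empty acceptance set"])
    reasons = []

    critical = [r for r in results
                if r["risk_class"] == "critical" and r["unsupported_claim"]]
    if critical:
        reasons.append(f"{len(critical)} critical unsupported claim(s); "
                       "root-cause analysis required before release")

    unsupported_rate = sum(r["unsupported_claim"] for r in results) / len(results)
    if unsupported_rate > max_unsupported_rate:
        reasons.append(f"unsupported-claim rate {unsupported_rate:.3f} exceeds threshold")

    no_answer = [r for r in results if not r["answerable"]]
    if no_answer:
        refusal_rate = sum(r["refused"] for r in no_answer) / len(no_answer)
        if refusal_rate < min_correct_refusal_rate:
            reasons.append(f"correct refusal rate {refusal_rate:.3f} below threshold")

    return (not reasons, reasons)
```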


Test Matrix

Each dimension and its required slices:

Query type: fact lookup, summary, comparison, procedural answer, table extraction, multi-hop synthesis, and recommendation with constraints.

Answerability: answerable, partially answerable, unanswerable, ambiguous, and contradictory evidence.

Source authority: primary source, official documentation, peer-reviewed paper, internal note, stale document, and low-authority source.

Temporality: stable fact, recently changed fact, effective date, superseded standard, future schedule, and historical question.

Entity popularity: head, torso, long-tail, aliases/acronyms, and renamed entities.

Retrieval difficulty: exact keyword, paraphrase, synonym, table-only answer, image/PDF-derived text, and cross-document dependency.

Corpus state: fresh index, stale index, missing document, duplicate document, and conflicting versions.

Response behavior: direct answer, cited answer, uncertainty statement, refusal, and escalation to human review.
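
Slice-level reporting is what makes the matrix actionable: an aggregate score can look healthy while one slice, for example long-tail entities on a stale index, is failing. A sketch of grouping graded results by slice tags follows; the tag names mirror the dimensions above and are assumptions about how test cases are labelled.

```python
from collections import defaultdict

def metrics_by_slice(results: list[dict], dimensions: list[str]) -> dict:
    """Mean score and count per (dimension, slice) pair, so a regression in a
    single slice is visible even when the overall mean holds steady."""
    buckets = defaultdict(list)
    for r in results:
        for dim in dimensions:
            buckets[(dim, r["slices"][dim])].append(r["score"])
    return {
        key: {"mean_score": sum(scores) / len(scores), "n": len(scores)}
        for key, scores in buckets.items()
    }

# Example: metrics_by_slice(graded, ["query_type", "answerability", "temporality"])
```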

Traceability

Each artifact and what it must trace to:

KB safety claim: domain scope, corpus manifest, allowed tools, source authority policy, and refusal policy.

Requirements: grounding, citation, freshness, retrieval recall, answer accuracy, refusal, and escalation requirements.

Evaluation set: requirement IDs, source documents, gold answers, no-answer labels, and risk class.

Runtime logs: query ID, retrieved chunks, generated answer, citations, model/prompt/index versions, and grader result.

Failures: a root-cause category (retrieval miss, chunking issue, stale corpus, source conflict, prompt issue, generator hallucination, or grader error).

Release decision: metric summary, critical failures, accepted residual risks, monitoring plan, and rollback criteria.

Production monitoring should sample real queries into the same taxonomy. Evaluation is not a one-time benchmark; it is a regression system for corpus and model change.
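
A lightweight way to do that is to sample a fixed fraction of production traces on a schedule, tag them with the same slice taxonomy, and route them to the same graders. A hedged sketch follows; `classify_slices` is a stand-in for whatever rule-based or model-based tagger the team uses, not a real API.

```python
import random

def classify_slices(trace: dict) -> dict:
    """Assign query type, answerability, temporality, and other slice tags.
    Stub for illustration; replace with the team's own tagger."""
    raise NotImplementedError

def sample_for_review(traces: list[dict], fraction: float = 0.02, seed: int = 0) -> list[dict]:
    """Deterministically sample production traces and tag them with the
    evaluation taxonomy so they can join the regression set."""
    rng = random.Random(seed)
    sampled = [t for t in traces if rng.random() < fraction]
    for t in sampled:
        t["slices"] = classify_slices(t)
    return sampled
```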


Implementation Notes

  1. Build a small expert-labeled acceptance set before relying on synthetic test generation.
  2. Use CRAG-style categories for factual QA: perfect, acceptable, missing, and incorrect.
  3. Use Ragas-style metrics for context precision, context recall, response relevancy, and faithfulness, but inspect failures manually.
  4. Use ARES-style labeled and unlabeled sets when estimating scores with statistical confidence at larger scale.
  5. Store claim-level citations, not only answer-level citations.
  6. Include negative tests with unsupported premises such as "according to AC X, Y is required" when AC X does not say Y.
  7. Keep temporal prompts explicit: the evaluator should know the current date, document effective date, and whether web freshness is allowed.
  8. Version every component: corpus, chunker, embeddings, retriever, reranker, prompt, generator, tools, and judge.
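
For the last note, a content-addressed fingerprint over every component version makes the regression trigger mechanical: if the fingerprint changes, the locked acceptance set is re-run. A minimal sketch; the component names and version strings are examples, not a fixed schema.

```python
import hashlib
import json

def system_fingerprint(versions: dict) -> str:
    """Stable hash over all versioned components; any change to corpus, chunker,
    embeddings, retriever, reranker, prompt, generator, tools, or judge changes
    the fingerprint and should trigger regression evaluation."""
    canonical = json.dumps(versions, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Illustrative component versions only.
versions = {
    "corpus": "2026-05-01",
    "chunker": "v3",
    "embeddings": "text-embed-v2",
    "retriever": "hybrid-bm25-dense-1.4",
    "reranker": "ce-rank-0.9",
    "prompt": "kb-answer-prompt-12",
    "generator": "model-2026-04",
    "tools": ["doc-search", "kg-lookup"],
    "judge": "judge-prompt-7",
}
print(system_fingerprint(versions))
```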

Sources

Compiled from publicly available research notes and documentation.