Open Source RAG Evaluation

QA-Grade RAG Evaluation

6-layer deep analysis. Find exactly where your RAG pipeline breaks, prove it with evidence, and get actionable fix suggestions.

6-Layer Evaluation Pipeline

Not just a score — a complete diagnosis of what went wrong and why.

Precision@K, Recall@K, NDCG, MRR, Hit Rate — classical IR metrics calculated from relevance labels.
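As a rough illustration of how these metrics can be derived from relevance labels, here is a minimal sketch. The function names and the binary or graded `relevance` list (one label per retrieved item, in rank order) are assumptions for this example and are not taken from the project's code. The NDCG helper builds its ideal ranking from the retrieved list only, which assumes every relevant item appears somewhere in that list.

```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[float], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for r in relevance[:k] if r > 0) / k


def recall_at_k(relevance: Sequence[float], k: int, total_relevant: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if total_relevant <= 0:
        return 0.0
    return sum(1 for r in relevance[:k] if r > 0) / total_relevant


def hit_rate_at_k(relevance: Sequence[float], k: int) -> float:
    """1.0 if at least one relevant item is in the top k, else 0.0."""
    return 1.0 if any(r > 0 for r in relevance[:k]) else 0.0


def reciprocal_rank(relevance: Sequence[float]) -> float:
    """1 / rank of the first relevant item; MRR is the mean of this over queries."""
    for rank, r in enumerate(relevance, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevance: Sequence[float], k: int) -> float:
    """DCG of the top k divided by the DCG of the ideal ordering of the same labels."""
    def dcg(scores: Sequence[float]) -> float:
        return sum(s / math.log2(i + 2) for i, s in enumerate(scores))

    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal > 0 else 0.0
```

For a query with a single retrieved context that is labeled relevant (`relevance = [1]`, `k = 1`, `total_relevant = 1`), every function above returns 1.0.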

LLM-as-judge for faithfulness and answer relevance. Multi-judge consensus with agreement tracking.
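The page does not say how consensus or agreement is computed. One simple way to picture it, under the assumption that each judge returns a categorical verdict, is a majority vote with a pairwise agreement rate. The `consensus` function below is a hypothetical sketch, not the project's implementation.

```python
from collections import Counter
from itertools import combinations


def consensus(verdicts: list[str]) -> tuple[str, float]:
    """Return the majority verdict and the fraction of judge pairs that agree.

    Assumption: ties are broken in favor of the verdict seen first.
    """
    if not verdicts:
        raise ValueError("at least one judge verdict is required")

    majority, _ = Counter(verdicts).most_common(1)[0]

    pairs = list(combinations(verdicts, 2))
    # With a single judge there are no pairs, so agreement is trivially 1.0.
    agreement = sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0
    return majority, agreement
```

A low agreement rate alongside a passing majority verdict is the kind of signal that may justify a manual review of the judged answer.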

Every claim decomposed and mapped to evidence spans. Fuzzy matching with support classification.
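A minimal sketch of fuzzy claim-to-evidence matching is shown below, using `difflib.SequenceMatcher` from the Python standard library. The sentence splitter, the threshold values, and the label names (`supported`, `partial`, `unsupported`) are assumptions chosen for illustration. The page does not document the matcher or the classification scheme actually used.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def classify(score: float, supported: float = 0.8, partial: float = 0.5) -> str:
    """Map a similarity score to a support label (thresholds are hypothetical)."""
    if score >= supported:
        return "supported"
    if score >= partial:
        return "partial"
    return "unsupported"


def match_claim(claim: str, contexts: list[str]) -> dict:
    """Find the context sentence most similar to the claim and classify it."""
    best = {"score": 0.0, "context_index": None, "span": None}
    for idx, context in enumerate(contexts):
        for sentence in split_sentences(context):
            score = SequenceMatcher(None, claim.lower(), sentence.lower()).ratio()
            if score > best["score"]:
                best = {"score": score, "context_index": idx, "span": sentence}
    best["label"] = classify(best["score"])
    return best
```

Character-level similarity is a crude stand-in for semantic support. It is used here only because it is dependency-free and makes the claim-to-span mapping easy to follow.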

17 failure codes with deterministic 6-step cascade. Severity mapping from blocker to none.
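The idea of a deterministic cascade can be sketched as an ordered list of checks where the first match wins. The three rules, their code names, the thresholds, and the intermediate `major` severity below are all hypothetical placeholders. The actual 17 codes and 6 cascade steps are not listed on this page.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    code: str                         # hypothetical failure code
    severity: str                     # "blocker" ... "none"; "major" is assumed
    applies: Callable[[dict], bool]   # predicate over the metric values


# Order matters: earlier rules take precedence, which keeps the result deterministic.
CASCADE = [
    Rule("NO_RELEVANT_CONTEXT", "blocker", lambda m: m["hit_rate"] == 0.0),
    Rule("LOW_RECALL", "major", lambda m: m["recall"] < 0.5),
    Rule("UNSUPPORTED_CLAIMS", "major", lambda m: m["claim_support"] < 0.8),
]


def diagnose(metrics: dict) -> tuple[str, str]:
    """Return (failure_code, severity) for the first rule that fires."""
    for rule in CASCADE:
        if rule.applies(metrics):
            return rule.code, rule.severity
    return "NONE", "none"
```

Because the rules are evaluated in a fixed order, the same metrics always produce the same root cause, even when several rules would match.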

5 cross-metric contradiction rules. Catches when metrics disagree, indicating evaluation issues.
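To illustrate what a cross-metric contradiction rule might look like, the sketch below flags two combinations that should not occur together. Both rules, their names, and the metric keys are assumptions made for this example. The page does not enumerate the 5 rules it refers to.

```python
def find_contradictions(metrics: dict[str, float]) -> list[str]:
    """Return the names of hypothetical contradiction rules that fire."""
    flags = []

    # Recall above zero requires at least one relevant item, so hit rate cannot be zero.
    if metrics["hit_rate"] == 0.0 and metrics["recall"] > 0.0:
        flags.append("recall_without_hit")

    # A judge calling the answer faithful while almost no claims map to evidence
    # suggests that one of the two measurements is unreliable.
    if metrics["faithfulness"] >= 0.9 and metrics["claim_support"] <= 0.2:
        flags.append("faithful_but_unsupported")

    return flags
```

A non-empty result points at a problem with the evaluation itself (labels, judge, or matcher) rather than with the RAG pipeline being evaluated.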

Actionable remediation for each root cause. Prioritized by severity with specific targets.

STEP 1

Submit query, response, and retrieved contexts to the API.
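A request body for this step might look like the sketch below. The field names `query`, `response`, and `contexts` mirror the wording above but are assumptions, and the endpoint is deliberately omitted because the page does not document the API schema or URL.

```python
import json


def build_payload(query: str, response: str, contexts: list[str]) -> dict:
    """Assemble a hypothetical evaluation request (field names are assumed)."""
    return {
        "query": query,
        "response": response,
        "contexts": list(contexts),
    }


payload = build_payload(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    contexts=["Paris is the capital of France."],
)

# Serialize to JSON, ready to send as the body of an HTTP POST request.
body = json.dumps(payload)
```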

STEP 2

The 6-layer pipeline runs: retrieval, claims, generation, root cause, anomalies, and fix suggestions.

STEP 3

Get verdict, root cause diagnosis, and prioritized fix suggestions.

"What is the capital of France?" evaluated against retrieved context.

PASS | Score: 1.00 | 12ms | Free

Precision@K: 1.00
Recall@K: 1.00
MRR: 1.00
NDCG@K: 1.00
Hit Rate@K: 1.00

Claims: 1/1 supported (100%)

PASS