Open Source RAG Evaluation

QA-Grade RAG Evaluation

6-layer deep analysis. Find exactly where your RAG pipeline breaks, prove it with evidence, and get actionable fix suggestions.

6-Layer Evaluation Pipeline

Not just a score — a complete diagnosis of what went wrong and why.

Layer A: Retrieval Metrics

Precision@K, Recall@K, NDCG, MRR, Hit Rate — classical IR metrics calculated from relevance labels.
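
As a rough illustration of how these metrics follow from relevance labels, here is a minimal Python sketch. It assumes binary labels (1 = relevant, 0 = not relevant) listed in ranked order. The function names are illustrative and are not taken from this project's code. Note that the ideal ranking for NDCG is built only from the retrieved labels, which is a simplification.

```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of the top-k ranked results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(relevance[:k]) / k


def recall_at_k(relevance: Sequence[int], k: int, total_relevant: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if total_relevant <= 0:
        return 0.0
    return sum(relevance[:k]) / total_relevant


def hit_rate_at_k(relevance: Sequence[int], k: int) -> float:
    """1.0 if at least one relevant result is in the top-k, else 0.0."""
    return 1.0 if any(relevance[:k]) else 0.0


def mrr(relevance: Sequence[int]) -> float:
    """Reciprocal rank of the first relevant result (0.0 if there is none)."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevance: Sequence[int], k: int) -> float:
    """DCG of the actual ranking divided by DCG of the ideal ranking."""

    def dcg(labels: Sequence[int]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(labels))

    actual = dcg(relevance[:k])
    # Simplification: the ideal ordering uses only the retrieved labels.
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

With a single retrieved context that is relevant (labels `[1]`, K = 1), every function above returns 1.0.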

Layer B: Generation Quality

LLM-as-judge for faithfulness and answer relevance. Multi-judge consensus with agreement tracking.
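
One plausible way to combine several judges is a majority vote that also reports how many judges agreed. The sketch below assumes each judge returns a score in [0, 1]. The 0.5 pass threshold, the function name, and the output keys are illustrative assumptions, not this project's actual scheme.

```python
from collections import Counter
from statistics import mean


def judge_consensus(scores: list[float], threshold: float = 0.5) -> dict:
    """Combine per-judge scores into one verdict and track agreement.

    `threshold` is an assumed cut-off for turning a score into a vote.
    """
    if not scores:
        raise ValueError("at least one judge score is required")

    votes = ["pass" if s >= threshold else "fail" for s in scores]
    # most_common(1) returns the majority vote; on a tie the vote
    # that appeared first wins, so an odd number of judges is safer.
    verdict, majority_count = Counter(votes).most_common(1)[0]

    return {
        "mean_score": mean(scores),
        "verdict": verdict,
        # Share of judges that voted with the majority (1.0 = unanimous).
        "agreement": majority_count / len(votes),
    }
```

A low agreement value is a signal that the verdict rests on a contested judgment and may deserve manual review.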

Layer C: Claim Verification

Every claim decomposed and mapped to evidence spans. Fuzzy matching with support classification.
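
The sketch below shows the general shape of claim-to-evidence matching using only the Python standard library. It is a simplified stand-in under stated assumptions: claims are split naively by sentence (a real decomposer would likely be more careful), fuzzy matching uses `difflib.SequenceMatcher`, and the 0.8 and 0.5 thresholds and the three labels are hypothetical.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text: str) -> list[str]:
    """Naive decomposition: treat each sentence as one claim or span."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def best_span(claim: str, context: str) -> tuple[str, float]:
    """Return the context sentence most similar to the claim, with its ratio."""
    best, best_score = "", 0.0
    for sentence in split_sentences(context):
        score = SequenceMatcher(None, claim.lower(), sentence.lower()).ratio()
        if score > best_score:
            best, best_score = sentence, score
    return best, best_score


def classify(score: float, supported: float = 0.8, partial: float = 0.5) -> str:
    """Map a similarity score to an (assumed) support label."""
    if score >= supported:
        return "supported"
    if score >= partial:
        return "partial"
    return "unsupported"


def verify(answer: str, contexts: list[str]) -> list[dict]:
    """Map every claim in the answer to its best evidence span."""
    results = []
    for claim in split_sentences(answer):
        span, score = max(
            (best_span(claim, c) for c in contexts),
            key=lambda pair: pair[1],
            default=("", 0.0),  # no contexts means no evidence
        )
        results.append(
            {"claim": claim, "evidence": span, "score": score, "label": classify(score)}
        )
    return results
```

Character-level similarity only catches near-verbatim support. Paraphrased evidence would need a semantic comparison, which this sketch does not attempt.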

Layer D: Root Cause Cascade

17 failure codes with a deterministic 6-step cascade. Severity mapping from blocker to none.
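
A deterministic cascade can be pictured as an ordered list of checks where the first match wins, so the same metrics always produce the same diagnosis. The sketch below is illustrative only: the four failure codes, the thresholds, the metric keys, and the intermediate severities ("major", "minor") are hypothetical placeholders, not the project's actual 17 codes or 6 steps.

```python
from typing import Callable, NamedTuple


class Diagnosis(NamedTuple):
    code: str
    severity: str


# Hypothetical checks, ordered from upstream causes to downstream ones.
# An upstream failure (nothing retrieved) explains downstream symptoms,
# so it is tested first.
CASCADE: list[tuple[Callable[[dict], bool], Diagnosis]] = [
    (lambda m: m["num_contexts"] == 0, Diagnosis("NO_CONTEXT_RETRIEVED", "blocker")),
    (lambda m: m["recall_at_k"] < 0.5, Diagnosis("LOW_RETRIEVAL_RECALL", "major")),
    (lambda m: m["claim_support_rate"] < 0.8, Diagnosis("UNSUPPORTED_CLAIMS", "major")),
    (lambda m: m["answer_relevance"] < 0.7, Diagnosis("OFF_TOPIC_ANSWER", "minor")),
]


def diagnose(metrics: dict) -> Diagnosis:
    """Walk the cascade in order and return the first matching diagnosis."""
    for check, diagnosis in CASCADE:
        if check(metrics):
            return diagnosis
    return Diagnosis("NONE", "none")
```

Because the checks run in a fixed order, a retrieval failure is reported as the root cause even when generation metrics are also poor.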

Anomaly Detection

5 cross-metric contradiction rules. Flags cases where metrics disagree with each other, which points to a problem in the evaluation itself.
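
To make the idea of a contradiction rule concrete, here is a small sketch with three example rules. These rules, the metric keys, and the thresholds are assumptions chosen for illustration and are not the project's actual five rules.

```python
def detect_anomalies(m: dict) -> list[str]:
    """Return a message for every (hypothetical) rule the metrics violate."""
    anomalies = []

    # Recall above zero requires at least one relevant hit in the top-k.
    if m["hit_rate_at_k"] == 0 and m["recall_at_k"] > 0:
        anomalies.append("recall is positive but hit rate is zero")

    # Precision above zero also requires at least one relevant hit.
    if m["hit_rate_at_k"] == 0 and m["precision_at_k"] > 0:
        anomalies.append("precision is positive but hit rate is zero")

    # A judge calling the answer faithful while most claims lack evidence
    # suggests the judge or the claim matcher is unreliable.
    if m["faithfulness"] >= 0.9 and m["claim_support_rate"] <= 0.5:
        anomalies.append("high faithfulness score but few claims are supported")

    return anomalies
```

An empty list means the metrics are mutually consistent under these rules. It does not by itself mean the answer is good.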

Fix Suggestions

Actionable remediation for each root cause. Prioritized by severity with specific targets.

How It Works

STEP 1

Send

Submit query, response, and retrieved contexts to the API.
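
A request body for this step might look like the sketch below. The three fields come from the description above, but the exact field names, the class name, and the JSON shape are assumptions. The API's endpoint and schema are not given here, so the sketch only builds the payload and does not send it.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvaluationRequest:
    """Hypothetical payload: the query, the model's response, and the contexts."""

    query: str
    response: str
    contexts: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the request to a JSON string ready to be posted."""
        return json.dumps(asdict(self))


request = EvaluationRequest(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
    contexts=["Paris is the capital and largest city of France."],
)
payload = request.to_json()
```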

STEP 2

Evaluate

The 6-layer pipeline runs: retrieval metrics, generation quality, claim verification, root cause cascade, anomaly detection, and fix suggestions.

STEP 3

Fix

Get verdict, root cause diagnosis, and prioritized fix suggestions.

Sample Evaluation

"What is the capital of France?" evaluated against retrieved context.

Verdict: PASS | Score: 1.00 | 12ms | Free

Precision@K: 1.00
Recall@K: 1.00
MRR: 1.00
NDCG@K: 1.00
Hit Rate@K: 1.00

Claims: 1/1 supported (100%) | PASS