QA-Grade RAG Evaluation
6-layer deep analysis. Find exactly where your RAG pipeline breaks, prove it with evidence, and get actionable fix suggestions.
6-Layer Evaluation Pipeline
Not just a score — a complete diagnosis of what went wrong and why.
Layer A: Retrieval Metrics
Precision@K, Recall@K, NDCG, MRR, Hit Rate — classical IR metrics calculated from relevance labels.
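As a rough illustration of these metrics, the sketch below computes them for a single query from binary relevance labels in rank order. The function name, argument names, and returned keys are assumptions made for this example, not the product's API. MRR is the mean of the reciprocal rank over many queries, so only the per-query reciprocal rank appears here.

```python
import math

def retrieval_metrics(relevance, k, total_relevant=None):
    """Classical IR metrics for one query (illustrative sketch).

    relevance: 0/1 labels for the retrieved chunks, in rank order.
    total_relevant: relevant chunks that exist for the query; if omitted,
    it is assumed to equal the number of relevant chunks retrieved.
    """
    top = relevance[:k]
    hits = sum(top)
    if total_relevant is None:
        total_relevant = sum(relevance)

    precision = hits / k if k else 0.0
    recall = hits / total_relevant if total_relevant else 0.0

    # Reciprocal rank of the first relevant chunk within the top k.
    reciprocal_rank = 0.0
    for rank, rel in enumerate(top, start=1):
        if rel:
            reciprocal_rank = 1.0 / rank
            break

    # NDCG with binary gains: DCG divided by the DCG of an ideal ranking.
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(top, start=1))
    ideal_hits = min(total_relevant, k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg else 0.0

    return {
        "precision_at_k": precision,
        "recall_at_k": recall,
        "reciprocal_rank": reciprocal_rank,
        "ndcg_at_k": ndcg,
        "hit_rate": 1.0 if hits else 0.0,
    }

metrics = retrieval_metrics([1, 0, 1, 0], k=3, total_relevant=2)
```

With the labels above, two of the top three chunks are relevant, so Precision@3 is 2/3 and Recall@3 is 1.0.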
Layer B: Generation Quality
LLM-as-judge for faithfulness and answer relevance. Multi-judge consensus with agreement tracking.
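One way multi-judge consensus with agreement tracking could work is sketched below. It assumes each judge returns a score in [0, 1]. The averaging rule, the pairwise agreement measure, and the `tolerance` parameter are assumptions chosen for illustration, not the product's documented method.

```python
from statistics import mean

def judge_consensus(scores, tolerance=0.2):
    """Combine per-judge scores into a consensus (illustrative sketch).

    scores: dict mapping a judge name to its score in [0, 1].
    tolerance: hypothetical threshold; two judges "agree" when their
    scores differ by at most this amount.
    """
    values = list(scores.values())
    # All unordered pairs of judge scores.
    pairs = [(a, b) for i, a in enumerate(values) for b in values[i + 1:]]
    agreeing = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return {
        "score": mean(values),
        "agreement": agreeing / len(pairs) if pairs else 1.0,
        "spread": max(values) - min(values),
    }

result = judge_consensus({"judge_a": 0.9, "judge_b": 0.8, "judge_c": 0.3})
```

A low agreement value alongside a middling consensus score is a hint that the judges disagree and the score should be read with caution.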
Layer C: Claim Verification
Every claim decomposed and mapped to evidence spans. Fuzzy matching with support classification.
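The idea can be sketched with the standard library alone. The example below treats each sentence of the response as a claim and each sentence of the retrieved contexts as a candidate evidence span, then scores pairs with `difflib.SequenceMatcher`. Sentence-level decomposition, the thresholds, and the label names are assumptions for this sketch; a production system would likely use an LLM or a stronger matcher for decomposition.

```python
import re
from difflib import SequenceMatcher

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def verify_claims(response, contexts, supported_at=0.8, partial_at=0.5):
    """Map each claim to its best-matching evidence span (illustrative)."""
    spans = [s for ctx in contexts for s in split_sentences(ctx)]
    results = []
    for claim in split_sentences(response):
        best_span, best_score = None, 0.0
        for span in spans:
            score = SequenceMatcher(None, claim.lower(), span.lower()).ratio()
            if score > best_score:
                best_span, best_score = span, score
        # Hypothetical thresholds for the support classification.
        if best_score >= supported_at:
            label = "supported"
        elif best_score >= partial_at:
            label = "partial"
        else:
            label = "unsupported"
        results.append({"claim": claim, "evidence": best_span,
                        "score": best_score, "label": label})
    return results

results = verify_claims(
    "Paris is the capital of France. Zebras sleep upright.",
    ["Paris is the capital of France. It is on the Seine."],
)
```

The first claim matches a context sentence exactly and is labeled supported; the second has no close match in the context and is not.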
Layer D: Root Cause Cascade
17 failure codes assigned by a deterministic 6-step cascade. Severity levels range from blocker to none.
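A deterministic cascade can be pictured as an ordered list of checks where the first match wins, so the same metrics always yield the same diagnosis. The sketch below shows four steps only. The failure codes, the thresholds, and the intermediate severity names ("major", "minor") are all hypothetical, since the actual 17 codes and 6 steps are not listed here.

```python
def diagnose(metrics):
    """Return the first failure found, checking earlier stages first.

    All codes and thresholds below are hypothetical placeholders.
    """
    cascade = [
        ("NO_RELEVANT_CONTEXT", "blocker", lambda m: m["hit_rate"] == 0.0),
        ("LOW_RETRIEVAL_PRECISION", "major", lambda m: m["precision_at_k"] < 0.5),
        ("UNSUPPORTED_CLAIMS", "major", lambda m: m["faithfulness"] < 0.7),
        ("OFF_TOPIC_ANSWER", "minor", lambda m: m["answer_relevance"] < 0.7),
    ]
    for code, severity, failed in cascade:
        if failed(metrics):
            return {"code": code, "severity": severity}
    return {"code": "NONE", "severity": "none"}

diagnosis = diagnose({
    "hit_rate": 1.0,
    "precision_at_k": 0.8,
    "faithfulness": 0.4,
    "answer_relevance": 0.9,
})
```

Because retrieval checks come first, a retrieval failure is reported as the root cause even when generation metrics are also poor.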
Anomaly Detection
5 cross-metric contradiction rules. Catches cases where metrics disagree with each other, which points to a problem in the evaluation itself.
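A contradiction rule is a predicate over two or more metrics that should not hold at the same time. The two rules below are hypothetical examples written for this sketch; the five actual rules are not listed here, and the metric names and thresholds are assumptions.

```python
def find_contradictions(metrics):
    """Return the names of all contradiction rules that fire (illustrative)."""
    rules = [
        # An answer judged faithful although no relevant context was retrieved.
        ("faithful_without_evidence",
         lambda m: m["faithfulness"] >= 0.8 and m["hit_rate"] == 0.0),
        # The judge rates the answer faithful, but few claims map to evidence.
        ("judge_vs_claims_mismatch",
         lambda m: m["faithfulness"] >= 0.8 and m["supported_claim_ratio"] <= 0.2),
    ]
    return [name for name, fires in rules if fires(metrics)]

anomalies = find_contradictions({
    "faithfulness": 0.9,
    "hit_rate": 0.0,
    "supported_claim_ratio": 0.5,
})
```

Unlike the root cause cascade, every rule is evaluated here, since several contradictions can be present at once.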
Fix Suggestions
Actionable remediation for each root cause. Prioritized by severity with specific targets.
How It Works
Send
Submit query, response, and retrieved contexts to the API.
Evaluate
The 6-layer pipeline runs: retrieval metrics, generation quality, claim verification, root cause cascade, anomaly detection, and fix suggestions.
Fix
Get verdict, root cause diagnosis, and prioritized fix suggestions.
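The Send step could look something like the sketch below, which only builds the JSON body for one evaluation request. The field names (`query`, `response`, `contexts`) mirror the three inputs named above but are assumptions; the endpoint and authentication are omitted because they are not specified here.

```python
import json

def build_request(query, response, contexts):
    """Serialize one evaluation request (field names are assumed)."""
    payload = {
        "query": query,
        "response": response,
        "contexts": list(contexts),  # retrieved chunks, in rank order
    }
    return json.dumps(payload)

body = build_request(
    "What is the capital of France?",
    "Paris is the capital of France.",
    ["Paris is the capital and largest city of France."],
)
```

The resulting string can be sent as the body of an HTTP POST with any client once the actual endpoint and schema are known.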
Sample Evaluation
"What is the capital of France?" evaluated against retrieved context.