RAG Evaluation Dashboard
BETA
Quality metrics and technique comparisons for a production RAG system
📚 Data Source
FAA Airplane Flying Handbook - Official aviation documentation covering VFR procedures, emergency protocols, and flight operations.
Ingested with PDF parsing → Chunked into semantic sections → Indexed for hybrid search (dense embeddings + BM25) → Stored in a Qdrant vector database
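The hybrid search step combines a dense ranking with a BM25 ranking. This page does not say how the two are merged, so the sketch below assumes reciprocal rank fusion (RRF), one common choice. The function name, the chunk IDs, and the constant `k=60` are illustrative assumptions, not the system's actual configuration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) per chunk; k=60 is a conventional
    default and an assumption here, not a value taken from the dashboard.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk IDs returned by each retriever for one query.
dense_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)  # c1 comes first: both retrievers rank it near the top
```

RRF uses only rank positions, so dense similarity scores and BM25 scores never need to be put on a common scale.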
Technique Comparison
A/B testing different retrieval strategies on the same queries
| Technique | Faithfulness | Relevance | Latency | Cost |
|---|---|---|---|---|
| Hybrid Retrieval | 94% | 92% | 1950ms | $0.003/query |
| Dense Only | 86% | 84% | 1520ms | $0.002/query |
| With Reranking | 96% | 95% | 2450ms | $0.004/query |
Winner: Hybrid Retrieval
Best balance of quality (94% faithfulness, 92% relevance) and cost ($0.003/query). Reranking adds 2 percentage points of faithfulness and 3 of relevance, but increases latency by about 26% (1950 ms to 2450 ms) and cost by 33%.
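The trade-off between Hybrid Retrieval and With Reranking can be recomputed from the table. The snippet below is a small arithmetic check; the dictionary layout and the helper name `relative_change` are illustrative, while the numbers are copied from the table above.

```python
# Values copied from the Technique Comparison table.
results = {
    "Hybrid Retrieval": {"faithfulness": 94, "relevance": 92, "latency_ms": 1950, "cost_usd": 0.003},
    "Dense Only": {"faithfulness": 86, "relevance": 84, "latency_ms": 1520, "cost_usd": 0.002},
    "With Reranking": {"faithfulness": 96, "relevance": 95, "latency_ms": 2450, "cost_usd": 0.004},
}


def relative_change(new, old):
    """Percentage change from old to new."""
    return (new - old) / old * 100


hybrid = results["Hybrid Retrieval"]
rerank = results["With Reranking"]

# Quality metrics are already percentages, so differences are in percentage points.
faithfulness_gain = rerank["faithfulness"] - hybrid["faithfulness"]
relevance_gain = rerank["relevance"] - hybrid["relevance"]

# Latency and cost are compared as relative increases.
latency_increase = relative_change(rerank["latency_ms"], hybrid["latency_ms"])
cost_increase = relative_change(rerank["cost_usd"], hybrid["cost_usd"])

print(f"Faithfulness: +{faithfulness_gain} points, relevance: +{relevance_gain} points")
print(f"Latency: +{latency_increase:.1f}%, cost: +{cost_increase:.1f}%")
```

Reporting quality gains in percentage points and latency and cost as relative increases keeps the two kinds of comparison from being confused.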
Query-Level Results
Detailed breakdown of test queries
What are VFR fuel requirements?
Technique: Hybrid (Dense + BM25)
What are VFR fuel requirements?
Technique: Dense only
Engine failure procedures
Technique: Hybrid (Dense + BM25)
Engine failure procedures
Technique: Dense only
Evaluation Methodology
Metrics Used
- Faithfulness: Are answers grounded in retrieved context?
- Answer Relevancy: Does the answer match the question?
- Context Precision: Are retrieved chunks relevant?
- Context Recall: Did we retrieve all relevant chunks?
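The two context metrics can be illustrated with a simplified, set-based calculation. This sketch assumes each query has a hand-labeled set of relevant chunk IDs. Evaluation frameworks often use rank-weighted or LLM-judged variants, so it should not be read as the exact formula behind the numbers on this page. The function names and chunk IDs are hypothetical.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 0.0
    retrieved_set = set(retrieved)
    found = sum(1 for chunk_id in relevant if chunk_id in retrieved_set)
    return found / len(relevant)


# Hypothetical example: 3 chunks retrieved, 4 chunks labeled relevant.
retrieved = ["c1", "c2", "c3"]
relevant = {"c1", "c3", "c4", "c5"}

precision = context_precision(retrieved, relevant)  # 2 of 3 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 4 relevant were retrieved
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Precision penalizes irrelevant chunks in the context window, while recall penalizes relevant chunks the retriever missed, so the two are usually reported together.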
Test Results
- ✅ 3 queries tested with comprehensive framework
- ✅ Context Precision: 88.9% (Target: >80%)
- ✅ Context Utilization: 77.8% (Target: >70%)
- ⭐ Best query: 99.2% overall (100% faithfulness)