RAG Evaluation Dashboard

BETA

Quality metrics and technique comparisons for a production RAG system

📚 Data Source

FAA Airplane Flying Handbook - Official aviation documentation covering VFR procedures, emergency protocols, and flight operations.

Ingested using advanced PDF parsing → Chunked into semantic sections → Indexed for hybrid search (dense embeddings + BM25) → Stored in a Qdrant vector database
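The pipeline above does not say how the dense and BM25 result lists are merged. One common option is reciprocal rank fusion (RRF); the sketch below assumes that choice. The function name, the constant k=60, and the chunk ids are illustrative and are not taken from this system.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first.

    Assumption: RRF is used here only as an example fusion method.
    Each list contributes 1 / (k + rank) to a chunk's fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical top-3 results from each retriever for one query.
dense_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Chunks that appear in both lists accumulate score from each, so they tend to rise above chunks found by only one retriever.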

Overall Score: 93% (above the 85% threshold)
Faithfulness: 94% (answers grounded in sources)
Relevance: 92% (answers match questions)
Citations: 94% (verified sources)

Technique Comparison

A/B testing different retrieval strategies on the same queries

Technique          Faithfulness  Relevance  Latency  Cost
Hybrid Retrieval   94%           92%        1950ms   $0.003/query
Dense Only         86%           84%        1520ms   $0.002/query
With Reranking     96%           95%        2450ms   $0.004/query

Winner: Hybrid Retrieval

Best balance of quality (94% faithfulness) and cost ($0.003/query). Reranking adds 2 percentage points of faithfulness and 3 of relevance, but increases latency by about 26% (1950ms to 2450ms) and cost by 33%.
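The trade-off behind this verdict can be recomputed from the comparison table. The sketch below copies the three rows as reported; the dictionary layout and helper names are illustrative. Quality gains are reported in percentage points, while latency and cost are reported as relative changes.

```python
# Rows copied from the Technique Comparison table.
results = {
    "Hybrid Retrieval": {"faithfulness": 94, "relevance": 92, "latency_ms": 1950, "cost_usd": 0.003},
    "Dense Only": {"faithfulness": 86, "relevance": 84, "latency_ms": 1520, "cost_usd": 0.002},
    "With Reranking": {"faithfulness": 96, "relevance": 95, "latency_ms": 2450, "cost_usd": 0.004},
}


def relative_change(new, base):
    """Relative change of new versus base, in percent."""
    return (new - base) / base * 100.0


def compare(candidate, baseline):
    """Quality deltas in percentage points; latency and cost deltas in percent."""
    c, b = results[candidate], results[baseline]
    return {
        "faithfulness_points": c["faithfulness"] - b["faithfulness"],
        "relevance_points": c["relevance"] - b["relevance"],
        "latency_pct": round(relative_change(c["latency_ms"], b["latency_ms"]), 1),
        "cost_pct": round(relative_change(c["cost_usd"], b["cost_usd"]), 1),
    }


delta = compare("With Reranking", "Hybrid Retrieval")
print(delta)
```

Calling `compare("Hybrid Retrieval", "Dense Only")` gives the corresponding picture for the step from dense-only to hybrid retrieval.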

Query-Level Results

Detailed breakdown of test queries

What are VFR fuel requirements?

Technique: Hybrid (Dense + BM25)

Faithfulness: 94%
Relevance: 92%
Latency: 1850ms
Citations: 5/5

What are VFR fuel requirements?

Technique: Dense only

Faithfulness: 87%
Relevance: 88%
Latency: 1420ms
Citations: 4/5

Engine failure procedures

Technique: Hybrid (Dense + BM25)

Faithfulness: 96%
Relevance: 95%
Latency: 2100ms
Citations: 6/6

Engine failure procedures

Technique: Dense only

Faithfulness: 84%
Relevance: 82%
Latency: 1650ms
Citations: 4/6
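The four query cards above can be rolled up per technique with a few lines of Python. This is a minimal sketch: the field names are illustrative, and reading "Citations 5/5" as verified citations over total citations is an assumption. The averages cover only the queries shown here, so they need not match the comparison table, which may draw on a different query set.

```python
from statistics import mean

# Values copied from the query-level cards above.
query_results = [
    {"query": "What are VFR fuel requirements?", "technique": "Hybrid",
     "faithfulness": 94, "relevance": 92, "latency_ms": 1850, "verified": 5, "total": 5},
    {"query": "What are VFR fuel requirements?", "technique": "Dense only",
     "faithfulness": 87, "relevance": 88, "latency_ms": 1420, "verified": 4, "total": 5},
    {"query": "Engine failure procedures", "technique": "Hybrid",
     "faithfulness": 96, "relevance": 95, "latency_ms": 2100, "verified": 6, "total": 6},
    {"query": "Engine failure procedures", "technique": "Dense only",
     "faithfulness": 84, "relevance": 82, "latency_ms": 1650, "verified": 4, "total": 6},
]


def summarize(rows):
    """Average each metric per technique and pool the citation counts."""
    groups = {}
    for row in rows:
        groups.setdefault(row["technique"], []).append(row)
    summary = {}
    for technique, group in groups.items():
        summary[technique] = {
            "faithfulness": mean(r["faithfulness"] for r in group),
            "relevance": mean(r["relevance"] for r in group),
            "latency_ms": mean(r["latency_ms"] for r in group),
            "citation_rate": sum(r["verified"] for r in group) / sum(r["total"] for r in group),
        }
    return summary


summary = summarize(query_results)
for technique, stats in summary.items():
    print(technique, stats)
```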

Evaluation Methodology

Metrics Used

  • Faithfulness: Are answers grounded in retrieved context?
  • Answer Relevancy: Does the answer match the question?
  • Context Precision: Are retrieved chunks relevant?
  • Context Recall: Did we retrieve all relevant chunks?
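The two retrieval metrics in this list have simple set-based forms, sketched below. This is an illustration under assumptions: it presumes a labeled set of relevant chunk ids per query, and it is not necessarily how this dashboard computes its scores, since evaluation frameworks often use rank-weighted or LLM-judged variants. Faithfulness and answer relevancy usually need a judge model and are not sketched here. All names and chunk ids are hypothetical.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant_set)
    return hits / len(retrieved)


def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    retrieved_set = set(retrieved)
    hits = sum(1 for chunk_id in relevant if chunk_id in retrieved_set)
    return hits / len(relevant)


# Hypothetical example: 4 chunks retrieved, 3 chunks labeled relevant.
retrieved = ["c1", "c2", "c3", "c4"]
relevant = ["c1", "c3", "c8"]

print(context_precision(retrieved, relevant))
print(context_recall(retrieved, relevant))
```

Precision penalizes irrelevant chunks in the context window, while recall penalizes relevant chunks that never reached it, so the two can move in opposite directions as the number of retrieved chunks changes.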

Test Results

  • 3 queries tested with the comprehensive evaluation framework
  • ✅ Context Precision: 88.9% (Target: >80%)
  • ✅ Context Utilization: 77.8% (Target: >70%)
  • ⭐ Best query: 99.2% overall (100% faithfulness)
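The pass marks in this list amount to comparing each measured value against its target. A minimal sketch of that check, using the figures reported on this page (the 85% overall threshold comes from the summary at the top); the dictionary keys and function name are illustrative:

```python
# Targets and measured values as reported on this page, in percent.
targets = {"overall_score": 85.0, "context_precision": 80.0, "context_utilization": 70.0}
measured = {"overall_score": 93.0, "context_precision": 88.9, "context_utilization": 77.8}


def check_targets(measured, targets):
    """Return {metric: True/False}; a metric passes when it exceeds its target."""
    return {name: measured[name] > target for name, target in targets.items()}


status = check_targets(measured, targets)
for name, passed in status.items():
    print(f"{name}: {'pass' if passed else 'FAIL'} ({measured[name]}% vs target >{targets[name]}%)")
```

A check like this could gate a deployment, failing the run when any metric drops to or below its target.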