
Conclusion

Key Takeaways

Based on our experiment findings, we would like to highlight four key takeaways for readers:

1. Recognize the importance of systematic evaluation

Our experiments outline a general approach to building and evaluating RAG applications in context-specific use cases. Once again, we stress that our experiment findings are not meant to be prescriptive statements about which methods universally improve RAG performance. Instead, we hope readers find our experiments instructive in designing and iterating on their own RAG systems. As with any well-planned machine learning task, having a clear use case, appropriately defined metrics, and a relevant evaluation dataset enables developers to make systematic, data-driven decisions about how RAG performance can be improved.

2. Understand the problem space

Our experiments also demonstrate how modelling demands vary significantly across use cases, even though RAG is a technology applicable in many domains. For Hansard data, issues are more likely to span multiple documents, while for judiciary data, highly technical language places greater demands on LLMs’ semantic capabilities. These differences had distinct implications for the pipelines we built:

  1. MAP@K was more relevant than MRR@K for Hansard, since we needed to capture information across multiple documents.
  2. Recursive and hybrid search performed better for judiciary data due to the higher prevalence of keywords.
  3. Complexity in legal jargon meant that strong retriever performance didn’t translate to strong generative performance, creating an impetus for LLM fine-tuning.

3. Prioritise low-hanging fruit

We also recommend that readers approach pipeline building systematically, prioritising improvements that are easier to implement. Cohere’s Rerank API is a prime example of this - with Langchain integration, it offers performance improvements in a few lines of code without adding much pipeline complexity (see the sketch below). Another example is managed services like AWS Bedrock, which provide an excellent starting point for a readily deployable RAG solution.
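As a rough illustration, the snippet below wraps an existing retriever with Cohere’s reranker using Langchain’s ContextualCompressionRetriever. Here, `base_retriever` is a placeholder for any retriever you have already configured, a COHERE_API_KEY environment variable is assumed, and depending on your Langchain version the `CohereRerank` class may instead be imported from the `langchain_cohere` package.

```python
# A minimal sketch, assuming an existing Langchain retriever (`base_retriever`)
# and a COHERE_API_KEY set in the environment.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Rerank the retrieved chunks and keep only the five most relevant ones
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=CohereRerank(top_n=5),
    base_retriever=base_retriever,  # placeholder: any configured retriever
)

reranked_docs = reranking_retriever.get_relevant_documents(
    "What was said about healthcare subsidies?"  # illustrative query
)
```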

4. Trust, but verify

It is typical for any emergent area of technology to be surrounded by hype. We recommend that readers approach new ideas in RAG with an open mind, but always verify that those ideas improve performance in their own use case rather than taking academic findings at face value. This is only possible with a systematic approach to RAG evaluation, as highlighted in point (1). For example, Hypothetical Document Embeddings (HyDE) presents an intuitive and compelling way to improve retriever performance, but in our experiments it was detrimental to performance. This is not to say that HyDE is never a good solution, but that, in the context of Hansard and judiciary data, hypothetical documents led to greater semantic distance between queries and documents.
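For readers who want to run this kind of check themselves, the sketch below shows one way to generate HyDE-style query embeddings with Langchain’s HypotheticalDocumentEmbedder. The model choices, prompt key, and sample query are illustrative assumptions; the resulting embeddings can be swapped into a retriever and compared against a plain-embedding baseline using your evaluation metrics.

```python
# A minimal sketch of HyDE in Langchain, assuming OPENAI_API_KEY is set.
# An LLM drafts a hypothetical answer, which is embedded in place of the raw query.
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    base_embeddings=OpenAIEmbeddings(model="text-embedding-ada-002"),
    prompt_key="web_search",  # one of the built-in HyDE prompt templates
)

# Embed a query via its hypothetical answer; compare retrieval metrics
# with and without HyDE before adopting it.
query_vector = hyde_embeddings.embed_query("What sentencing principles did the court apply?")
```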

Reference Architectures

In addition to our key takeaways, we also provide two reference architectures, one more complex than the other, for custom Langchain RAG pipelines based on our experiment results. These reference architectures serve as a starting point for readers seeking guidance on where to begin their custom pipeline development so that they may iterate more quickly.

Reference #1: Simple RAG

This simple architecture is aimed at getting users started quickly. As such, it requires only a local Python environment, and no components are CSP-dependent. A minimal sketch in Langchain follows the component list.

  • Preprocessing: Langchain RecursiveCharacterTextSplitter
    • Splits on paragraphs and line breaks, which helps preserve some semantic structure
  • Embedding: OpenAI text-embedding-ada-002
    • Simple to gain free trial access via OpenAI
  • Indexing: ChromaDB
    • Open-source vector database readily available on Langchain which requires minimal setup and offers local persistence
  • Retrieval: Semantic search
    • ChromaDB supports semantic search by default
  • Completion: OpenAI GPT-3.5
    • Also available for free trial access via OpenAI
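The sketch below wires these components together in Langchain. It assumes an OPENAI_API_KEY in the environment and a local corpus.txt file standing in for your own documents; the paths, chunk sizes, and retrieval k are illustrative rather than recommended values.

```python
# A minimal sketch of Reference #1, assuming OPENAI_API_KEY is set and
# corpus.txt stands in for your own documents.
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# Preprocessing: split on paragraphs and line breaks
documents = TextLoader("corpus.txt").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# Embedding + indexing: ada-002 vectors persisted locally in ChromaDB
vectordb = Chroma.from_documents(
    chunks,
    OpenAIEmbeddings(model="text-embedding-ada-002"),
    persist_directory="./chroma_db",
)

# Retrieval + completion: semantic search feeding GPT-3.5
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
)
print(qa_chain.invoke({"query": "What was debated about transport fares?"})["result"])
```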

Reference #2: Advanced RAG

This architecture requires access to some cloud resources but provides better scalability overall, enabling more systematic and collaborative experimentation. While the overall pipeline is technically more complex than #1, the components remain easy to incorporate using Langchain. A sketch of the hybrid retrieval stage follows the component list.

  • Preprocessing: Langchain RecursiveCharacterTextSplitter
  • Embedding: Fine-tuned embeddings
    • Better performance, but requires fine-tuning the embedding model on sentence pairs from your own documents
  • Indexing: AWS OpenSearch
    • Requires more setup than ChromaDB but highly scalable and enables consistency if working in teams
  • Retrieval: Hybrid Search + Cohere Reranker
    • Hybrid search parameters must be tuned but can offer better results if keywords are common in target corpus
    • Cohere Reranker demonstrated improvements in our Hansard and Judiciary experiments and is extremely simple to implement in Langchain
  • Completion: Fine-tuned completion model
    • The completion model can be fine-tuned to answer questions more accurately given the correct context, but training data is required
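Standing up OpenSearch and fine-tuned models is beyond a short snippet, but the hybrid retrieval idea can be illustrated with Langchain’s EnsembleRetriever, which blends BM25 keyword scores with semantic search. This is a stand-in for an OpenSearch hybrid query rather than our exact setup; the weights are illustrative and must be tuned per corpus, and the reranker from the earlier snippet can wrap the resulting retriever.

```python
# A minimal sketch of hybrid retrieval, reusing `chunks` and `vectordb` from the
# Reference #1 sketch; this stands in for an OpenSearch hybrid query, not the
# exact production setup.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword side: BM25 over the same chunks (requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10

# Semantic side: the existing vector store retriever
semantic_retriever = vectordb.as_retriever(search_kwargs={"k": 10})

# Hybrid search: blend keyword and semantic rankings; weights must be tuned per corpus
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.4, 0.6],  # illustrative values
)
docs = hybrid_retriever.get_relevant_documents("appeal against sentence for drug trafficking")
```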

Suggestions for Performing Evaluations

Drawing from our experiments, we also provide the following suggestions for readers who want to structure their own RAG experiments and evaluation:

  • Identify the key evaluation metrics relevant to your use case (e.g., answer relevancy, faithfulness, retrieval recall)
  • Prepare a representative evaluation dataset with queries and ground truth answers
    • LLMs can be used to generate synthetic datasets and to perform evaluation¹
    • Synthetic datasets should be prepared with the RAG use case in mind - ideally, they should be representative of prospective user queries
  • Implement evaluation functions or use existing libraries to compute the chosen metrics (a sketch of MRR@K and MAP@K follows this list)
  • Perform systematic experiments to evaluate each pipeline component being considered and use metrics to support data-driven decisions towards an optimal pipeline
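As a concrete starting point, the sketch below implements the two retrieval metrics discussed earlier, MRR@K and MAP@K, over ranked lists of retrieved document IDs. The sample data at the bottom is purely illustrative.

```python
# A minimal sketch of MRR@K and MAP@K, assuming each query has a ranked list of
# retrieved document IDs and a set of ground-truth relevant IDs.
from typing import Sequence, Set


def mrr_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def map_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Average precision over the top k, rewarding retrieval of all relevant documents."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k)


# Illustrative evaluation set: (retrieved IDs in rank order, ground-truth IDs)
eval_set = [(["d3", "d1", "d9"], {"d1", "d2"}), (["d2", "d4", "d1"], {"d1", "d2"})]
print(sum(mrr_at_k(r, g, k=3) for r, g in eval_set) / len(eval_set))
print(sum(map_at_k(r, g, k=3) for r, g in eval_set) / len(eval_set))
```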

Future Work

In this playbook, we have explored component-based modifications to build better RAG pipelines. Moving forward, we will explore the use of agents and knowledge graphs to further improve performance. The playbook will be updated to reflect the results of our exploration.


  1. While using LLMs for evaluation has become mainstream, research into LLM self-bias is still emerging. Some studies, such as https://arxiv.org/pdf/2404.13076, suggest that LLMs tend to prefer their own outputs.