RAG Challenges

Many developers, both within GovTech and the wider public service, have built RAG applications using their own pipelines. However, only a small proportion of these applications have been put into production for large-scale usage. Many end up as prototypes used by a small handful of users (<10). Some reasons for this include:

1. Lack of Infrastructure Support

It is easy to build an RAG application using an OpenAI account and open-source libraries like LangChain and LlamaIndex, and to host the app on a development machine (e.g. a laptop). However, to deploy something in production, one would need access to an Azure OpenAI account within the Government Commercial Cloud (GCC) when using GPT models, or to models hosted by other Cloud Service Providers (e.g. Gemini on Google Cloud Platform, or open-source HuggingFace models hosted using AWS SageMaker). Not all developers have access to these accounts as they require a subscription.
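To illustrate just how little code such a prototype needs, here is a minimal sketch of a laptop-hosted RAG pipeline. It assumes LlamaIndex (>= 0.10), an OPENAI_API_KEY set in the environment, and a local ./data folder of documents; these specifics are illustrative and not taken from the original text.

```python
# Minimal RAG prototype: index a local folder of documents and answer a question.
# Assumes `pip install llama-index` and OPENAI_API_KEY exported in the shell.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load every file in ./data (a plain folder acts as the "knowledge base").
documents = SimpleDirectoryReader("data").load_data()

# Embed and index the documents in memory, then expose a query interface.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("Summarise the key points of the policy documents.")
print(response)
```

A handful of lines like these is often the entire prototype; everything that follows in this section is about the gap between this and a production system.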

Secondly, a simple prototype can be built with everything running from a development machine. A folder on the machine can be used to store the documents, which can be added, modified, and removed manually when necessary. For production-grade systems, proper knowledge bases supporting CRUD (Create, Read, Update, Delete) operations are needed to ensure the consistency of the source of truth. Again, some developers might not have access to such knowledge bases, either on GCC or on a server.
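As an illustration of what CRUD support over a knowledge base looks like in practice, the sketch below uses Chroma, an open-source vector store, purely as a stand-in; the collection name, document IDs, and document contents are hypothetical, and a production knowledge base would typically sit behind a managed service rather than a local file path.

```python
# Illustrative CRUD operations on a document store, using Chroma as a stand-in.
# Assumes `pip install chromadb`; all names and contents below are hypothetical.
import chromadb

client = chromadb.PersistentClient(path="./kb")           # create/open a local store
collection = client.get_or_create_collection("policies")

# Create: add a document to the knowledge base.
collection.add(ids=["doc-001"], documents=["Childcare leave is capped at 6 days per year."])

# Read: retrieve the documents most similar to a query.
results = collection.query(query_texts=["How many days of childcare leave?"], n_results=1)

# Update: replace the content when the source document changes.
collection.update(ids=["doc-001"], documents=["Childcare leave is capped at 6 days per calendar year."])

# Delete: remove documents that are withdrawn from the source of truth.
collection.delete(ids=["doc-001"])
```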

Thirdly, unlike a prototype, which can run on a single machine, production-grade applications need to run on multiple machines or scale on demand as the number of users increases. These machines, whether physical or virtual, might not be accessible to the developers and might be costly to acquire, run, and maintain.

Fortunately, developers facing this category of problems can make use of GovTech’s Central LLM Tech Stack to build and deploy their RAG applications. Details on how to use the LLM Tech Stack can be found here.

2. Lack of Software Engineering and DevOps Expertise

Software frameworks like LangChain and LlamaIndex have made it very easy for anyone with basic coding skills to build an RAG application. However, productionising an application requires a more extensive set of skills than building a prototype does.

After acquiring the resources mentioned in the point above, DevOps engineers would need to set up the infrastructure, and software engineers would need to write the code such that the app can scale well. Unit tests have to be written, and data engineering work might also be needed to set up a pipeline that continuously pulls data from specified data sources and efficiently transforms it into a format the RAG application can use. Unfortunately, not all RAG app developers (e.g. business analysts and data scientists) have such proficiencies. Without proper engineering support, it will be a challenge for them to bring their application from prototype to production.
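As a small example of the kind of engineering work involved, the sketch below shows pytest-style unit tests for a document chunking step. Both the `chunk_text` helper and the chunk sizes are hypothetical stand-ins for whatever the real pipeline uses; they are not taken from any particular framework.

```python
# test_chunking.py -- illustrative unit tests for a hypothetical chunking helper.
# Run with `pytest test_chunking.py`; chunk_text is a stand-in for the real pipeline step.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunker with overlap, used here only for illustration."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def test_no_chunk_exceeds_size():
    chunks = chunk_text("x" * 1000, chunk_size=200, overlap=50)
    assert all(len(c) <= 200 for c in chunks)

def test_consecutive_chunks_overlap():
    text = "".join(chr(65 + i % 26) for i in range(1000))
    chunks = chunk_text(text, chunk_size=200, overlap=50)
    for left, right in zip(chunks, chunks[1:]):
        assert left[-50:] == right[:50]  # the tail of one chunk starts the next
```

Tests like these are cheap to run in a CI pipeline on every change, which is exactly the kind of practice that separates a prototype from a production system.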

3. Change Management and Integration

Most RAG applications built to date, especially within the public service, do not run autonomously. Instead, they act as assistants to help humans complete their tasks more efficiently. Some RAG applications see delays in productionisation because changes are needed to existing workflows, and sometimes extensive discussions are required to sort this out. For example, the workflow might involve multiple steps, some manual and others AI-assisted. How should the user experience be designed to integrate these steps and make the user more effective? The RAG application cannot be productionised before the overall architecture design is confirmed. What usually happens is that the prototype is deployed as a standalone system: as an interim solution, users deviate from the existing workflow, get what they need from the RAG application, and continue the workflow from the application’s response.

4. Poor Performance, including Challenges in Evaluating Performance

Another reason RAG applications are not productionised is that the users deem the results “not good enough”. As previously mentioned, with current software libraries and frameworks it is easy to build an RAG prototype: putting in around 20% of the effort gets you an RAG application with 80% of the achievable performance, and this can be done simply by following a cookbook or tutorial. Improving the performance by the remaining 20% would require the other 80% of the effort. For example, the dataset has to be studied in more detail, and optimisations have to be made to various components of the pipeline.

To determine what is “good enough” and what is not, and whether changes made to the pipeline have resulted in any performance improvement, we first need a way to objectively measure the performance of an RAG pipeline. Unfortunately, most RAG application developers do not have a proper evaluation framework for assessing the performance of their RAG output with their users. Evaluation is often done qualitatively: changes are made, users are asked to try out the application, and they give their feedback. Feedback received in this manner is often overly specific, focusing on individual cases instead of the overall performance of the system.
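One lightweight way to make this measurement repeatable is to maintain a small labelled evaluation set and score each pipeline version against it. The sketch below assumes a hypothetical `answer_question` wrapper around the RAG pipeline that returns the generated answer together with the IDs of the retrieved documents; the metrics (retrieval hit rate and a crude keyword check) are illustrative, not a recommended standard.

```python
# Illustrative evaluation harness for an RAG pipeline.
# `answer_question(question) -> (answer_text, retrieved_doc_ids)` is a hypothetical
# wrapper around whatever pipeline is being evaluated.
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    relevant_doc_ids: set[str]    # ground-truth documents that should be retrieved
    expected_keywords: list[str]  # terms a correct answer is expected to mention

def evaluate(answer_question, cases: list[EvalCase]) -> dict[str, float]:
    retrieval_hits, keyword_passes = 0, 0
    for case in cases:
        answer, retrieved_ids = answer_question(case.question)
        if case.relevant_doc_ids & set(retrieved_ids):
            retrieval_hits += 1   # at least one relevant document was retrieved
        if all(k.lower() in answer.lower() for k in case.expected_keywords):
            keyword_passes += 1   # crude proxy for answer correctness
    n = len(cases)
    return {
        "retrieval_hit_rate": retrieval_hits / n,
        "keyword_pass_rate": keyword_passes / n,
    }
```

Running the same set of cases after every pipeline change produces a trend that can be discussed objectively, which is far harder to get from ad hoc user feedback.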

Without a proper evaluation metric and acceptance criteria, RAG applications will almost never get deployed: since it is practically impossible to build a perfect RAG application, users will find errors at every evaluation iteration and the developer will be asked to fix them. In addition, manual evaluation is tedious and time-consuming, and every evaluation iteration can take weeks or even months, leading to further delays in productionising the application.