What is Retrieval Augmented Generation (RAG)?

(Figure: RAG architecture)

Today, all the leading Large Language Models (LLMs) are built on the Transformer architecture. They are trained on vast amounts of Internet data and other text sources, and they are very good at predicting the next token based on their existing knowledge. The model uses this knowledge to generate responses to the user's questions. When we ask questions about our proprietary data, we don't get accurate answers, because the model was never trained on that data. Real-world, production-grade LLM applications, however, need access to internal, custom, proprietary data, so the model lacks this context-related information. This is where the RAG technique comes to the rescue: it enhances the capabilities of a pre-trained model.

RAG is a form of context engineering: supplying additional information and knowledge to the model through the prompt. The data fed into the model is typically custom, real-time, fast-changing, and dynamic.

The RAG pipeline gives the pre-trained model access to a knowledge base of custom data. The retriever component fetches the relevant chunks from a vector database, and the top semantically matching chunks are injected into the LLM's prompt along with your question. As a result, we get answers to more personalized queries, as sketched below.
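The injection step amounts to little more than assembling a prompt from the retrieved chunks and the user's question. Here is a minimal sketch in Python; retrieve_top_chunks and call_llm are hypothetical placeholders for your vector-store query and LLM client, not part of any specific library:

```python
# Minimal sketch of injecting retrieved context into the LLM prompt.
# retrieve_top_chunks and call_llm are hypothetical helpers standing in
# for your vector-store query and LLM client.

def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Combine the retrieved context with the user's question into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Usage (with the hypothetical helpers):
# chunks = retrieve_top_chunks("What is our refund policy?", k=3)
# answer = call_llm(build_rag_prompt("What is our refund policy?", chunks))
```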

To implement a RAG pipeline, we first split the dataset into semantic chunks, often called documents. These chunks are stored as vector embeddings, generated by an embedding model, in a vector database such as Pinecone, Chroma, or Redis Vector. To fetch the most relevant chunks from the database, we convert the user's question into a vector embedding and perform a semantic search, which retrieves the context-relevant information. The retrieved context, along with the user's question, is then sent as the prompt to the LLM to generate the response.
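The sketch below ties these steps together using Chroma's in-memory client (one of the vector databases mentioned above); Chroma embeds documents and queries with its default embedding model. The sample documents, question, and the commented-out call_llm are illustrative placeholders only:

```python
# Minimal end-to-end RAG pipeline sketch with Chroma (in-memory client).
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="proprietary_docs")

# 1) Split the dataset into semantic chunks and store them as embeddings.
chunks = [
    "Our refund window is 30 days from the date of purchase.",
    "Support is available 24/7 via chat and email.",
]
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# 2) Convert the user question into an embedding and run a semantic search.
question = "How long do customers have to request a refund?"
results = collection.query(query_texts=[question], n_results=2)
top_chunks = results["documents"][0]

# 3) Send the retrieved context plus the question to the LLM.
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {question}"
# answer = call_llm(prompt)  # hypothetical LLM call
print(prompt)
```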

There are two common variants:

1) Traditional RAG – as described in the figure above.

2) Agentic RAG – combines the RAG pipeline with an AI agent. Agentic systems are more powerful and can make tool calls to access external, real-time data sources (see the sketch after this list).
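Below is a highly simplified sketch of the agentic pattern: the agent routes the question either to the RAG retriever or to an external real-time tool before generating an answer. Every helper here is a hypothetical stand-in; a real agent would let the LLM choose the tool via native tool calling.

```python
# Simplified agentic RAG sketch. All helpers are hypothetical stand-ins
# for a real retriever, an external tool, and an LLM client.

def retrieve(question: str, k: int = 3) -> list[str]:
    """Hypothetical vector-store retriever (see the pipeline sketch above)."""
    return ["Our refund window is 30 days from the date of purchase."]

def fetch_live_prices(question: str) -> str:
    """Hypothetical tool that calls an external real-time data source."""
    return "ACME stock: 123.45 USD (live quote)"

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's client."""
    return f"[LLM answer generated from a {len(prompt)}-character prompt]"

def agent_answer(question: str) -> str:
    # Toy routing rule; a real agent lets the LLM pick the tool itself.
    if "price" in question.lower() or "stock" in question.lower():
        context = fetch_live_prices(question)      # real-time tool call
    else:
        context = "\n".join(retrieve(question))    # RAG retrieval
    return call_llm(f"Context:\n{context}\n\nQuestion: {question}")

print(agent_answer("What is the current ACME stock price?"))
```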

RAG has some drawbacks too: it adds latency to inference and significantly increases the LLM's context window usage; nonetheless, the implementation is not complex.

RAG is especially useful when you need access to real-time or external data sources; if you mostly have static, historical data, fine-tuning the LLM could be an option to consider instead.

By integrating the RAG pipeline with an LLM, we get the best of both worlds: access to the LLM's massive general knowledge and access to our proprietary data.

RAG adds significant value for enterprise use cases. Building and deploying such systems is far less complex than building or fine-tuning our own model, which is fueling rapid adoption of Generative AI across various industries.

