LLM 101: RAG is all you need

Introducing RAG

I've explained it briefly in past LLM 101 posts on this blog already. But just to paint in some finer details:

RAG, or Retrieval-Augmented Generation, is a technique used to enhance the capabilities of Large Language Models (LLMs) by incorporating external knowledge sources. It addresses common LLM limitations, such as outdated information and the tendency to produce inaccurate "hallucinated" content, by retrieving relevant information from external data sources and incorporating it into the LLM's context.
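
To make that concrete, here is a minimal sketch of the retrieve-then-generate loop in Python. The embed, vector_index, and llm_complete helpers are hypothetical placeholders for whatever embedding model, vector store, and LLM API a given system uses:

# Minimal sketch of the RAG flow: retrieve supporting text, then
# prepend it to the user's question before calling the LLM.
# embed, vector_index, and llm_complete are hypothetical helpers
# standing in for a real embedding model, index, and LLM API.

def answer_with_rag(question: str, top_k: int = 3) -> str:
    query_vector = embed(question)                       # 1. embed the query
    passages = vector_index.search(query_vector, top_k)  # 2. retrieve relevant passages
    context = "\n\n".join(p.text for p in passages)      # 3. build the context block
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_complete(prompt)                          # 4. generate a grounded answer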

RAG is commonly used to build Generative Q&A applications or systems by infusing a private or custom knowledge base of documents with a foundation model or LLM. Some products or services label this emerging architecture as "embedding-based RAG", by the way.

Why RAG?

RAG is particularly valuable in domain-specific applications, where LLMs can be "taught" to specialize in a particular subject matter or search across specific enterprise datasets or documents.

Anyhow! RAG system architectures feature a critical design component and mechanism called the "retriever." It's pretty neat.

The retriever's job is to find relevant documents from a large text corpus. These documents provide context to the Large Language Model (LLM) during the generation process.

In simple terms, the retriever helps the LLM access external data, like text or images that have been embedded and stored in vector indexes. Vector indexes make it quick to retrieve the most relevant documents from a corpus, which are then used by the LLM to answer user queries.
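
As a rough illustration of what a vector index does, here is a toy in-memory index that ranks documents by cosine similarity. Production systems use learned embedding models and approximate nearest-neighbour libraries rather than the made-up vectors below:

import numpy as np

class ToyVectorIndex:
    """In-memory vector index: stores one embedding per document and
    returns the top-k most similar documents by cosine similarity."""

    def __init__(self):
        self.vectors = []    # list of np.ndarray embeddings
        self.documents = []  # parallel list of document texts

    def add(self, text: str, embedding: np.ndarray) -> None:
        self.vectors.append(embedding / np.linalg.norm(embedding))
        self.documents.append(text)

    def search(self, query_embedding: np.ndarray, top_k: int = 3) -> list[str]:
        query = query_embedding / np.linalg.norm(query_embedding)
        scores = np.array(self.vectors) @ query   # cosine similarity per document
        best = np.argsort(scores)[::-1][:top_k]   # indices of the best matches
        return [self.documents[i] for i in best]

# Usage with made-up 4-dimensional embeddings; a real system would use
# an embedding model (hundreds to thousands of dimensions) instead.
index = ToyVectorIndex()
index.add("RAG combines retrieval with generation.", np.array([0.9, 0.1, 0.0, 0.2]))
index.add("Cats sleep most of the day.",             np.array([0.0, 0.8, 0.5, 0.1]))
print(index.search(np.array([1.0, 0.0, 0.1, 0.3]), top_k=1))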

So, in practice, both vector indexes and retrievers are crucial components of larger RAG-based LLM systems: they enable efficient retrieval of relevant documents from a large corpus, which are then used as context for the LLM to generate answers to user queries.

The interaction between vector indexes and retrievers can be designed and combined in various ways. Many researchers and practitioners within AI right now are exploring different combinations, and measuring how such combinations impact RAG accuracy levels and downstream LLM tasks.


Typical architecture of LLM-based RAG systems

High-level example of an LLM-based RAG system utilizing a Foundation Model provider's API and LLM services

RAG system architectures often combine a "retrieval component" or sub-system together with a foundational LLM generation module.

Major foundation model companies have open-sourced embedding models and various API interfaces, and software development kits (SDKs) like LangChain have incorporated the RAG system architecture into their core capabilities and offerings.
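
Stitched together, the combination looks roughly like the snippet below. This is a hedged sketch assuming the OpenAI Python SDK; the model name and the "retriever" object are illustrative placeholders rather than any specific product's API:

# Sketch of pairing a retriever with a foundation model provider's API.
# Assumes the OpenAI Python SDK; the model name and the retriever
# object are illustrative placeholders, not a specific product's API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rag_answer(question: str, retriever, top_k: int = 3) -> str:
    passages = retriever.search(question, top_k=top_k)   # retrieval component
    context = "\n\n".join(passages)
    response = client.chat.completions.create(           # generation component
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Answer strictly from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content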

As mentioned above, the overall effectiveness of a RAG system depends on the "retrieval" component's ability to identify relevant context passages and in parallel, the LLM's ability to exploit these passages faithfully and contextually.

The retrieval component of RAG systems can be dense or sparse, and it is responsible for retrieving relevant passages or documents from an Information Retrieval (IR) system. This process is different from a traditional relational database query, as the retrieval system does not store or manage data in memory in a structured way like a database. Instead, it retrieves information based on "relevance to a given query".
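
The sparse/dense distinction largely comes down to how relevance is scored. Here is a deliberately simplified contrast; real sparse retrievers use BM25-style term weighting, and real dense retrievers use learned embedding models:

# Simplified illustration of the two retrieval flavours, scoring one
# document against one query.

def sparse_score(query: str, document: str) -> int:
    # Sparse: relevance from exact term overlap between query and document.
    return len(set(query.lower().split()) & set(document.lower().split()))

def dense_score(query_vec, doc_vec) -> float:
    # Dense: relevance from similarity of embedding vectors.
    return sum(q * d for q, d in zip(query_vec, doc_vec))

print(sparse_score("what is retrieval augmented generation",
                   "Retrieval Augmented Generation adds external context"))
print(dense_score([0.2, 0.7, 0.1], [0.3, 0.6, 0.0]))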

The retrieved information, at this point, can be considered a form of data that is used as input to the LLM component of the RAG system. The retriever component can therefore be seen as providing data to the LLM in a way that is similar to how a database provides data to a query - but don't be confused - the retriever component is not strictly a database.

As it stands, the architectures of LLM-based RAG systems are also being enhanced by incorporating a "rank head" mechanism that assesses the relevance of retrieved documents or dataset segments. These improvements in architecture and training significantly outperform previous RAG approaches because they act like recommendation systems.
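
In practice this often looks like a two-stage pipeline: a fast first-stage retriever over-fetches candidates, and a separate relevance scorer re-orders them before they reach the LLM. A hedged sketch, where relevance_model stands in for whatever cross-encoder or rank-head model is used:

# Illustrative reranking step: first-stage retrieval over-fetches
# candidates, then a relevance scorer re-orders them before they reach
# the LLM. relevance_model is a hypothetical callable that returns a
# query-document relevance score.

def rerank(query: str, candidates: list[str], relevance_model, keep: int = 3) -> list[str]:
    scored = [(relevance_model(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest relevance first
    return [doc for _, doc in scored[:keep]]

# Usage: over-fetch 20 candidates from the vector index, keep the best 3.
# top20 = vector_index.search(embed(query), top_k=20)
# context_docs = rerank(query, top20, relevance_model, keep=3)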

The RAGGED framework, for example, is one benchmark for analyzing and optimizing different RAG configurations and approaches.

Additional metrics exist for measuring how well RAG systems rank and retrieve relevant documents or dataset segments, such as Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG).

MRR is a metric that evaluates how quickly a ranking system can present the first relevant item to the user. It focuses on the position of the first relevant item in the list of recommendations.

NDCG is a metric that evaluates the quality of the ranking by considering both the relevance of items and their positions in the list. It compares the actual ranking to an ideal ranking where the most relevant items are at the top.

Both MRR and NDCG are used to evaluate the performance of ranking models in various applications, including Visual Dialog systems and Query Auto-Completion. They are complementary metrics, as MRR focuses on the rank of the first correct answer, while NDCG takes into account the relevance of all the correct answers.
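
For concreteness, here is a small sketch of how both metrics are commonly computed (this uses the linear-gain form of DCG; some implementations use 2^relevance - 1 as the gain instead):

import math

def mrr(ranked_relevance_per_query: list[list[int]]) -> float:
    """Mean Reciprocal Rank: average of 1/position of the first relevant
    item in each query's ranked results; 0 if no relevant item is found."""
    reciprocals = []
    for ranking in ranked_relevance_per_query:
        rr = 0.0
        for position, relevant in enumerate(ranking, start=1):
            if relevant:
                rr = 1.0 / position
                break
        reciprocals.append(rr)
    return sum(reciprocals) / len(reciprocals)

def ndcg(relevances: list[float]) -> float:
    """Normalized Discounted Cumulative Gain for one ranked list of
    graded relevance scores (higher = more relevant)."""
    def dcg(scores):
        return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(scores, start=1))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# First relevant hit at rank 2 for query 1 and rank 1 for query 2 -> MRR = 0.75
print(mrr([[0, 1, 0], [1, 0, 0]]))
# A perfectly ordered list scores 1.0; this slightly shuffled one scores less.
print(ndcg([3, 2, 3, 0, 1]))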

However, there are a few key learnings:

  • RAG is only as good as the retrieved documents’ relevance, density, and detail. Ensure the documents you fuse together with an LLM are detailed and textually dense for better results.
  • Like traditional recommendation systems, the rank of retrieved items within a RAG system has a significant impact on how the combined LLM performs on downstream tasks, such as answering follow-up questions.
  • If two documents are equally relevant and ranked highly, it's important to ensure the retrieval component prefers the document that's more concise and has fewer extraneous details over the other.

RAG vs Fine-tuning

Both RAG and finetuning can be used to incorporate new information into LLMs and increase performance on specific tasks or workstreams. However, the specific benefits and advantages of each approach depend on the context and the nature of the domain-specific knowledge.

There appears to be research indicating that RAG shows greater performance boosts compared to finetuning (FT) in developing AI-driven knowledge-based systems.

According to one key study, RAG-based constructions are, on average, more efficient than models produced with FT in terms of ROUGE, BLEU, and cosine similarity scores. The study also outlines a simple RAG-based architecture that outperforms FT models by 16% in terms of the ROUGE score, 15% in the case of the BLEU score, and 53% based on the cosine similarity.

Additionally, other papers indicate the practical effects of RAG over fine-tuning as well. One study mentions that RAG systems are particularly useful in enterprise settings or any domain where knowledge is constantly refreshed and cannot be memorized within an LLM. Keeping retrieval indices up-to-date in RAG systems is easier and more cost-effective than continuous pretraining or fine-tuning methods.

But wait, there is a second advantage: the FIT-RAG paper discusses the ability to address problematic documents in retrieval indices. If retrieval indices contain documents with toxic or biased content, RAG systems allow for easily dropping or modifying the offending documents, ensuring a more controlled and safe environment for information retrieval.

Long Context vs RAG

Even as new LLMs arrive with much larger context windows (take Gemini 1.5 with its 10M-token context, a game-changer for business use cases such as analyzing multiple documents or chatting with multiple PDFs at once), RAG still has a role to play.

There will be times when it makes sense to point an LLM at an enterprise's entire documentation or knowledge bank. But it comes at a cost: it's heavy on inference and difficult to reason with correctly right now. So there will certainly remain other times where utilizing the RAG architecture will prove more useful, agile, nimble and cheaper.

Examples of LLM-based RAG products or services:

  • Google's new NotebookLM
  • Notion AI workspace search
  • Unstructured AI bespoke RAG system for Enterprises
  • Open-Source projects like LLM-Search
  • Building LLMs on Databricks article is a fun read