
What is RAG (Retrieval-Augmented Generation)?

Building AI-powered knowledge retrieval systems leveraging advanced search techniques

For years, AI models have run into the same problem: when a model finishes training, its knowledge base is frozen in time. Models do not learn in real time when you chat with them; once training ends, the model can't pick up new knowledge on its own.

RAG was designed to solve this problem. By the time you're finished with this guide, you'll be able to answer the following questions.

  • Why can’t AI models learn in real time?
  • What is RAG?
  • How does RAG address the knowledge cutoff?
  • How do RAG systems actually work?

How pretraining works and AI knowledge cutoffs

Before we get into RAG, we need to understand how AI training actually works. Models can't learn on the fly, so AI training is typically broken into the following steps.

  1. Data Preparation: You need to select your data sources. These could be anything from web data to academic papers.
  2. Model Selection: You need to choose an architecture to use for your AI model. Modern LLMs are almost always based on the Transformer architecture and more specifically, GPT (the architecture, not OpenAI’s model branding).
  3. Training the Model: The model trains on its data over a series of passes. Depending on the model and dataset, this could be one or two passes, or even thousands.
  4. Fine-Tuning and Validation: Model outputs are tested and weights are adjusted to fine-tune the model’s output. This step is repeated until the model is finished.
  5. Testing and Deployment: After proper testing, models are deployed for real world usage. The training and fine-tuning are finished.

In the process above, models are deployed with a static knowledge base. They are completely unaware of anything outside their training data. AI models cannot learn in real time when talking to you.

Retrieval-Augmented Generation (RAG) definition

Retrieval-Augmented Generation (RAG) was originally proposed in 2020 to help address the limitations of pretrained AI models. RAG uses an external medium, such as a vector database, to give a model access to external knowledge. This is called retrieval. The model's prompt context is then augmented with this new data. Using this external knowledge base, the model generates better output.

RAG is often used to:

  • Personalize chatbots
  • Guard against hallucinations
  • Give models access to real time data
  • Customize models for a specific purpose (e.g. customer service or web scraping)

The applications of RAG are pretty much endless. If a model has good inference but static knowledge, RAG helps bridge the gap.

Why use RAG?

RAG allows us to expand our AI models with external knowledge. All sorts of systems use at least some form of RAG for better model output.

Imagine you’ve got a virtual assistant completely severed from the outside world. It knows its training data. It knows the current conversation (only pieces of it, due to context limits). It knows nothing of the outside world because it isn’t connected to any external sources.

We’ll call this model bland and boring — Bab for short.

  • “Hey Bab, what’s the weather today?” -> “I don’t have access to that information.”
  • “Hey Bab, what are the current trends in AI?” -> “As of my last training update in 2021…”
  • “Hey Bab, how do I troubleshoot my 5G mobile connection?” -> “As of 2021, 5G data connections are becoming more widely available. That said, there’s not much data for troubleshooting their connectivity. Would you like troubleshooting tips for 4G instead?”

RAG gives Bab external data to answer with better context.

  • “Hey Bab, what’s the weather today?” -> Checks the weather -> “Today will be hot and sunny.”
  • “Hey Bab, what are the current trends in AI?” -> Checks news sites -> “As of 2025, RAG is one of the hottest topics in AI development.”
  • “Hey Bab, how do I troubleshoot my 5G mobile connection?” -> Reads documentation -> “5G data connections operate with incredible speed. However, their reception areas are quite small. If you’re traveling, see if the connection improves as you move. If you’re at home, try enabling WIFI calling for better stability.”

RAG applies to any external data source. It’s not limited to internet searches; in fact, it works best when the model searches a vector database.

RAG expands the model’s knowledge base. Something as bland and boring as Bab can become a useful digital assistant.
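The before-and-after difference can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the `knowledge_store` dict plays the role of an external data source, the keyword lookup stands in for vector search, and the string formatting stands in for a real generator model.

```python
# Minimal sketch of the RAG idea: retrieve an external fact, augment the
# prompt with it, then generate. All names here are illustrative.

knowledge_store = {
    "weather": "Today will be hot and sunny.",
    "trends": "As of 2025, RAG is one of the hottest topics in AI development.",
}

def retrieve(query: str) -> str:
    """Return the most relevant fact (keyword match stands in for vector search)."""
    for topic, fact in knowledge_store.items():
        if topic in query.lower():
            return fact
    return ""

def answer(query: str) -> str:
    """Augment the prompt with retrieved context before generation."""
    context = retrieve(query)
    if not context:
        return "I don't have access to that information."
    # A real system would pass this augmented prompt to an LLM.
    return f"Based on current data: {context}"

print(answer("Hey Bab, what's the weather today?"))
```

Without an entry in the store, the model falls back to Bab's original "I don't have access" behavior; with one, its answer is grounded in fresh data.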

Core components of RAG systems

RAG System Diagram

RAG systems have two main parts — a retriever and a generator. However, these parts need to be connected. We use orchestration to handle the pieces so the system actually works.

Retriever

The retriever can pull data from almost any external source. Most commonly, it’s a vector database. In this context, vectors are numeric embeddings: lists of numbers that capture the meaning of a piece of text, which makes them ideal for machine comparison and similarity search.

When Bab performs a search for the weather, it may be through a weather API, a direct site lookup or something entirely different.

This brings us to our next question: how does API data end up in a vector database? Ideally, your database is actively maintained. In this scenario, the system running Bab would need to query the weather regularly, perhaps every hour, and then update the weather entry in the vector database.

Just to make the concept easy, we’ll stick with “hot and sunny” for the weather.
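Under the hood, retrieval from a vector store is a similarity search. Here is a toy sketch: documents are stored as small numeric vectors, and the closest one by cosine similarity wins. The 3-dimensional vectors are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

# Hypothetical in-memory "vector DB": text mapped to toy embeddings.
docs = {
    "weather report: hot and sunny": [0.9, 0.1, 0.0],
    "user profile: name is Jake":    [0.0, 0.9, 0.1],
    "5G troubleshooting guide":      [0.1, 0.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, k=1):
    """Return the k documents most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

# A query vector that happens to sit close to the weather document.
print(retrieve([0.8, 0.2, 0.0]))
```

In a real system, the query vector would come from embedding the user's question with the same model used to embed the documents.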

Generator

Once the retriever has found the correct information, the generator can churn out an answer. This is where generative models really excel in RAG systems.

If the model has no other data in the vector store, it might say, “Today is going to be hot and sunny.”

Perhaps the vector store holds more information. Maybe it knows the user’s name. Bab could say, “Good morning Jake! Today’s weather is going to be hot and sunny.”

We can even take this a step further. Perhaps the user prefers colorful language in their output and the model is aware of this due to the vector store. “Top of the mornin’ to ya, Jake! Today’s going to be a real scorcher!”

The role of the generator is pretty simple in concept — generate output. When using RAG, its output is related to the vector store.
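The augmentation step itself is mostly prompt assembly. A minimal sketch, assuming a hypothetical `build_prompt` helper: retrieved facts are stitched into the prompt so the generator has real context to work with.

```python
# Sketch of how retrieved context shapes the generator's prompt.
# A real system would send this string to an LLM API.

def build_prompt(question: str, retrieved: list[str]) -> str:
    """Combine retrieved facts and the user's question into one prompt."""
    context = "\n".join(f"- {fact}" for fact in retrieved)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What's the weather today?",
    ["Today will be hot and sunny.", "The user's name is Jake."],
)
print(prompt)
```

The more relevant facts the retriever supplies (weather, the user's name, style preferences), the more personalized the generated answer can be.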

Orchestrator

This is the layer that makes RAG actually work. Sure, you could manually look up the weather, inject it into a prompt and then finish the prompt by asking your model, “What’s the weather for today?” That said, it defeats the purpose of the RAG system itself.

Orchestration handles this process for us. Through our orchestration layer, the model is connected to tools and a memory store. When Bab addresses the user directly, “Good morning Jake!”, the orchestrator passes this memory into the generator.

That said, retrieval isn’t limited to the vector store. If the model sees that the weather information is missing or stale, it needs to perform a search of its own. Bab would do this using external tooling.

We’re starting to see standardization at the orchestration layer. The most prominent example is the Model Context Protocol (MCP). With an MCP server, your model is “plugged in” to an external tool through the orchestration layer. In our case with Bab, this is likely a weather API or something similar. Inputs, outputs and stored data are chained together so the model can handle its task with proper context.

Pretend Bab is getting weather information through an MCP server.

  1. Bab makes an ‘info’ or ‘help’ query to the tool.
  2. The tool sends back a short summary of its methods and how to use them.
  3. Bab fetches the weather using the tool.
  4. Bab checks the user’s name within the vector store and generates our response, “Top of the mornin’ to ya, Jake! Today’s going to be a real scorcher!”

It’s not a part of the original RAG design, but without orchestration, the LLM wouldn’t be able to access its tools and memories.
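The orchestrator's freshness check can be sketched as a small decision loop. Everything here is a hypothetical stand-in: `fetch_weather_tool` plays the role of the external tool (a weather API behind an MCP server), and the dicts stand in for the memory store and cache.

```python
import time

# Hypothetical orchestrator state: long-term memory plus a cached tool result.
MAX_AGE_SECONDS = 3600  # refresh weather data once it is an hour old
memory = {"user_name": "Jake"}
cache = {"weather": ("hot and sunny", time.time())}

def fetch_weather_tool() -> str:
    """Stand-in for calling a real weather API through an MCP tool."""
    return "hot and sunny"

def get_weather() -> str:
    """Use the cached value if fresh; otherwise retrieve via the external tool."""
    value, fetched_at = cache.get("weather", ("", 0))
    if time.time() - fetched_at > MAX_AGE_SECONDS:
        value = fetch_weather_tool()
        cache["weather"] = (value, time.time())
    return value

def respond() -> str:
    """Pass both stored memory and retrieved data into the generated reply."""
    return f"Good morning {memory['user_name']}! Today's weather is {get_weather()}."

print(respond())
```

This is the essence of orchestration: the model never calls the weather API directly; the layer around it decides when retrieval is needed and feeds the result into generation.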

Building a RAG pipeline step by step

  1. Collect Your Data: Before Bab starts generating custom outputs, you need to find data for it to work with. This could come from FAQs, articles, docs or even scraped web data.
  2. Chunk and Embed It: Large portions of text need to be chunked into pieces that make sense, then each chunk is converted into an embedding vector. For perspective, GPT-4o has a context limit of 128,000 tokens (~90,000 words), so a 1,000 page book won’t fit all at once.
  3. Store In A Vector DB: Take these chunks and store them in a vector database. Common tools include LlamaIndex, Weaviate and Pinecone. This is the optimal place for your model to search information.
  4. Connect a Retriever: Wire a simple connection — such as an API — from your vector DB to your model. This is how your model retrieves data from your storage.
  5. Connect the Generator: Pipe results from the retriever into the LLM so it has real context to work with. Models don’t always need to retrieve, but it should be possible with every prompt.
  6. Orchestrate the Workflow: When the user asks a question, the model needs to decide whether to run retrieval or just answer from memory.
  7. Test and Tune: Talk to Bab. See what it gets right and what it gets wrong. Adjust your configuration — chunk sizes, retriever configuration and basic prompt templates as needed.

Advanced search techniques

Basic retrieval is just the start of RAG. For best results, Bab needs to level up the way it searches through its data sources.

Most RAG systems use semantic search, not just basic keyword matching. Bab’s vector database is what makes semantic search possible. Old-fashioned keyword matching still has its uses, but they’re limited. Combining both strategies gets Bab more relevant answers.
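Hybrid search can be sketched as a weighted blend of the two scores. The semantic scores below are made-up numbers standing in for vector similarity, and the 50/50 `alpha` weighting is illustrative, not tuned.

```python
# Hybrid search sketch: blend keyword overlap with a (pretend) semantic
# similarity score and rank by the weighted sum.

docs = {
    "weather forecast hot sunny": 0.9,   # pretend semantic similarity to the query
    "history of umbrellas": 0.2,
    "sunny beach vacation deals": 0.5,
}

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query words that appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def hybrid_rank(query: str, alpha: float = 0.5) -> list[str]:
    """Rank documents by alpha * semantic + (1 - alpha) * keyword score."""
    scored = [
        (alpha * sem + (1 - alpha) * keyword_score(query, doc), doc)
        for doc, sem in docs.items()
    ]
    return [doc for _, doc in sorted(scored, reverse=True)]

print(hybrid_rank("sunny weather today")[0])
```

Tuning `alpha` shifts the balance: closer to 1 trusts semantic similarity, closer to 0 trusts exact keyword matches.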

Filtering and metadata

Finding matches is one thing — narrowing them down is another. Semantics and keywords help find results. What happens when you get too many results? Instead of only searching “weather data”, Bab might search for “weather data within the last 24 hours.” When you tag or store metadata within the vector DB, you can filter through results matching that metadata — narrowing your results to only relevant ones.
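The "weather data within the last 24 hours" idea can be sketched with tagged records. The records and field names here are hypothetical; real vector databases expose similar metadata filters in their query APIs.

```python
from datetime import datetime, timedelta

# Metadata filtering sketch: each stored chunk carries a topic tag and a
# timestamp, so stale or off-topic matches are dropped before ranking.

now = datetime.now()
records = [
    {"text": "hot and sunny", "topic": "weather", "stored": now - timedelta(hours=2)},
    {"text": "10 inches of snow", "topic": "weather", "stored": now - timedelta(days=90)},
    {"text": "user name is Jake", "topic": "profile", "stored": now - timedelta(days=30)},
]

def search(topic: str, max_age: timedelta) -> list[str]:
    """Return only records matching the topic and newer than max_age."""
    return [
        r["text"]
        for r in records
        if r["topic"] == topic and now - r["stored"] <= max_age
    ]

print(search("weather", timedelta(hours=24)))
```

The 90-day-old snow record matches the topic but fails the freshness filter, so it never reaches the generator.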

Re-ranking and scoring

Bab might perform a search yielding 30 results. Maybe 25 of them are irrelevant, off topic or ads. If you add a ranking algorithm to your search, Bab can read only the highest-ranked results. This can drastically reduce the time and tokens spent on unneeded resources.
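A re-ranking pass can be sketched as a second scoring function over the retriever's candidates. The term-overlap score below is a crude stand-in; production re-rankers typically use a cross-encoder model that scores each query-document pair directly.

```python
# Re-ranking sketch: score each first-pass candidate against the query
# and keep only the top few before they reach the generator.

def rerank(query: str, candidates: list[str], top_k: int = 2) -> list[str]:
    """Keep the top_k candidates with the most query terms in common."""
    q = set(query.lower().split())

    def score(doc: str) -> int:
        return len(q & set(doc.lower().split()))

    return sorted(candidates, key=score, reverse=True)[:top_k]

candidates = [
    "buy cheap umbrellas now",            # ad
    "weather today is hot and sunny",     # relevant
    "celebrity gossip roundup",           # off topic
    "hot weather safety tips for today",  # relevant
]
print(rerank("weather today", candidates))
```

Only two of the four candidates survive, so Bab spends tokens reading the relevant results instead of the ad and the gossip.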

Multi-step retrieval

For really complex questions, Bab might need to chain several searches together. In this case, your orchestrator needs to chunk and compress each search progressively: each result is summarized and fed into the next query for contextual understanding. While complex, this lets the model work through far more material than fits in a single context window.
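The chaining idea can be sketched with a running, compressed context. The `compress` function here just truncates; a real orchestrator would summarize each step with an LLM before feeding it into the next query. The store contents are made up.

```python
# Multi-step retrieval sketch: each result is compressed and carried
# into the next search so the running context stays small.

store = {
    "5G troubleshooting": "5G has small reception areas; moving can help.",
    "reception areas": "Indoors, WIFI calling is more stable than 5G.",
}

def compress(text: str, limit: int = 60) -> str:
    """Crude stand-in for LLM summarization: keep only the first `limit` chars."""
    return text[:limit]

def multi_step(queries: list[str]) -> str:
    """Chain searches, compressing the accumulated context after each step."""
    context = ""
    for q in queries:
        result = store.get(q, "")
        context = compress(context + " " + result).strip()
    return context

print(multi_step(["5G troubleshooting", "reception areas"]))
```

Because the context is re-compressed after every hop, the chain can keep going even when the raw results would overflow the model's window.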

With more advanced methods, Bab isn’t just searching. Bab is searching wisely — saving your time and resources.

Current challenges in RAG systems

There are a variety of challenges in RAG systems and they’re not limited to LLMs. Most of these limitations apply to AI models in general.

  • Context Limitations: As mentioned before, models operate within a finite context window. Even with chunking and compression, long conversations and large retrieval sets can eventually blow past the window (even one of 1,000,000 tokens), causing the model to forget.
  • Bad or Irrelevant Chunks: Bad data needs to be kept out of the database. If Bab retrieves nonsensical data, Bab generates bad output. “The weather today is donuts!”
  • Stale Data: Vector DBs need to be managed just like any other traditional database. The user’s name should persist in a form of long term memory. Weather data needs to stay fresh. With stale data, Bab might call for 10 inches of snow in the middle of summer.
  • Latency and Cost: Every search eats clock cycles and bandwidth. Pulling lots of search results and ranking them can make Bab slow while devouring your infrastructure budget.

Conclusion

RAG today is still young. Right now, we can plug Bab into vector databases and simple tools like scrapers and weather APIs. In the future, models will be able to handle multi-step reasoning, control tools and decide which knowledge sources are trustworthy.

The trend is pretty clear. We need better orchestration, smarter retrieval and bigger context windows. People today often forget how limited AI models are in comparison to the human brain. A human can read a 1,000 page book and summarize it in one go. A model can’t, yet. For truly efficient RAG, we need context windows large enough to ingest the prompts and still have enough space for the required output.

RAG keeps your model from living in the past. It allows you to personalize your assistants. Complex RAG systems can separate short and long term data. RAG takes models from bland and boring to fresh and insightful.