What Is RAG (Retrieval-Augmented Generation)? Plain English

A calm, plain-English guide to RAG (retrieval-augmented generation): why it exists, how the pipeline works, where it's used, and its honest limits.

Happyness Mallya · 9 min read
Retrieval-augmented generation (RAG) — an abstract layered data structure
Photo by Alina Grubnyak on Unsplash

I once asked an AI chatbot a question about my own company's refund policy. It answered instantly, confidently, and completely wrong. It invented a 14-day window that didn't exist, in a tone so calm I almost believed it over my own documents. The problem wasn't that the model was stupid. The problem was that it had never read my refund policy. It was answering from memory, the way you'd answer a trivia question at a dinner party: from whatever happened to stick in your head, with no chance to look anything up.

That gap, between a model's stale memory and the specific, current truth you actually need, is exactly what RAG was built to close. So let me walk you through what it is, in plain English, using an exam everyone has sat through.

Closed-book exam vs open-book exam

Picture two versions of the same test.

In a closed-book exam, you walk in, sit down, and answer purely from memory. If you studied the right things and remembered them, you do well. If the question is about something you never learned, or learned wrong, or learned five years ago and it's since changed, you're stuck. You'll either guess or, if you're the confident type, make something up that sounds right.

In an open-book exam, you're allowed to bring your notes. When a question comes up, you flip to the relevant page, read the actual passage, and then write your answer based on what's in front of you. You still need to be smart enough to understand the notes and phrase a good answer, but you're no longer relying on memory alone. You're grounded in the real material.

A plain large language model, on its own, is taking a closed-book exam. It answers from what it absorbed during training, which is a frozen snapshot of the past that contains nothing about your private documents.

RAG turns it into an open-book exam. Before the model answers, you first retrieve the relevant pages from a source you trust (your own documents, your knowledge base, your latest data), and you hand those pages to the model along with the question. That's the whole idea behind the name: retrieval-augmented generation. You retrieve real information, you augment the prompt with it, and then the model generates its answer grounded in what you gave it.

Why RAG exists

If you've read how large language models actually work, you know a model's knowledge is baked in at training time and then frozen. That frozen-ness causes three very practical headaches, and RAG addresses all three.

It reduces hallucination. When a model has no real source in front of it, it fills gaps with plausible-sounding inventions, like my phantom refund policy. Give it the actual document and ask it to answer from that, and it has far less room to fabricate. It's reading instead of guessing.

It adds private and current knowledge. The model was never trained on your internal handbook, last night's support tickets, or this morning's prices, and it never will be. Retrieval lets you feed it that material on the fly, so it can answer about things it has genuinely never "seen" before.

It's far cheaper than retraining. The brute-force alternative is to retrain or fine-tune the model every time your information changes. That's slow and expensive, and you'd be doing it constantly. With RAG, the model stays exactly as it is. You just update the notes it's allowed to read, which can be as simple as adding a file and indexing it.

The pipeline, step by step

RAG sounds clever, and it is, but the machinery is a handful of plain steps. There are two phases. First you prepare your notes (you do this ahead of time, and again whenever the documents change). Then you answer questions (this happens every time someone asks).

Preparing the notes:

  1. Chunk. You can't shove an entire 300-page manual at a model for every question, and you wouldn't want to. So you slice your documents into bite-sized passages, maybe a few paragraphs each. Each chunk is roughly one "page" you might later flip to.

  2. Embed. This is the one genuinely unusual step, so let me slow down. An embedding is a way of turning a chunk of text into a long list of numbers that captures its meaning. Two passages about cancellation fees end up with similar numbers, even if they share no exact words, while a passage about office parking lands somewhere completely different. Think of it as giving every chunk a precise location on a vast map of meaning, where related ideas sit close together.

  3. Store in a vector database. Those number-lists (the "vectors") get saved in a special database built to do one thing extremely fast: given a new point on the map, find the nearest existing points. That's a vector database, and "nearest" here means "closest in meaning."
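The three preparation steps can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: `chunk_text`, `embed`, and `VectorStore` are hypothetical names invented for this sketch, the sample text is made up, and the "embedding" is just hashed word counts, so it captures shared words rather than meaning. A real system would call a trained embedding model and a real vector database.

```python
import hashlib
import math

DIM = 64  # toy size; real embedding models use far more dimensions


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Step 1, chunk: split text into passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> list[float]:
    """Step 2, embed: toy stand-in that hashes each word into one of DIM buckets."""
    vec = [0.0] * DIM
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if not word:
            continue
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    # Scale to unit length so vectors of different-sized chunks are comparable.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class VectorStore:
    """Step 3, store: in-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.items: list[tuple[list[float], str]] = []

    def add(self, chunk: str) -> None:
        # Keep the vector next to the original text so the text can be returned later.
        self.items.append((embed(chunk), chunk))


# Made-up sample document, purely for illustration.
manual = ("Refunds are issued within 30 days of purchase. " * 3
          + "Staff parking is behind the main office. " * 3)

store = VectorStore()
for chunk in chunk_text(manual, max_words=12):
    store.add(chunk)
```

Splitting on a fixed word count is the crudest possible chunking. It can cut a sentence in half, which is one reason chunking choices affect how well retrieval works later.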

Answering a question:

  1. Retrieve by similarity. When a question arrives, you embed the question the same way, turning it into its own point on the map. Then you ask the vector database, "which stored chunks sit closest to this?" Back come the handful of passages most related in meaning to what was asked. This is the open-book moment, flipping to the right pages.

  2. Augment the prompt. You take those retrieved passages and paste them into the prompt alongside the user's question, with an instruction roughly like: "Using the context below, answer the question. If the answer isn't in the context, say so."

  3. Generate. The model reads the question and the real passages you handed it, and writes an answer grounded in that material. Often it can even cite which chunk it used, so a human can check the receipt.
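The three answering steps can be sketched the same way. Everything here is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `generate` is a placeholder that does not call any real model, and the sample chunks are made up. Only the instruction text in `build_prompt` comes from the step described above.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag of lowercase words."""
    return Counter(word.strip(".,?!'\"") for word in text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 1: embed the question and return the k most similar chunks."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Step 2: paste the retrieved passages into the prompt, numbered for citing."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Using the context below, answer the question. "
        "If the answer isn't in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def generate(prompt: str) -> str:
    """Step 3: placeholder for a language model call; returns a canned string."""
    return f"(model reply to a {len(prompt)}-character prompt)"


# Made-up sample chunks, purely for illustration.
chunks = [
    "Refunds are issued within 30 days of purchase.",
    "Staff parking is behind the main office.",
    "Support is open on weekdays.",
]
question = "How many days do I have to request a refund?"

passages = retrieve(question, chunks, k=1)
prompt = build_prompt(question, passages)
answer = generate(prompt)
```

In this toy version the refund chunk wins only because it shares the word "days" with the question. "Refund" and "refunds" do not match at all, which is exactly the gap a real embedding model is meant to close.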

That's RAG end to end: chunk, embed, store, retrieve, augment, generate. Notice the language model itself never changed. Everything it knows about your information comes from the passages that retrieval puts in front of it, and that work happens before the model is even involved.

Where you'll actually see it

You've probably used RAG already without anyone naming it.

The biggest use is chatbots that answer over a specific body of documents. A company points one at its internal handbook, product docs, and policies, and now employees can ask plain questions and get answers drawn from the real, current material instead of a generic guess. This is also the backbone of serious customer support bots, the ones that actually resolve issues rather than looping you in circles, because they're reading the genuine support articles before replying.

It's quietly behind a lot of modern search, too. Instead of handing you ten blue links and wishing you luck, a search experience can retrieve the relevant passages and have a model synthesize a direct answer with sources you can click. And RAG is a common ingredient inside AI agents, where "go and look something up before acting" is one of the most useful things an agent can do.

The honest limits

I promised plain English, and plain English means telling you where this breaks, because RAG is powerful but it is not magic.

Retrieval quality is everything. This is the line I'd tattoo on the project if I could. The model can only answer well using the passages it was handed. If your retrieval step fetches wrong or irrelevant chunks, or misses the one page that actually held the answer, the model is back to guessing, just with extra confidence. Most disappointing RAG systems aren't failing at the "generation" part. They're failing at the "retrieval" part: bad chunking, weak embeddings, or a question phrased so differently from the documents that the map points to the wrong neighbourhood. Garbage retrieved, garbage generated.
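One way to check retrieval on its own, before any model is involved, is to write down a few questions whose correct chunk you already know and count how often that chunk comes back in the top results. The sketch below is a minimal illustration of that idea, not a standard tool: `hit_rate_at_k` and `overlap_retrieve` are hypothetical names, the retriever is a deliberately weak keyword matcher, and the chunks and questions are made up.

```python
def words(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?!'\"") for w in text.lower().split()}


CHUNKS = [
    "Refunds are issued within 30 days of purchase.",
    "Staff parking is behind the main office.",
    "Support is open on weekdays.",
]


def overlap_retrieve(question: str, k: int) -> list[str]:
    """Deliberately weak retriever: rank chunks by how many words they share."""
    q = words(question)
    return sorted(CHUNKS, key=lambda c: len(q & words(c)), reverse=True)[:k]


def hit_rate_at_k(retrieve, labelled_questions, k: int) -> float:
    """Fraction of questions whose known-correct chunk appears in the top k."""
    hits = sum(1 for question, expected in labelled_questions
               if expected in retrieve(question, k))
    return hits / len(labelled_questions)


# Each pair is (question, the chunk a human decided holds the answer).
LABELLED = [
    ("Where is staff parking?", CHUNKS[1]),
    ("When is support open?", CHUNKS[2]),
    ("Is it possible to get my money back?", CHUNKS[0]),
]

rate_top1 = hit_rate_at_k(overlap_retrieve, LABELLED, k=1)
rate_top3 = hit_rate_at_k(overlap_retrieve, LABELLED, k=3)
```

The third question misses at `k=1` because it shares no words with the refund chunk, so the keyword matcher ranks it last. That is the "wrong neighbourhood" failure in miniature, and it shows up in the number before any answer is generated.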

It can still hallucinate. Grounding reduces invention; it doesn't abolish it. A model can be handed the correct passage and still misread it, blend it with its own stale memory, or quietly answer beyond what the text supports. Always treat output as a draft to verify, especially for anything that matters, like medical, legal, or financial answers.

It has a budget. You can only fit so many passages into a single prompt, so retrieval has to be genuinely selective. And every extra step (embedding, searching, augmenting) adds a little latency and cost. RAG is cheaper than retraining, but it isn't free.
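The budget can be made concrete with a small sketch: walk down the ranked passages and keep them until the next one would no longer fit. The names `estimate_tokens` and `fit_to_budget` are hypothetical, and counting words is only a crude stand-in for a real tokenizer, which would give different numbers.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: one token per word. Real tokenizers count differently."""
    return len(text.split())


def fit_to_budget(ranked_passages: list[str], max_tokens: int) -> list[str]:
    """Keep the best-ranked passages, in order, until the budget is used up."""
    selected: list[str] = []
    used = 0
    for passage in ranked_passages:
        cost = estimate_tokens(passage)
        if used + cost > max_tokens:
            break  # stop here so a weaker passage never jumps ahead of a stronger one
        selected.append(passage)
        used += cost
    return selected


# Made-up passages, already sorted from most to least relevant.
ranked = [
    "Refunds are issued within 30 days.",   # 6 words
    "Refunds go to the original card.",     # 6 words
    "Staff parking is behind the office.",  # 6 words
]
kept = fit_to_budget(ranked, max_tokens=12)
```

Stopping at the first passage that doesn't fit is one possible policy. Skipping it and trying smaller passages further down would pack the prompt more tightly, at the cost of including less relevant material.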

None of this is a reason to avoid RAG. It's the most practical way we have right now to make a general model answer reliably about your specific, current world. It just means the hard, interesting work lives in the retrieval, not in the model. Get the right pages in front of it, and a closed-book guesser becomes a careful open-book reader.

Frequently asked questions

Is RAG the same as fine-tuning a model?
No, and it's worth keeping them separate. Fine-tuning changes the model itself by training it further, which is good for teaching new behaviour or style but slow and costly to keep current. RAG leaves the model untouched and instead feeds it relevant documents at question time. RAG is the better fit when your information changes often or is private. Many systems use both.

Do I need a vector database to do RAG?
For anything beyond a toy, yes. A vector database is what makes 'find the chunks closest in meaning to this question' fast across thousands or millions of passages. For a tiny set of documents you could technically search them all every time, but that doesn't scale. The vector database is the engine that makes retrieval practical.

Why does RAG reduce hallucination instead of eliminating it?
Because grounding gives the model a real source to lean on, which removes a lot of the need to invent. But the model can still misread the passage, mix it with its own training memory, or stretch beyond what the text says. RAG makes fabrication far less likely, not impossible, so verification still matters.

What is an embedding, in one sentence?
An embedding is a list of numbers that represents the meaning of a piece of text, positioned so that passages with similar meaning end up close together, which is what lets a system retrieve by meaning rather than by exact keyword match.

When should I not bother with RAG?
If the model already knows the answer reliably from general knowledge, or your task is about style and reasoning rather than specific facts, RAG just adds complexity and latency for no gain. Reach for it when answers must come from private, current, or specific information the model was never trained on.
