How Large Language Models Actually Work (Plain English)
A calm, honest explainer on how large language models work: tokens, training, parameters, fine-tuning, context windows, and why LLMs hallucinate.

The other day I typed half a sentence into a chat box: "The best way to learn anything is to..." and before I finished, the cursor was practically begging to write "practice it every day." I deleted that and tried "The capital of Tanzania is...", and the model happily completed it with "Dodoma." Two very different sentences. Same trick underneath. The thing on the other side wasn't reaching into a database of facts or thinking about my question the way you and I think. It was doing one thing, extremely well, billions of times: guessing what text comes next.
That's the whole secret, and also where most of the confusion starts. So let me slow down and actually teach it, because "it's just fancy autocomplete" is true enough to be useful and wrong enough to be dangerous.
The one job: predict the next chunk of text
A large language model — an LLM — is a program trained to do a single task: given some text, predict what comes next. Not the whole answer at once. Just the next little piece. Then it adds that piece, looks at everything again, and predicts the next piece after that. One step at a time, like someone laying bricks, except each brick is chosen by asking "what brick usually follows these bricks?"
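If it helps to see the brick-laying loop spelled out, here is a minimal Python sketch. The "model" is a deliberately tiny stand-in: a hypothetical table of which word followed which in a made-up one-line corpus. It is not the billions of learned dials a real LLM uses, and it works in whole words rather than tokens. What it shares with the real thing is the shape of the loop: predict one piece, append it, look again.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus (hypothetical; real models train on trillions of tokens).
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# "Training" for this toy: count which word follows which.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(context):
    """Return the most frequent follower of the last word in the context."""
    candidates = follows.get(context[-1])  # .get does not create empty entries
    if not candidates:
        return None  # nothing ever followed this word in the toy corpus
    return candidates.most_common(1)[0][0]

def generate(prompt, max_new=5):
    """Autoregressive loop: predict one piece, append it, repeat."""
    text = prompt.split()
    for _ in range(max_new):
        nxt = predict_next(text)
        if nxt is None:
            break
        text.append(nxt)
    return " ".join(text)

print(generate("the cat"))
```

This toy only ever looks back one word, so it soon goes in circles ("the cat sat on the cat sat"). That is the phone-autocomplete problem in miniature. Being able to look back across thousands of tokens is part of what lets an LLM stay on track.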
Yes, that sounds like the autocomplete on your phone. But your phone's autocomplete looks back maybe two or three words and offers three stiff guesses. An LLM looks back across thousands of words, and its sense of "what usually comes next" was shaped by reading a meaningful slice of everything humans have written down. The difference in scale is so large it stops being a difference in degree and becomes a difference in kind. Autocomplete finishes your word. An LLM can finish your argument, your poem, your code, or your excuse — because somewhere in all that text, patterns of reasoning and style and structure got baked in alongside the patterns of grammar.
Tokens: the bricks aren't words
When I say "next chunk of text," I'm being careful, because models don't actually work in words. They work in tokens. A token is a small piece of text — sometimes a whole word, often a fragment. "Learning" might be one token; "unbelievable" might split into "un," "believ," and "able." Common words get their own token; rare words get chopped up.
Why bother? Because it's a tidy compromise. Whole words would need a dictionary of millions of entries, most barely used. Single letters would force the model to spell everything out painfully, one character at a time. Tokens sit in the middle: a vocabulary of tens of thousands of pieces (some newer models use more than a hundred thousand) that can assemble into any text, including words the model has never seen. So when you read that a model "predicts the next token," picture it choosing the next small brick from a fixed bin of brick shapes.
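To make the brick bin concrete, here is a simplified tokenizer in Python. It uses greedy longest-match against a small hypothetical vocabulary. Real tokenizers learn their vocabularies from data, commonly with schemes such as byte-pair encoding, so treat this as an illustration of the idea rather than a description of how any particular model splits text.

```python
import string

# Hypothetical vocabulary: a few common pieces plus every single lowercase letter,
# so any lowercase word can still be spelled out if no bigger piece matches.
VOCAB = {"learning", "un", "believ", "able", "token"} | set(string.ascii_lowercase)

def tokenize(word):
    """Greedy longest-match: repeatedly take the longest vocabulary piece at the front."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible piece first, shrinking until one is in VOCAB.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[i]!r}")
    return pieces

print(tokenize("learning"))      # ['learning']
print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("tokens"))        # ['token', 's']
print(tokenize("zebra"))         # ['z', 'e', 'b', 'r', 'a']
```

The last line shows the fallback at work. A word the vocabulary has never seen still gets tokenized, just into more, smaller bricks.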
Training: reading the library, then being quizzed forever
So how does a pile of math learn what brick comes next? Through training, which is less mysterious than it sounds. You take an enormous amount of text — books, websites, articles, code, conversations — and you play a relentless guessing game with it.
The process hides the next token and asks the model to predict it. The model guesses, or more precisely, it spreads its bets by assigning a probability to every token it could pick. You check how much of that bet landed on the real next token. If it wasn't nearly all of it, you nudge the model's internal numbers a tiny bit so it would bet slightly better next time. Then you do that again. And again. Trillions of times. No human grades the answers; the text itself is the answer key, because the real next token is always sitting right there.
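The guessing game can be sketched for a single moment of training. Below, a hypothetical toy model has just three possible next tokens and one score (a "dial") for each, all for one fixed context. Each step turns the scores into probabilities, measures the miss with cross-entropy (a standard loss for next-token prediction), and nudges every score using that loss's gradient. Real training does this at every position of enormous batches of text, across billions of dials, with far more elaborate machinery.

```python
import math

# Hypothetical toy: three possible next tokens for one fixed context,
# e.g. "It was raining, so I grabbed my ...".
vocab = ["umbrella", "sunscreen", "sandwich"]
logits = [0.0, 0.0, 0.0]  # the model's dials: one score per candidate token

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(scores, target, lr=0.5):
    """One round of the guessing game: guess, compare with the answer key, nudge."""
    probs = softmax(scores)
    loss = -math.log(probs[target])  # cross-entropy: small when the real token got high probability
    # Gradient of cross-entropy w.r.t. each score: prob - 1 for the real token, prob for the rest.
    new_scores = [
        s - lr * (p - (1.0 if i == target else 0.0))
        for i, (s, p) in enumerate(zip(scores, probs))
    ]
    return new_scores, loss

target = vocab.index("umbrella")  # the real next token, taken from the text itself
losses = []
for _ in range(20):
    logits, loss = train_step(logits, target)
    losses.append(loss)

print(f"loss went from {losses[0]:.3f} to {losses[-1]:.3f}")
print(f"P(umbrella) is now {softmax(logits)[target]:.2f}")
```

Notice that the model gets nudged even on rounds where "umbrella" was already its top guess. The loss keeps asking for more confidence in the right answer, not just a correct ranking.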
After enough rounds of this, something quietly remarkable happens. To get good at predicting the next word across billions of examples, the model is forced to absorb the patterns underneath the words — that questions tend to get answers, that "2 + 2 =" tends to be followed by "4," that a story about rain often mentions umbrellas. Nobody programmed those rules in. They fell out of the guessing game.
Parameters: the dials, and why there are so many
Those "internal numbers" I keep nudging have a name: parameters. You can think of a parameter as a single tiny dial. Each dial slightly influences how the model turns its input into a prediction. Modern LLMs have billions of these dials.
Here's an analogy that helped me. Imagine a mixing board in a recording studio, but instead of fifty knobs it has billions, and every knob affects every sound a little. Training is the long, automated process of turning each knob to the position that makes the output sound right. No single dial means anything on its own — "this one is the dial for sarcasm" isn't a real thing. Understanding lives in the combination, spread across the whole board. The number of parameters is roughly how much room the model has to store patterns. More dials, more capacity — which leads to the strangest part of this story.
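To get a feel for why the dial count climbs into the billions, it helps to count the dials in one common building block: a fully connected layer, where every input is wired to every output. The sizes below are hypothetical round numbers, not taken from any specific model.

```python
def dense_layer_params(n_in, n_out):
    """Count the dials in one fully connected layer."""
    weights = n_in * n_out  # one weight for every input-output pair
    biases = n_out          # one extra offset per output
    return weights + biases

print(dense_layer_params(3, 2))              # 8
print(dense_layer_params(4096, 4096))        # 16781312
print(100 * dense_layer_params(4096, 4096))  # 1678131200
```

A hundred such layers already hold about 1.7 billion dials. Real architectures mix several kinds of layers, but this multiplication is why the totals grow so quickly.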
Why scale produced surprises
For a long time, the honest expectation was that bigger models would be a bit better — smoother text, fewer mistakes. What actually happened surprised the people building them. As models grew larger and trained on more text, they started doing things no one explicitly trained them to do: translating between languages, solving word problems, writing working code, explaining jokes. Abilities seemed to appear once a model crossed some size threshold, as if a light flicked on.
I want to be honest and unhyped here: researchers still argue about exactly how and why this happens, and some of those "sudden" jumps look smoother when you measure them carefully. But the broad pattern is real and it reshaped the field. Scale — more data, more parameters, more computation — turned out to buy capabilities, not just polish. That discovery is most of why the last few years felt like a sudden explosion rather than a slow climb.
Training versus inference: two very different moments
There's a distinction that clears up a lot of confusion once you have it. Training is the slow, brutally expensive phase I described — months of computation, the guessing game, the dial-turning. It happens once (or occasionally, for updates), in a data center, long before you ever touch the thing.
Inference is what happens when you actually use the model. The dials are now frozen. You type a prompt, and the model runs its prediction loop to produce tokens for you. It is not learning from you. Your clever question does not update its parameters. When people say "the AI learned from our chat," they're usually wrong about the mechanism — the model isn't changing. It's a trained system being run, like a pressed vinyl record being played, not re-recorded.
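The frozen-dials point can be shown in a few lines of Python. This hypothetical toy stores its "trained" next-word scores in read-only mappings from the standard library's `types.MappingProxyType`, so running it cannot change them. A real model's weights are not literally stored this way, but in ordinary use they are only read during inference, never updated.

```python
from types import MappingProxyType

# Hypothetical "trained" scores: for each word, how strongly each next word is favored.
# MappingProxyType wraps a dict in a read-only view, standing in for frozen weights.
PARAMS = MappingProxyType({
    "good": MappingProxyType({"morning": 3.0, "night": 2.0}),
    "morning": MappingProxyType({"everyone": 2.5, "!": 1.0}),
})

def infer(prompt_word, steps=2):
    """Run the prediction loop; it only reads PARAMS, never writes them."""
    out = [prompt_word]
    for _ in range(steps):
        options = PARAMS.get(out[-1])
        if not options:
            break
        out.append(max(options, key=options.get))  # highest-scoring next word
    return out

print(infer("good"))  # ['good', 'morning', 'everyone']

try:
    PARAMS["good"] = {"bye": 9.0}  # trying to "learn from the chat"
except TypeError:
    print("parameters are read-only during inference")
```

Run `infer` as many times as you like. The scores it reads are the same afterward as before, which is the vinyl record being played rather than re-recorded.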
Fine-tuning and RLHF: turning a raw model into a helpful one
Here's something that genuinely surprised me when I learned it. The raw model that comes out of all that training is not the polite assistant you chat with. Trained purely to predict text, it tends to ramble, mimic, dodge, or cheerfully continue an ugly prompt — because all of that exists in the text it learned from. It's fluent but feral.
Taming it takes two more steps. Fine-tuning continues training on a smaller, curated set of examples — say, well-behaved question-and-answer pairs — to push the model toward being a helpful assistant rather than a generic text-continuer. Then comes RLHF, reinforcement learning from human feedback. Humans compare the model's possible answers and mark which they prefer: more helpful, more honest, less harmful. That preference signal is used to nudge the model toward the kinds of answers people actually want. RLHF is a big part of why a modern assistant feels cooperative and stays within bounds, instead of just autocompleting whatever you started.
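One widely used way to turn "which answer do you prefer?" into numbers is to train a separate reward model on the comparisons with a pairwise loss, −log σ(r_preferred − r_rejected), where σ is the sigmoid function and each r is the reward model's score for an answer. The sketch below assumes that formulation and uses made-up scores. It shows only the loss, not the reward model itself or the reinforcement-learning step that later tunes the assistant against it.

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """Pairwise loss: small when the human-preferred answer scores higher."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Hypothetical scores a reward model might give two candidate answers.
print(round(preference_loss(2.0, -1.0), 3))  # preferred answer ranked higher: small loss
print(round(preference_loss(-1.0, 2.0), 3))  # ranked the wrong way round: large loss
```

Training the reward model means nudging its dials to shrink this loss across many human comparisons, so that its scores come to agree with what people preferred.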
The context window: a desk, not a memory
When you chat with a model, everything it can "see" at that moment — your prompt, the conversation so far, any documents you pasted — lives in its context window. Think of it as the size of the desk it's working on. A bigger desk means it can keep more pages in front of it at once.
But two things trip people up. First, the desk has an edge. Push past the limit and the earliest pages slide off — which is why a very long conversation can make a model "forget" how it started. Second, the desk is wiped clean between sessions. Close the chat, open a new one, and the model has no memory of you. Whatever felt like it "remembered" you was either re-fed into the new context or stored by the surrounding app, not by the model itself.
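The edge of the desk can be sketched in Python. Apps handle overflow in different ways, so this assumes the simplest policy: drop the oldest tokens. The chat and the ten-token window are hypothetical, and the "tokens" are whole words for readability.

```python
def fit_to_window(tokens, window_size):
    """Keep only the most recent tokens that fit; older ones slide off the desk."""
    if window_size <= 0:
        return []  # guard: tokens[-0:] would wrongly return the whole list
    return tokens[-window_size:]

# Hypothetical chat, split into word-sized "tokens".
conversation = "hi I'm Sam . I love jazz . what music do I like ?".split()

visible = fit_to_window(conversation, 10)
print(visible)
print("Sam" in visible)
```

The toy can still see that the user loves jazz, but the name from the first line has slid off. A brand-new session would start from an empty list.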
Why they hallucinate (and what they are not)
Now the part everyone needs and few are told plainly. LLMs hallucinate — they state false things with total confidence — and once you understand the prediction loop, you can see why it's not a bug they'll simply patch out.
The model was trained to produce plausible text, not true text. Those usually overlap, which is why it's right so often. But when they diverge (an obscure fact, a citation that doesn't exist, a name it half-remembers), the model has no reliable internal alarm that says "I don't actually know this." It just keeps choosing a likely-looking next token. A fabricated but plausible answer and a correct one feel identical from the inside, because the model isn't directly tracking truth. It's tracking what sounds like the kind of text that should come next.
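To see why confidence and correctness come apart, here is a hypothetical Python sketch of greedy decoding, one common way to choose the next token. A model does produce a probability for each candidate, but those numbers measure how plausible a continuation looks, not whether it is true, and greedy decoding hands you a single token either way. The scores are invented for illustration. "Freedonia" is a fictional country, so no answer to that prompt can be true.

```python
import math

def softmax(scores):
    """Turn a dict of raw scores into probabilities that sum to 1."""
    m = max(scores.values())
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def pick_next(scores):
    """Greedy decoding: return the single most likely token, however unsure the model is."""
    probs = softmax(scores)
    return max(probs, key=probs.get), probs

# Hypothetical scores for "The capital of Tanzania is" and "The capital of Freedonia is".
known = {"Dodoma": 6.0, "Arusha": 2.0, "Nairobi": 1.0}
unknown = {"Fredville": 1.2, "Port Sylvan": 1.0, "Marlow": 0.9}  # all made up

token_known, probs_known = pick_next(known)
token_unknown, probs_unknown = pick_next(unknown)
print(token_known, round(probs_known[token_known], 2))
print(token_unknown, round(probs_unknown[token_unknown], 2))
```

Both runs produce a name. The second is a weak guess about a place that does not exist, yet once it is written out as a sentence, it reads exactly as assured as the first.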
Which brings me to what these systems are not. They do not "know" things the way you know your own name. They do not understand meaning the way a person does, with a body and a life and stakes. They have no beliefs, no intentions, no awareness that there's a "you" on the other side. A model that writes a heartfelt apology feels nothing. It is an extraordinary pattern-matcher trained on human expression, and it can be genuinely, life-changingly useful — but treating its confidence as knowledge is the mistake that gets people burned. Use it like a brilliant, fast, occasionally confidently-wrong assistant. Check anything that matters.
Frequently asked questions
- Is an LLM just a fancy autocomplete?
- At the core mechanism, yes: it predicts the next token. But that framing undersells what scale does. Predicting the next token across a huge slice of human writing forces the model to absorb patterns of reasoning, style, and structure. So it's autocomplete in the way a symphony is just air pressure changes: technically true, but missing what emerges.
- Does the model learn from my conversations?
- Not during the chat itself. When you use a model (inference), its parameters are frozen and it isn't updating from your messages. Some companies may later use collected data to train future versions, but that's a separate, deliberate process, not the model learning live from you in the moment.
- Why does it sometimes make up facts so confidently?
- Because it's trained to produce plausible text, not verified truth. It has no reliable internal check on whether a claim is true, so a fabricated answer and a correct one are produced the exact same way. This is called hallucination, and it's a side effect of how the system works, not a simple bug.
- What's the difference between parameters and the context window?
- Parameters are the model's permanent, trained knowledge: billions of fixed dials set during training. The context window is temporary working memory for one conversation: the text the model can currently see. Parameters are what it learned; context is what it's looking at right now.
- Does a bigger model always mean a better one?
- Not always. Scale tends to buy more capability, but a smaller model that's well fine-tuned and aligned can beat a larger, rawer one for everyday use. Size is one ingredient. Training data quality, fine-tuning, and RLHF matter enormously for how useful a model actually feels.
Further reading on this site
- What Is Machine Learning? — the broader field that LLMs grow out of.
- Claude vs ChatGPT — how to choose between the assistants built on these models.
- Browse Technology — more plain-English explainers like this one.
If this made LLMs click for you, subscribe to the newsletter and I'll send the next plain-English breakdown straight to your inbox.