ELI5: How Do LLMs Work? Part 4 - Training Day

Welcome to the fourth article in our “ELI5” series on LLMs. This series offers a candid, informed look at today’s technology—with no buzzwords allowed. If you missed Part 1, Part 2, or Part 3, you can check those out here, here, and here.

Now, at this point in the series, you might be asking: "Isn't this supposed to be ELI5? Why is it split across multiple articles? And what kind of five-year-old is reading about the inner workings of LLMs?"

I'll level with you guys: we’ve gone off the rails in this series. We’re three articles deep and still talking about abstract vector math. But the good news is that this article is going to get us back on track. Maybe.

In this fourth article, we're going to talk a bit more about the transformer architecture and how these models are actually trained and refined. Less math, more practical stuff. Let’s dive in.

Transformer Layer

At the core of the transformer encoder or decoder is something called a transformer layer. Each layer consists of two main parts — a self-attention block and a feedforward block — and the layer as a whole is often just called a transformer block.

Say our input is the sentence: “The rabbit ate the.” After tokenization and embedding, we get 4 vectors — one for each token (in this example, tokens map 1:1 to words). These vectors go into the first transformer block. The self-attention mechanism runs and produces 4 new vectors, called context vectors, which we covered earlier.

Next, the transformer block adds the original input vectors to these context vectors and applies normalization. Adding the original input back is called a residual connection; it preserves the starting information, so each transformer layer tweaks the data rather than replacing it outright. Normalization stabilizes training by avoiding the problem of exploding or vanishing values — where numbers blow up toward infinity or shrink to zero, derailing the model.

The result then flows into the feedforward block. Despite the name, it’s just a small neural network (a stack of mathematical functions applied independently to each token vector) with no interaction between tokens at this stage.

Inside the feedforward block, each vector gets expanded to a larger size, processed through activation functions, then shrunk back to its original size. These activation functions are constant mathematical functions — they don’t have any learnable parameters — but they introduce the nonlinear behavior that allows the model to learn complex patterns. Without them, the model could only perform simple linear transformations. For instance, a 768-dimensional vector might expand to 3072, get passed through one of these fixed nonlinear functions and then shrink back to 768. Common activation functions include ones like GELU, but we ain't got time for all that here.
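
If you like seeing this in code, here’s a minimal PyTorch-style sketch of that feedforward block, using the 768-to-3072 sizes from the example above. The class and variable names are just illustrative, not anyone’s official implementation.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Expand each token vector, apply a fixed nonlinearity, shrink back."""
    def __init__(self, d_model=768, d_hidden=3072):
        super().__init__()
        self.expand = nn.Linear(d_model, d_hidden)   # 768 -> 3072 (learned weights + biases)
        self.act = nn.GELU()                         # activation: no learnable parameters
        self.shrink = nn.Linear(d_hidden, d_model)   # 3072 -> 768 (learned weights + biases)

    def forward(self, x):
        # x: (batch, seq_len, d_model); each token vector is transformed independently
        return self.shrink(self.act(self.expand(x)))

x = torch.randn(1, 4, 768)          # 4 token vectors, as in "The rabbit ate the"
print(FeedForward()(x).shape)       # torch.Size([1, 4, 768])
```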

After the feedforward block, the input to that block is added back again, and normalization is applied one more time. So by now, the vectors have been enriched with context from attention, transformed nonlinearly, and stabilized with residual connections and layer normalization.
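
Putting the pieces together, here’s a rough sketch of one decoder-style transformer layer: self-attention, then a residual connection and normalization, then the feedforward block with its own residual connection and normalization. It uses PyTorch’s built-in multi-head attention for brevity, so treat it as an illustration of the structure described above, not the exact code of any real model.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_hidden=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask=None):
        # Self-attention, then add the original input back (residual) and normalize
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)
        # Feedforward block, then another residual connection and normalization
        x = self.norm2(x + self.ffn(x))
        return x

x = torch.randn(1, 4, 768)                                     # 4 token vectors
mask = torch.triu(torch.full((4, 4), float("-inf")), diagonal=1)  # "don't look ahead" mask
print(TransformerLayer()(x, mask).shape)                       # torch.Size([1, 4, 768])
```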

That’s one transformer layer. GPT-style models stack a dozen or more of these, where the magic compounds. The same applies to BERT-style models — they stack multiple transformer layers, each built from self-attention and feedforward blocks, to deepen understanding. The main difference: encoders use bidirectional self-attention (seeing all tokens), while decoders use masked self-attention (to avoid looking ahead).

So in short: a transformer layer takes in vectors — from embeddings or a previous layer — runs attention, applies some mathematical magic, and sends out a fresh set of vectors.

Further reading suggestion: Build a Large Language Model (From Scratch)

Parameters

I know where you’re at right now — four articles in, and it feels like all we’ve done is talk about vectors. You’re probably wondering: Where does this vector rainbow end? Don’t worry — we’re almost there. But first? Parameters.

A parameter (sometimes called a weight) is a trainable number inside the model — basically a knob the model can tweak to improve its performance. The more parameters a model has, the more complex patterns it can learn. GPT-2, for example, had a relatively modest 1.5 billion parameters. GPT-4? It’s rumored to have a staggering 1.76 trillion.
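
If you want to sanity-check a parameter count yourself, here’s one way to do it with the Hugging Face transformers library, assuming you have it installed. The "gpt2" checkpoint below is the small 124-million-parameter version, not the 1.5-billion-parameter one.

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")        # the small GPT-2 checkpoint
total = sum(p.numel() for p in model.parameters())     # count every trainable number
print(f"{total:,} parameters")                         # roughly 124 million
```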

During training, the model adjusts these parameters to minimize its loss function — a number that tells it how far off its predictions are from the correct answers. Lower loss = better performance. That’s how the model learns.

These parameters live inside the transformer layers — but where exactly? In several key places.

In the self-attention components, the model has three distinct sets of parameter matrices: the query (Q) weight matrix, key (K) weight matrix, and value (V) weight matrix. These are learned during training and stay fixed during inference. When a token vector enters the layer, it’s multiplied separately by each matrix to produce the Q, K, and V vectors.
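
In code, those three matrices are just three separate learned linear maps applied to the same token vector. Here’s a tiny sketch (names like W_q are mine, not a standard API):

```python
import torch
import torch.nn as nn

d_model = 768
W_q = nn.Linear(d_model, d_model, bias=False)   # query weight matrix (learned)
W_k = nn.Linear(d_model, d_model, bias=False)   # key weight matrix (learned)
W_v = nn.Linear(d_model, d_model, bias=False)   # value weight matrix (learned)

token_vec = torch.randn(1, d_model)             # one token's embedding
q, k, v = W_q(token_vec), W_k(token_vec), W_v(token_vec)
print(q.shape, k.shape, v.shape)                # three 768-dimensional vectors
```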

Next, the token vector flows into the feedforward block, which has its own set of parameters — typically two large weight matrices and some biases. First, the vector is multiplied by the first matrix, expanding its size. Then it passes through a nonlinear activation function (recall that these have no parameters), and finally through a second matrix that shrinks it back to the original size. These feedforward parameters are also learned during training and help the model transform each token’s representation in complex, nonlinear ways — but independently from the other tokens.

Aside from those two main blocks, parameters also show up in other spots — like in the normalization layers, where small learnable parameters help stabilize the data as it moves through layers. You'll also find parameters in the embedding layers (where tokens are converted into vectors) and the output projection layer (where vectors are turned into token predictions).

Almost every step of the transformer architecture has some set of learnable weights. And during training, each of them is gradually adjusted to minimize loss and make the model better at predicting the next word, phrase, or sentence.
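
To see all of those spots at once, you can tally a real model’s parameters by component. This sketch groups GPT-2’s parameters using the names Hugging Face happens to give each part (wte/wpe for embeddings, attn for attention, mlp for the feedforward blocks, ln for the normalization layers), so the bucketing relies on those naming conventions rather than anything fundamental.

```python
from collections import Counter
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
buckets = Counter()
for name, p in model.named_parameters():
    if "attn" in name:
        buckets["attention"] += p.numel()
    elif "mlp" in name:
        buckets["feedforward"] += p.numel()
    elif "ln" in name:
        buckets["layer norm"] += p.numel()
    elif "wte" in name or "wpe" in name:
        buckets["embeddings"] += p.numel()

for part, count in buckets.items():
    print(f"{part:12s} {count:,}")   # attention and feedforward dominate the total
```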

Output Generation

Alright, so we’ve got all these layers, all these parameters, and all these vectors flying around. But how does this actually turn into a prediction? How does ChatGPT, for example, decide what word comes next in a sentence?

After a token passes through the full stack of transformer layers, what comes out the other side is a single vector — the final context vector for that token. This vector is a condensed, context-rich representation of everything the model has learned about that token in relation to the others. It reflects the cumulative results of attention scores, feedforward transformations, layer normalizations, and parameter tuning.

But a vector isn’t a word — it’s just a list of numbers.

To turn that vector into an actual prediction — like the next word in a sentence — the model does one last thing: it multiplies the context vector by a big matrix called the output projection matrix. This matrix has one row for every token in the model’s vocabulary.

Why does this work? When the model multiplies the context vector by this matrix, it’s essentially checking: How similar is this context vector to each possible token? The result is a list of raw scores — one per token — showing how well each word fits the current context.

Here’s the trick: during training, the model learns to shape the context vector so that it ends up close (in vector space) to the row in the output projection matrix that corresponds to the correct next token. It does this by adjusting its parameters until the context vector lines up nicely with that token’s vector. The better the alignment, the higher the score after the multiplication.

But these scores aren’t probabilities yet — they’re just numbers. To turn them into actual probabilities, the model runs them through the Softmax function. Softmax takes all the scores, boosts the highest ones, shrinks the lower ones, and turns them into values between 0 and 1 that add up to 1. Now the model has a probability for every possible token.
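
Here’s that last step in miniature: multiply the final context vector by the output projection matrix, then softmax the raw scores into probabilities. The four-word vocabulary and random numbers are made up purely for illustration.

```python
import torch

vocab = ["carrot", "lettuce", "grass", "motorcycle"]
context_vec = torch.randn(768)             # final context vector for the last token
W_out = torch.randn(len(vocab), 768)       # output projection: one row per vocabulary token

scores = W_out @ context_vec               # raw scores (logits), one per token
probs = torch.softmax(scores, dim=0)       # values between 0 and 1 that sum to 1
for tok, p in zip(vocab, probs):
    print(f"{tok:10s} {p.item():.3f}")
```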

Whichever token has the highest probability is usually selected — unless the model is using sampling or temperature (low temperature boosts the top probabilities, high temperature means more randomness) to add variety. That’s the final output. So, given something like “The rabbit ate the,” the model might assign high probabilities to “carrot,” “lettuce,” or “grass,” and very low ones to things like “motorcycle.”
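
And a quick sketch of temperature: divide the scores by the temperature before the softmax, then either take the top token or sample from the resulting distribution. The scores below are invented.

```python
import torch

logits = torch.tensor([4.0, 3.5, 3.0, -2.0])   # made-up scores for carrot/lettuce/grass/motorcycle

def next_token_probs(logits, temperature=1.0):
    return torch.softmax(logits / temperature, dim=0)

print(next_token_probs(logits, temperature=0.2))   # sharply favors the top score
print(next_token_probs(logits, temperature=1.5))   # flatter, so sampling gets more random

choice = torch.multinomial(next_token_probs(logits), num_samples=1)
print(choice.item())                               # index of the sampled next token
```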

And just like that, the next token is picked.

Training

But how does the model learn to make better predictions?

During training, the model is fed massive amounts of unlabeled text from its training corpora. For every input sequence, it tries to predict the next token, one at a time. After each guess, it compares its prediction to the actual next token and computes something called loss — a number that measures how wrong it was.

So how does the model measure “how wrong” it was? When predicting the next word, the model assigns a probability to every token in its vocabulary. Ideally, the correct token gets a probability close to 1 and every other token gets a probability near 0. The gap between the model’s guess and the truth is measured with cross-entropy loss, which looks at the probability the model gave to the correct token: if that probability is high, the loss is small; if it’s low, the loss grows larger. The result is a single number — zero means a perfect guess, and bigger numbers mean worse mistakes. In simple terms, the loss punishes the model more when it’s less confident about the right answer.
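
In code, cross-entropy loss boils down to the negative log of the probability the model assigned to the correct token. A tiny made-up example:

```python
import torch
import torch.nn.functional as F

# Raw scores over a toy 4-token vocabulary; index 0 ("carrot") is the correct next token
logits = torch.tensor([[4.0, 1.0, 0.5, -2.0]])
target = torch.tensor([0])

loss = F.cross_entropy(logits, target)
prob_correct = torch.softmax(logits, dim=-1)[0, 0]
print(f"p(correct) = {prob_correct.item():.3f}, loss = {loss.item():.3f}")  # loss == -log(p)
```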

After calculating the loss, the model uses backpropagation to figure out how much each parameter contributed to the mistake. It does this by computing gradients — signals that show how changing each parameter would affect the loss. Parameters with bigger gradients need bigger adjustments. Using these gradients, the model slightly tweaks its parameters to reduce the loss, improving future predictions.

This cycle — predict, compare, correct, repeat — happens over and over again, billions of times across billions of tokens. Each iteration nudges the model toward slightly better performance. Over time, those tiny adjustments accumulate into a model that’s shockingly good at predicting what comes next.
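
Here’s the predict-compare-correct cycle stripped to its bones. The “model” below is a deliberately silly toy (an embedding plus one linear layer), so this illustrates the loop, not a real training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

tokens = torch.randint(0, vocab_size, (1, 16))      # a pretend training sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]     # predict each next token

for step in range(200):
    logits = model(inputs)                                        # predict
    loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                           targets.reshape(-1))                   # compare
    optimizer.zero_grad()
    loss.backward()                                               # gradients for every parameter
    optimizer.step()                                              # correct: tiny nudge to reduce loss

print(f"final loss: {loss.item():.3f}")   # drops as the model memorizes this tiny dataset
```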

In GPT-style models, this process is always left-to-right: predict the next token, then the one after that, and so on. In BERT-style models, it’s a bit different — the model tries to fill in missing words from both directions — but the idea is the same: guess a token, measure how wrong you were, and adjust the parameters to get it right next time.

That’s training in a nutshell — millions of micro-corrections that add up to a model that sounds like it actually understands language.

Conclusion

In short, training an LLM is a repeated cycle of guessing the next word, measuring the error, and fine-tuning millions or billions of parameters across the transformer’s layers. This continuous learning transforms raw data into a model that generates surprisingly fluent and meaningful text—almost like it understands language, even though it’s really just math and pattern recognition.

In the final article in this series, we’ll wrap things up by covering some final LLM concepts you’ve probably heard of. Don’t worry—no more vector math. Promise. In case you missed Part 1, Part 2, or Part 3, you can check those out here, here, and here.

See you in Part 5.

The team at /dev/null Digest is dedicated to offering lighthearted commentary and insights into the world of software development. Have opinions to share? Want to write your own articles? We’re always accepting new submissions, so feel free to contact us.
