ELI5: How Do LLMs Work? Part 3 — Transformers: Robots in Disguise
Welcome to the third article in our “ELI5” series on LLMs. This series offers a candid, informed look at today’s technology with no buzzwords allowed. If you missed Part 1 or Part 2, you can check those out here and here.
In this article, we’ll break down transformers: what they are, why they matter, and how they help LLMs understand and generate language. The topic can get extremely technical and dense, but hey, we're going to do our best. Let’s dive in.
What is a Transformer?
A transformer is a kind of robot action figure that turns into a car or a truck or something. They were very popular in the 80s, 90s, and 2000s. But you're probably here for the kind of transformer that powers LLMs, so fine—we'll talk about that instead.
Most modern LLMs use something called the "transformer architecture," which was introduced in the 2017 paper Attention Is All You Need by a group of researchers at Google. This paper is credited as a major turning point in the field of AI development and is now one of the most cited research papers of the 21st century. It's pretty academic, but worth a read if you're into that kind of thing.
Put simply, a transformer is an architecture pattern for building deep learning models.
Most modern LLMs you’ve heard of, like GPT, use the transformer architecture. People often use “transformers” and “LLMs” interchangeably, but that’s not entirely accurate. Transformers are a general-purpose architecture that can be used in many AI applications beyond LLMs, such as computer vision.
Okay, so a transformer is just an architecture pattern. I’ve been around the block. I know about architecture patterns. I lived through the "microservices" phase. So what does a transformer actually do?
At the core of a transformer are two main components: encoders and decoders. Encoders read and understand the input data. Decoders generate output, like predicting the next word in a sentence. Both the encoder and decoder work with tokenization, embedding vectors, and all the foundational pieces we covered earlier.
What really sets transformers apart is a mechanism called attention, specifically self-attention. We'll talk more about this in a bit, but briefly, attention is a technique that helps the model decide which parts of the input are most relevant to each word it is processing. It allows the model to consider relationships between words across the entire sentence, rather than just looking at them in order.
GPT, BERT, and Transformer Variations
I’ll give you one guess what the “T” in GPT and BERT stands for. It’s transformer. GPT stands for Generative Pre-trained Transformer, and BERT stands for Bidirectional Encoder Representations from Transformers.
These are two variants of the transformer architecture, each optimized for different tasks. As you might have guessed, GPT models focus on generating text and lean on the decoder part of the transformer architecture. BERT, on the other hand, is designed for understanding tasks like sentiment analysis and text classification, and leans on the encoder part. Many modern GPT-style models skip the encoder entirely, using only the decoder for next-word generation.
The first GPT was introduced by OpenAI in 2018. BERT was introduced by Google in 2018 as well. Since then, organizations like Meta have released GPT-style models such as LLaMA and BERT variants like RoBERTa.
Attention
Attention is a technique that tries to figure out the importance of each part of a sequence relative to the other parts. Basically, when I’m looking at one part of a sentence and trying to understand it, attention helps decide what other parts of the sentence I should also be looking at, and how important those parts are in shaping the meaning of the part I’m focused on.
Let’s take the sentence: "The rabbit sat on the floor because it was lazy." When the model gets to the word "it", it needs to decide what "it" refers to. The floor? Probably not. More likely, it refers to the rabbit. But how do you know that? It’s because your brain is doing something similar to what the attention mechanism does. You use context clues and your own experience with language to figure out that "rabbit" makes more sense. In your mind, you give more importance (attention calls this "weight") to "rabbit". Attention works the same way, except it uses some sophisticated math to model that weighting. It helps the model assign more relevance to certain words, like "rabbit", when trying to understand the sentence. This weighting helps the model capture relationships and dependencies between words, which is how it interprets meaning.
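To make that weighting idea concrete, here's a tiny Python sketch. The numbers are completely made up for this example - a real model learns its weights during training - but they show the basic shape: every word gets a weight, the weights sum to 1, and "rabbit" gets the biggest share.

```python
# Hypothetical attention weights for the token "it" in
# "The rabbit sat on the floor because it was lazy."
# These numbers are invented for illustration; a real model learns them.
attention_weights = {
    "The": 0.02, "rabbit": 0.55, "sat": 0.05, "on": 0.02, "the": 0.02,
    "floor": 0.10, "because": 0.04, "it": 0.10, "was": 0.05, "lazy": 0.05,
}

# The weights behave like a probability distribution: they sum to 1.
assert abs(sum(attention_weights.values()) - 1.0) < 1e-6

# "rabbit" has the highest weight, so it influences the meaning of "it" the most.
print(max(attention_weights, key=attention_weights.get))  # rabbit
```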
Attention is the special sauce, so to speak, that makes LLMs work.
How Does Attention Work?
Attention is usually the hardest part for people to grasp about LLMs. It certainly was for me. It involves a non-trivial amount of math and some abstract thinking. If you're reading this and thinking, "Attention is easy and fun!" - then congratulations. Now shut up, nerd.
Self-attention is the mechanism that allows each token in an input sequence to consider, or “attend to,” all the other tokens in the input sequence when computing its own representation. "Regular" attention can involve two separate input sequences, while self-attention is named that way because it operates within a single sequence of inputs.
In our earlier example, "The rabbit sat on the floor because it was lazy," each word gets to “look at” every other word to decide what matters. This is what allows the model to resolve references like "it" and understand how different parts of the sentence relate to one another.
In self-attention, the goal is to compute a context vector for each element in the input sequence. A context vector is essentially an enriched embedding vector: it represents the current element and its positional information, but it also folds in the same information from every other element in the sequence.
It "considers" that information by looking at the embedding vectors of every other element in the input sequence. There are many ways to compute attention, and we'll walk through them in the next sections. The output — the context vector — is a new, richer representation of the original token. It blends information from across the entire sequence, weighted by relevance. This is how the model understands nuance and relationships: every word becomes aware of what’s happening around it.
Further reading suggestion: Hands-On Large Language Models.
Scaled Dot Product Attention
Attention mechanisms differ in how they calculate the final context vector, and some methods get quite sophisticated.
One of the most popular methods is called scaled dot-product attention, introduced in the Attention Is All You Need paper we mentioned earlier.
Here’s how scaled dot-product attention works, step by step:
- First, every token’s original embedding vector is transformed into three different vectors called query, key, and value vectors. These are not pulled from thin air — they come from multiplying the original embedding vector by a set of learned weight matrices that are updated during training.
- Next, we calculate the attention score between the query vector of the token we are focusing on and the key vectors of all the other tokens. To do this, we take the dot product of that token’s query vector with the key vectors of every token in the sequence. Why dot product? The dot product measures how similar two vectors are. In attention, it tells us how well a token’s query matches another token’s key. A higher dot product means more relevance — so the model knows to focus more on that token.
- These raw scores can sometimes be very large or very small, which can make training unstable. To fix this, we divide the scores by the square root of the dimension of the key vectors. This scaling keeps the values in a stable range - and is also why it's called "scaled" dot-product attention.
- Then, we apply a softmax function to these scaled scores to turn them into probabilities that sum to 1. This step turns the raw scores into attention weights, telling us how much focus each token deserves relative to the current one.
- Finally, to get the context vector for the token we are processing, we multiply each token’s value vector by its corresponding attention weight. Tokens that scored higher contribute more to the final context vector. We sum up these weighted value vectors, resulting in a new, context-rich vector that combines information from across the entire sequence, weighted to emphasize the most relevant parts.
That context vector is what the model uses to represent the token with all the relevant information it needs to understand the sentence. In short, scaled dot-product attention lets the model be very smart about which parts of the sentence to focus on for each word, making it possible to understand complex language patterns and generate coherent, meaningful text.
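If you'd like to see those steps in code, here is a minimal NumPy sketch of scaled dot-product attention. The embeddings and weight matrices are random placeholders standing in for values a real model learns during training; it's meant to show the shape of the computation, not a production implementation.

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """A minimal sketch of scaled dot-product attention for one sequence.

    X is a (sequence_length, embedding_dim) matrix of token embeddings.
    W_q, W_k, W_v stand in for learned weight matrices; here they are
    just random placeholders since nothing is being trained.
    """
    # Step 1: project each embedding into query, key, and value vectors.
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v

    # Step 2: attention scores via dot products between queries and keys.
    scores = Q @ K.T

    # Step 3: scale by the square root of the key dimension to keep
    # the scores in a stable range.
    scores = scores / np.sqrt(K.shape[-1])

    # Step 4: softmax turns each row of scores into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Step 5: each context vector is the weighted sum of the value vectors.
    return weights @ V

# Toy usage: 5 tokens, 8-dimensional embeddings, random "learned" weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
context = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(context.shape)  # (5, 8): one context vector per token
```

Each row of the returned matrix is the context vector for one token, built exactly the way the steps above describe.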
Other Types of Attention
Scaled dot-product attention is the go-to in modern transformers, but it's not the only kind of attention at the party. For example, Bahdanau attention came even before scaled dot-product attention, back in 2015, and was used in early translation models. It’s slower but helped pave the way for everything that came after. Multi-head attention runs multiple attention calculations in parallel so the model can focus on different parts of the sequence at once. Masked (or causal) attention is used in models like GPT to prevent the model from looking ahead when it’s predicting the next word.
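As a quick illustration of the masking idea, here's a small NumPy sketch of a causal mask applied to some made-up attention scores. In a GPT-style model this masking happens inside every attention layer, but the principle is exactly this: future positions get a score of negative infinity, so after the softmax their weight is zero.

```python
import numpy as np

seq_len = 4
# Made-up attention scores between 4 tokens (rows = queries, columns = keys).
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

# Causal mask: True above the diagonal marks "future" positions.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax row by row; masked positions end up with a weight of exactly zero.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # each token only attends to itself and earlier tokens
```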
You could spend years studying attention, and plenty of people have. But we’ve covered the essentials here: what attention is, why it matters, and how it helps LLMs like ChatGPT understand language.
We’ve been through a lot in this article. Let's wrap it up.
Conclusion
Transformers are the foundation of modern LLMs, and attention is the secret sauce that makes it all work. Memorizing all the math and the specifics of every attention mechanism is extremely difficult (and probably unnecessary). To make things more fun, every LLM you encounter probably has its own flavor of attention.
In the next part of this series, we'll finally start to break out of this abstract vector nonsense and talk about how models use these concepts to train and actually start doing some magic.
As always, no buzzwords, no fluff, and no entertainment. See you in Part 4.

The team at /dev/null Digest is dedicated to offering lighthearted commentary and insights into the world of software development. Have opinions to share? Want to write your own articles? We’re always accepting new submissions, so feel free to contact us.