ELI5: How Do LLMs Work? Part 2 - Mise En Place
Welcome to the second article in our “ELI5” series on LLMs — a candid, informed look at today's AI tech with no fluff and no buzzwords. If you missed Part 1, you can check that out here.
Don’t worry, we’re not one of those sites that makes you scroll through a 900-word life story before giving you the recipe for a Caesar salad. In this article, we’re diving straight into some foundational concepts. They’re the building blocks that make everything else work.
Let’s get into it.
Tokenization
LLMs fundamentally process data as large sets of numbers. But LLMs doing NLP, almost by definition, deal with words, not numbers. So what’s going on here?
This brings us to one of the first critical concepts for understanding how an LLM works: tokenization.
Tokenization is the process of breaking text into smaller units called "tokens" and assigning a token ID (just a number) to each one. A simple example would be assigning a token to each word in the dictionary. In this way, any piece of text can be represented as a sequence of tokens.
When you hear about "token context" or "token response size" in LLMs, it is simply referring to the number of tokens the LLM will consider in your input or include in its response. So, if you request a 512-token response and each token corresponds to one word, you might get roughly a 512-word reply. In practice, tokens vary in size, as we’ll explore in future sections, but that’s the basic idea.
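If it helps to see that in code, here's a minimal sketch of the idea: a word-level tokenizer with a tiny made-up vocabulary. Real tokenizers are far more sophisticated, as we'll see in a moment, but the word-to-number mapping is the core concept.

```python
# Toy word-level tokenizer with a made-up vocabulary, purely for illustration.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "<unk>": 5}

def tokenize(text):
    """Map each word to its token ID, falling back to <unk> for unknown words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The cat sat on the mat"))  # [0, 1, 2, 3, 0, 4]
```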
Byte-Pair Encoding and other Tokenization Methods
In practice, tokens are usually words, character sets, or combinations of words and punctuation. It’s rarely as simple as one character per token or even one word per token. Because what happens when you get a word you don’t recognize? Like some Dr. Seuss “Gluppity-glup” type nonsense? What do you do then, smart guy?
That’s where more advanced tokenization algorithms come in. One of the most popular is Byte-Pair Encoding, or BPE. BPE is an old-school algorithm from 1994 that’s found new life in LLM development. It's a relatively simple algorithm that works by repeatedly replacing the most common contiguous sequences of characters with a placeholder token that points to that sequence in a lookup table.
Let’s say you have the word "walking." A tokenizer might break it down like this:
- Instead of treating "walking" as one whole token, it could split it into smaller chunks like "walk" and "ing."
- So, the tokenizer assigns a token ID to "walk" and another token ID to "ing."
- This way, if the model already knows "walk" and "ing" from other words, it can understand and generate new words it hasn’t seen before by combining tokens.
By operating on subwords rather than full words or phrases, BPE can efficiently generate tokens from any input without requiring prior knowledge of every word. This helps keep the total number of tokens manageable.
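To make that concrete, here's a rough sketch of the core BPE training loop on a tiny made-up corpus. Real implementations (like tiktoken or Hugging Face's tokenizers) work on bytes, handle special tokens, and are heavily optimized, but the merge logic is the same basic idea.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words (each word is a tuple of symbols)."""
    counts = Counter()
    for word, freq in words.items():
        for pair in zip(word, word[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for word, freq in words.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Tiny made-up corpus: each "word" starts as a tuple of characters, with a frequency.
words = {
    tuple("walking"): 10,
    tuple("talking"): 8,
    tuple("jumping"): 7,
    tuple("walked"): 5,
    tuple("walks"): 4,
}

for _ in range(5):  # learn 5 merge rules
    best_pair = get_pair_counts(words).most_common(1)[0][0]
    words = merge_pair(words, best_pair)
    print("merged:", best_pair)

print(list(words))  # "walking" ends up as ('walk', 'ing'), built from learned merges
```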
OpenAI maintains an open-source implementation of BPE called tiktoken, which is widely used by LLM developers. Other tokenization methods include WordPiece (used by BERT) and SentencePiece (used by LLaMA). These are similar to BPE in that they aim to create an efficient set of tokens for a text dataset, but their exact techniques differ slightly. And if you’re thinking BERT is from Sesame Street and LLaMA sounds like one of those fuzzy goat-type animals, that’s totally fine - stick with us for a few more articles. It’ll all start to make sense.
For reference, OpenAI’s GPT-2 has a vocabulary size of 50,257 tokens and a context window of 1,024 tokens. This means it recognizes about 50,000 unique token units from its training data and can process up to 1,024 tokens at once across prompts and responses. (Source: OpenAI GPT-2 paper).
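If you want to poke at tokenization yourself, tiktoken makes it easy. The snippet below uses its GPT-2 encoding to turn a sentence into token IDs and back; the exact IDs and splits you see depend on the encoding, so treat the comments as illustrative.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")  # the GPT-2 BPE vocabulary

token_ids = enc.encode("Gluppity-glup is a real term from The Lorax.")
print(token_ids)                              # a list of integer token IDs
print([enc.decode([t]) for t in token_ids])   # the text chunk behind each ID
print(enc.n_vocab)                            # vocabulary size (50257 for GPT-2)
```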
Fun fact: “Gluppity-glup” is a real term from Dr. Seuss’s The Lorax.
Corpus
A corpus is a large collection of text or other data used to train an LLM. It can include various types of unstructured data such as text documents, speech transcripts, videos, or anything else depending on the model’s intended capabilities. For multimodal models, the corpus may contain some or all of these data types.
Collecting high-quality training data, and a lot of it, is one of the most important steps when preparing to train an LLM. Common sources include datasets like Common Crawl, public code repositories, or maybe even all the text messages you’ve ever sent that might be sitting on some NSA server somewhere, if you’re into that kind of thing.
Cleaning this data is crucial for building a good model. As the saying goes, garbage in, garbage out. It’s important to remove incorrect, toxic, low-quality, or irrelevant content (so maybe like most of Reddit). Some models also exclude content generated by other LLMs to reduce the risk of amplifying errors or biases.
Word Embeddings
Word embeddings are vector representations of words in a high-dimensional space, where semantically similar words are positioned closer together.
Now, that sentence is a doozy. I had to just get it out there, but don’t worry, we’re going to walk it back a bit. That’s my main gripe with a lot of books and articles on LLMs. They just drop these dense sentences on you and then cruise right past while your eyes glaze over and you start wondering if you should’ve just become a real estate agent or something.
Anyway, back to word embeddings. Let's break it down.
Embedding is the process of converting a word into a vector, which is just an ordered list of numbers. Each number in the list represents a dimension, and each dimension captures some aspect of the word’s meaning.
But here’s the kicker: we don’t manually define what each dimension means. There’s no lookup table saying, "dimension 48 is related to color." The model figures all that out on its own during training by looking at how words show up in context across the corpora.
So even though individual numbers might not make much sense on their own, the whole vector ends up carrying enough information for the model to understand relationships between words - things like similarity, analogies, and usage patterns.
We call it a "high-dimensional space" because these vectors can have hundreds (potentially thousands) of dimensions. More dimensions usually means better representation and potentially more accurate results, like sharper sentiment analysis, but it also means more training time and more computational overhead. Also, cranking up the dimensionality too much can actually hurt performance in some cases by causing the model to overfit the training data, which makes it less effective on new, unseen text.
The goal of word embedding vectors is to help LLMs understand the meaning of words based on their context. Instead of treating words as isolated symbols, embeddings let the model capture relationships and similarities between them: each word gets a spot in a high-dimensional vector space, and similar words end up close together.
This enables models to do some surprisingly intuitive things, like analogies. For example: "Paris is to France as Tokyo is to Japan."
This kind of analogy can be expressed using simple vector arithmetic. If you take the vector for Paris, subtract the vector for France, and add the vector for Japan, you land close to the vector for Tokyo. The model captures the relationship between capital cities and their countries. Pretty terrifying, right?
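Here's a toy version of that arithmetic. The vectors below are made up (and only three dimensions) purely to illustrate the idea; real embeddings are learned from data and have hundreds of dimensions.

```python
import numpy as np

# Made-up 3-dimensional "embeddings" purely for illustration.
emb = {
    "paris":  np.array([0.9, 0.1, 0.8]),
    "france": np.array([0.1, 0.1, 0.8]),
    "tokyo":  np.array([0.9, 0.9, 0.2]),
    "japan":  np.array([0.1, 0.9, 0.2]),
}

def closest(vec, skip=()):
    """Return the word whose embedding has the highest cosine similarity to `vec`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in skip), key=lambda w: cos(emb[w], vec))

query = emb["paris"] - emb["france"] + emb["japan"]
print(closest(query, skip={"paris", "france", "japan"}))  # tokyo
```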
Further reading suggestion: Deep Learning for Natural Language Processing.
Bag of Words, Word2Vec, and Other Techniques
Embedding techniques laid the foundation for LLMs by showing how language could be represented numerically in a way that captures meaning. Some popular techniques that you might want to be aware of are Bag of Words and Word2Vec.
Bag of Words is a simple model that creates vector representations by counting how often each word appears in a document, ignoring grammar and word order. It is easy to implement but doesn’t capture word meaning or context.
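Here's a quick Bag of Words sketch using scikit-learn's CountVectorizer. Notice that the first two documents get identical vectors even though they mean different things; that's the word-order blindness in action.

```python
from sklearn.feature_extraction.text import CountVectorizer  # pip install scikit-learn

docs = [
    "the dog chased the cat",
    "the cat chased the dog",   # same word counts, different meaning
    "the dog slept",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)    # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # the vocabulary, one column per word
print(counts.toarray())                    # per-document word counts
```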
Word2Vec is a more complex model developed at Google in 2013. It creates dense vector representations by training a neural network (recall that a neural network is just a bunch of weighted mathematical functions designed to find patterns in data) to predict a word based on its surrounding context, or predict the context from a given word. This results in embeddings where words with similar meanings are close together in the vector space.
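And here's roughly what training Word2Vec looks like with the gensim library, on a comically small made-up corpus. Real training uses millions of sentences, so don't expect meaningful similarities out of this one; it's just to show the shape of the API.

```python
from gensim.models import Word2Vec  # pip install gensim

# Tiny made-up corpus: each sentence is a list of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["the", "cat", "chased", "the", "dog"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensions of each word vector
    window=2,         # how many neighboring words count as "context"
    min_count=1,      # keep even rare words (only sensible on a toy corpus)
    sg=1,             # 1 = skip-gram (predict context from word); 0 = CBOW
)

print(model.wv["cat"].shape)              # (50,)
print(model.wv.most_similar("cat", topn=2))
```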
There are other techniques, like GloVe, developed at Stanford in 2014, but we ain't got time for all that here.
Simple LLMs might use Bag of Words or Word2Vec directly, often with pretrained embedding values. Many of today’s larger models use similar ideas — dense vector representations that capture meaning — but they typically start with random embeddings and learn everything from scratch. As training progresses, these embeddings are gradually adjusted, allowing the model to build its own custom vector space tuned to the data it sees.
Positional Embeddings
In principle, these simple embedding vectors could serve as inputs to a language model. But they represent words without capturing their order in a sentence, which limits how well the model can understand natural language.
Modern LLMs improve on this by adding something called "position-aware embeddings." The goal is to encode both the identity of a word and its position in the sequence within the embedding vector. This lets the model distinguish between sentences that contain the same words but mean different things because of word order.
There are two main types of position-aware embeddings:
- Absolute positional embeddings: Each position in the input sequence is assigned a fixed embedding vector. This vector is added to the word embedding before being passed into the model.
- Relative positional embeddings: Instead of assigning a fixed vector to each position, the model learns how the positions of words relate to one another. This can help the model generalize better across varying sequence lengths.
These techniques help the model understand the structure of a sentence and not just the individual words, which improves performance across many language tasks.
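As a concrete example of the absolute flavor, here's a sketch of the sinusoidal positional encoding popularized by the original Transformer paper: each position gets a fixed vector of sines and cosines, which is simply added to the word embedding. The word embeddings here are random stand-ins just to show the shapes.

```python
import numpy as np

def sinusoidal_positions(seq_len, dim):
    """Classic sinusoidal absolute positional encodings (one fixed vector per position)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    div = 10000 ** (np.arange(0, dim, 2) / dim)    # one frequency per pair of dimensions
    enc = np.zeros((seq_len, dim))
    enc[:, 0::2] = np.sin(positions / div)         # even dimensions get sines
    enc[:, 1::2] = np.cos(positions / div)         # odd dimensions get cosines
    return enc

seq_len, dim = 8, 16
word_embeddings = np.random.randn(seq_len, dim)    # stand-in for learned token embeddings
model_input = word_embeddings + sinusoidal_positions(seq_len, dim)
print(model_input.shape)  # (8, 16): same shape, now position-aware
```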
Conclusion
At this point in our (theoretical) process of building an LLM, we have gathered a corpus—a large collection of training data consisting of text files from various sources. We’ve cleaned and filtered this data, removing stray characters, fixing encoding issues, and eliminating low-quality content that we don’t want our model to learn from.
Next, we developed a tokenization algorithm and applied it to the training data. This process gave us a way to convert text into tokens, mapping sequences of letters to corresponding numerical representations.
Finally, we saw how word embeddings can transform these tokens into high-dimensional vectors. These embeddings help the model understand the meaning of words and the relationships between them based on context, which is essential for generating coherent and relevant responses.
Good so far? If you’re with me, let’s move on to Part 3 here. If you need a quick breather, check out some of our articles that make fun of project managers.

The team at /dev/null Digest is dedicated to offering lighthearted commentary and insights into the world of software development. Have opinions to share? Want to write your own articles? We’re always accepting new submissions, so feel free to contact us.