Large Language Models (LLMs) have transformed how we interact with technology, but their inner workings are often shrouded in mystery. In this lesson, we will peel back the layers to understand how these systems transform raw text into coherent, human-like responses by mastering the mechanics of tokens and probability.
Before an AI can "read," it must convert human language into numerical data. This process is called tokenization. Instead of reading words letter-by-letter, models break text into chunks called tokens. A token can be a single word, a part of a word, or even a single character.
Think of a token as the "atomic unit" of an LLM’s vocabulary. The model assigns a unique numerical identifier to every token it knows. For example, the word "chatbot" might be split into two sub-word tokens ("chat" + "bot") and represented as [124, 8932]. The reason for breaking words into sub-words is efficiency; it allows the model to handle rare words or complex prefixes and suffixes by combining known tokens, rather than needing an impossibly large dictionary.
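The idea can be sketched with a greedy longest-match tokenizer. The vocabulary and token IDs below are invented for illustration; real tokenizers such as BPE or WordPiece learn their vocabularies from data and use more sophisticated merging rules.

```python
# Toy vocabulary mapping sub-words to made-up numerical IDs.
TOY_VOCAB = {"chat": 124, "bot": 8932, "un": 55, "friend": 901, "ly": 203}

def tokenize(text):
    """Split text into the longest known sub-words, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, shrinking until one fits.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in TOY_VOCAB:
                tokens.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

print(tokenize("chatbot"))     # [124, 8932]
print(tokenize("unfriendly"))  # [55, 901, 203]
```

Note how "unfriendly" never needs its own dictionary entry: the prefix, stem, and suffix combine from known pieces, which is exactly why sub-word vocabularies stay compact.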
The true "brain" of the modern LLM is the Transformer, an architecture that utilizes a mechanism called Self-Attention. Earlier architectures, such as recurrent networks, processed text sequentially, making it hard to retain information from the beginning of a long sentence. Self-Attention changes this by allowing the model to weigh the importance of every word in a sequence relative to every other word simultaneously.
If you have the sentence, "The bank was closed because it had no money," the model needs to know that "it" refers to "bank" and not "money." Through Self-Attention, the model creates a relationship mapping between tokens. It calculates a "score" for how much focus a specific token should pay to its neighbors. By looking at the entire context at once, the model builds a multidimensional representation—or embedding—of the word, which captures its meaning based on its surroundings.
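A minimal sketch of these attention scores is scaled dot-product attention. The embeddings and weight matrices below are random stand-ins (a trained model learns them); the point is the mechanics: every token scores every other token, and a softmax turns the scores into weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X, W_q, W_k, W_v):
    """Each row of X is a token embedding; every token attends to all tokens."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # raw attention "scores"
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per token
    return weights @ V, weights

d = 4                          # embedding dimension (toy-sized)
X = rng.normal(size=(3, d))    # 3 tokens, e.g. "bank", "it", "money"
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
# Each row of `weights` sums to 1: how much that token attends to each neighbor.
```

In a trained model, the row for "it" would place a large weight on "bank", which is how the contextual embedding for "it" ends up carrying the meaning of its referent.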
At their core, LLMs are essentially sophisticated predictive engines. They do not "know" facts in the way humans do; instead, they calculate the probability distribution of the next token in a sequence.
Given a sequence of tokens $w_1, w_2, \dots, w_{t-1}$, the model aims to find the probability of the next token $w_t$ using a softmax function:

$$P(w_t = i \mid w_1, \dots, w_{t-1}) = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}$$

In this formula, $V$ represents the size of the vocabulary, and $z_i$ represents the raw score (or logit) the model assigned to each possible next word. The model picks the next token based on these probabilities, though it often uses temperature—a hyperparameter—to control randomness. A low temperature makes the model conservative and deterministic, while a high temperature makes it more creative and unpredictable.
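The softmax and the temperature knob can be implemented in a few lines. The logits below are made-up scores for a tiny four-word vocabulary; dividing them by the temperature before the softmax is the standard trick for controlling randomness.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract the max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]             # model's raw scores per token
cold = softmax_with_temperature(logits, temperature=0.2)
hot = softmax_with_temperature(logits, temperature=2.0)
# Low temperature sharpens the distribution toward the top-scoring token;
# high temperature flattens it toward uniform, making sampling less predictable.
```

Run it and compare: the top token's probability is far higher in `cold` than in `hot`, even though the ranking of tokens never changes.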
The immense capability of LLMs comes from the pre-training phase. During this time, the model is fed massive datasets (books, websites, code) and tasked with a simple game: hide the next word in a sentence and ask the model to guess it. Every time the model guesses wrong, it uses backpropagation to adjust its internal weights.
By performing this billions of times, the model develops an internal map of how concepts relate to one another. This is where emergent properties arise—the model hasn't been programmed to understand logic or code, but through the process of accurately predicting patterns in data, it has no choice but to learn the rules of syntax, logic, and mathematics to succeed.
Note: While this training leads to high performance, it is also the origin of hallucinations, where the model confidently predicts the next word based on patterns even if the resulting sentence is factually incorrect.