Large Language Models (LLMs) have transformed how we interact with technology, but their inner workings are often shrouded in mystery. In this lesson, we will peel back the layers to understand how these systems transform raw text into coherent, human-like responses by mastering the mechanics of tokens and probability.
Before an AI can "read," it must convert human language into numerical data. This process is called tokenization. Instead of reading words letter-by-letter, models break text into chunks called tokens. A token can be a single word, a part of a word, or even a single character.
Think of a token as the "atomic unit" of an LLM’s vocabulary. The model assigns a unique numerical identifier to every token it knows. For example, the word "chatbot" might be split into two sub-word tokens ("chat" + "bot") and represented as [124, 8932]. The reason for breaking words into sub-words is efficiency; it allows the model to handle rare words or complex prefixes and suffixes by combining known tokens, rather than needing an impossibly large dictionary.
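The idea can be sketched with a greedy longest-match tokenizer. The vocabulary and token IDs below are invented for illustration; real tokenizers such as BPE or WordPiece learn their vocabularies from data and use more sophisticated merging rules.

```python
# Toy vocabulary mapping sub-words to made-up numerical IDs.
TOY_VOCAB = {"chat": 124, "bot": 8932, "un": 55, "friend": 901, "ly": 203}

def tokenize(text):
    """Split text into the longest known sub-words, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, shrinking until one fits.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in TOY_VOCAB:
                tokens.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

print(tokenize("chatbot"))     # [124, 8932]
print(tokenize("unfriendly"))  # [55, 901, 203]
```

Note how "unfriendly" never needs its own dictionary entry: the prefix, stem, and suffix combine from known pieces, which is exactly why sub-word vocabularies stay compact.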
The true "brain" of the modern LLM is the Transformer, an architecture that utilizes a mechanism called Self-Attention. Earlier architectures, such as recurrent networks, processed text sequentially, making it hard to retain information from the beginning of a long sentence. Self-Attention changes this by allowing the model to weigh the importance of every word in a sequence relative to every other word simultaneously.
If you have the sentence, "The bank was closed because it had no money," the model needs to know that "it" refers to "bank" and not "money." Through Self-Attention, the model creates a relationship mapping between tokens. It calculates a "score" for how much focus a specific token should pay to its neighbors. By looking at the entire context at once, the model builds a multidimensional representation—or embedding—of the word, which captures its meaning based on its surroundings.
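A minimal sketch of these attention scores is scaled dot-product attention. The embeddings and weight matrices below are random stand-ins (a trained model learns them); the point is the mechanics: every token scores every other token, and a softmax turns the scores into weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X, W_q, W_k, W_v):
    """Each row of X is a token embedding; every token attends to all tokens."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # raw attention "scores"
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per token
    return weights @ V, weights

d = 4                          # embedding dimension (toy-sized)
X = rng.normal(size=(3, d))    # 3 tokens, e.g. "bank", "it", "money"
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
# Each row of `weights` sums to 1: how much that token attends to each neighbor.
```

In a trained model, the row for "it" would place a large weight on "bank", which is how the contextual embedding for "it" ends up carrying the meaning of its referent.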
At their core, LLMs are essentially sophisticated predictive engines. They do not "know" facts in the way humans do; instead, they calculate the probability distribution of the next token in a sequence.
Given a sequence of tokens $w_1, w_2, \dots, w_{t-1}$, the model aims to find the probability of the next token $w_t$ using a softmax function:

$$P(w_t = i \mid w_1, \dots, w_{t-1}) = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}$$

In this formula, $V$ represents the size of the vocabulary, and $z_i$ represents the raw score (or logit) the model assigned to each possible next word. The model picks the next token based on these probabilities, though it often uses temperature—a hyperparameter—to control randomness. A low temperature makes the model conservative and deterministic, while a high temperature makes it more creative and unpredictable.
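The softmax and the temperature knob can be implemented in a few lines. The logits below are made-up scores for a tiny four-word vocabulary; dividing them by the temperature before the softmax is the standard trick for controlling randomness.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract the max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]             # model's raw scores per token
cold = softmax_with_temperature(logits, temperature=0.2)
hot = softmax_with_temperature(logits, temperature=2.0)
# Low temperature sharpens the distribution toward the top-scoring token;
# high temperature flattens it toward uniform, making sampling less predictable.
```

Run it and compare: the top token's probability is far higher in `cold` than in `hot`, even though the ranking of tokens never changes.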
The immense capability of LLMs comes from the pre-training phase. During this time, the model is fed massive datasets (books, websites, code) and tasked with a simple game: hide the next word in a sentence and ask the model to guess it. Every time the model guesses wrong, it uses backpropagation to adjust its internal weights.
By performing this billions of times, the model develops an internal map of how concepts relate to one another. This is where emergent properties arise—the model hasn't been programmed to understand logic or code, but through the process of accurately predicting patterns in data, it has no choice but to learn the rules of syntax, logic, and mathematics to succeed.
Note: While this training leads to high performance, it is also the origin of hallucinations, where the model confidently predicts the next word based on patterns even if the resulting sentence is factually incorrect.