Database systems are high-performance engines designed to retrieve vast amounts of information in milliseconds. To understand how they achieve this speed, we must look past the SQL queries and delve into how data is physically laid out on disk hardware.
At the lowest level, a database does not read or write individual rows; it operates on pages (or blocks), fixed-size chunks of storage typically ranging from 4 KB to 16 KB (PostgreSQL defaults to 8 KB pages, MySQL's InnoDB to 16 KB). When the database engine needs to read a specific record, it fetches the entire page containing that record from disk into memory.
A page is laid out as a header, a slot directory, and the data records themselves. The header stores metadata such as the page ID, checksums for integrity, and pointers to the previous and next pages. The slot directory is the secret sauce for variable-length records: instead of scanning for a record, the system looks up its slot in the directory, which provides an offset pointing to the exact byte where the record begins within the page. This indirection allows the database to defragment records inside a page without changing their logical identifiers (the slot numbers).
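To make the indirection concrete, here is a minimal sketch of a slotted page in Python. The layout is a toy assumption (a 16-byte header followed by a directory of 2-byte offset/length pairs, records packed from the end of the page); real engines differ in the details, but the lookup-through-the-directory step is the same.

```python
import struct

PAGE_SIZE = 8192  # hypothetical 8 KB page

def read_record(page: bytes, slot: int) -> bytes:
    """Fetch a record via the slot directory of a slotted page.

    Assumed toy layout: a 16-byte header, then a slot directory of
    (offset, length) pairs, each encoded as two 2-byte unsigned ints.
    """
    header_size = 16
    offset, length = struct.unpack_from("<HH", page, header_size + slot * 4)
    return page[offset:offset + length]

# Build a toy page: record bytes packed at the end, directory at the front.
page = bytearray(PAGE_SIZE)
record = b"hello-row"
offset = PAGE_SIZE - len(record)
page[offset:] = record
struct.pack_into("<HH", page, 16, offset, len(record))  # slot 0 entry

print(read_record(bytes(page), 0))  # b'hello-row'
```

Because readers only ever hold slot numbers, the engine is free to slide `hello-row` elsewhere in the page and rewrite only the directory entry.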
Note: Because disk input/output is the most expensive operation in a database, the goal of physical design is always to minimize the number of page reads required to satisfy a query.
Since moving data from a physical disk (HDD or SSD) to main memory (RAM) is glacially slow compared to CPU cycles, databases use a buffer pool. This is a dedicated portion of memory that acts as a cache for disk pages.
When a query requests data, the buffer manager first checks if the page is already in the buffer pool (a "cache hit"). If it is not (a "cache miss"), it reads the page from disk and places it into the pool. To keep the pool efficient, the manager must decide which pages to evict when the memory is full. A common policy is LRU (Least Recently Used), which discards the page that hasn't been accessed for the longest time. However, modern databases often use variations like Clock Sweep or LRU-K to handle "sequential scan" scenarios where a single large read would otherwise flush all frequently used pages from the cache.
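The hit/miss/evict cycle above can be sketched as a tiny LRU buffer pool. This is an illustrative toy, not any particular engine's implementation; `read_page_from_disk` is a stand-in callback for a real disk read.

```python
from collections import OrderedDict

class BufferPool:
    """Minimal LRU buffer pool sketch."""
    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk
        self.pages = OrderedDict()  # page_id -> page, oldest first
        self.hits = self.misses = 0

    def get_page(self, page_id):
        if page_id in self.pages:           # cache hit
            self.hits += 1
            self.pages.move_to_end(page_id) # mark as most recently used
            return self.pages[page_id]
        self.misses += 1                    # cache miss: go to disk
        page = self.read_page_from_disk(page_id)
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used
        return page

pool = BufferPool(capacity=2, read_page_from_disk=lambda pid: f"page-{pid}")
pool.get_page(1); pool.get_page(2)
pool.get_page(1)            # hit: page 1 becomes most recently used
pool.get_page(3)            # miss: evicts page 2, the LRU victim
print(pool.hits, pool.misses)  # 1 3
```

Note how the hit on page 1 saved it from eviction: that recency signal is exactly what a sequential scan pollutes, which is why production systems reach for Clock Sweep or LRU-K variants.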
While a file might store data in an unordered heap, we need a way to find records quickly without scanning the entire file. The B+ Tree is the standard data structure for this.
A B+ Tree is a balanced, multi-level tree structure where all data is stored in the leaf nodes, while internal nodes only contain keys and child pointers. This structure ensures that no matter which record you look for, the height of the tree remains shallow—usually 3 or 4 levels—meaning that finding any record in a table of millions of rows takes only a few disk fetches.
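The "shallow tree" claim is just arithmetic: if each internal page holds roughly f keys (the fanout), a tree of height h can index about f^h records. A quick sketch, assuming a hypothetical fanout of 300 keys per page:

```python
import math

def btree_height(num_records: int, fanout: int) -> int:
    """Smallest height h with fanout**h >= num_records (illustrative)."""
    return math.ceil(math.log(num_records, fanout))

# With ~300 keys per internal page:
print(btree_height(10_000_000, 300))     # 3 levels for ten million rows
print(btree_height(1_000_000_000, 300))  # 4 levels for a billion rows
```

And since the root and upper levels are almost always cached in the buffer pool, a point lookup often costs only one or two actual disk fetches.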
Sometimes, we don't know if a record exists in a file, and checking the B+ Tree might still require a disk read. A Bloom Filter is a space-efficient probabilistic data structure that answers the question: "Is this key definitely NOT in the set?"
It works by using multiple hash functions. When a key is inserted, the filter sets specific bits in a bit array to 1 based on the hash outputs. To check for a key, the system hashes it again and checks the corresponding bits. If any bit is 0, the record definitely doesn't exist. If all are 1, it might exist. This allows the database to "skip" disk reads entirely if the Bloom Filter returns a "no."
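A minimal sketch of that mechanism, using SHA-256 with a per-function salt purely for convenience (real systems use faster non-cryptographic hashes such as MurmurHash, and the sizes here are arbitrary):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k salted hashes over a fixed-size bit array."""
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per "bit", for clarity

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # Any zero bit -> definitely absent; all ones -> possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True: key was inserted
print(bf.might_contain("user:99"))  # almost certainly False -> skip the disk read
```

The one-sided guarantee is the point: a "no" is always trustworthy, while a "yes" merely sends you on to the B+ Tree or SSTable, where an occasional false positive costs only a wasted read.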
Traditional B+ Trees struggle with heavy write workloads because they require frequent updates to the tree structure, which can cause random I/O. LSM Trees (Log-Structured Merge-Trees) optimize for writes by buffering them in memory (MemTable) and flushing them to disk as sorted, immutable files called SSTables (Sorted String Tables).
When you update data, you don't overwrite the old version; you simply write the new version into the MemTable, and the stale copies linger in older SSTables. Periodically, a background process called compaction merges these files, removing duplicates and outdated versions. This turns random write patterns into sequential writes, which are significantly faster on both HDDs and modern SSD storage.
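The whole write path can be condensed into a toy sketch: a dict serves as the MemTable, sorted lists stand in for on-disk SSTables, and `compact` merges runs while keeping only the newest version of each key. Flushing thresholds and file formats here are made up for illustration.

```python
class TinyLSM:
    """Minimal LSM-tree sketch: memtable, sorted immutable runs, compaction."""
    def __init__(self, memtable_limit: int = 4):
        self.memtable = {}       # recent writes, mutable, in memory
        self.sstables = []       # sorted (key, value) runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value       # updates never touch old files
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())  # one sequential, sorted write
        self.sstables.append(run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        merged = {}
        for run in self.sstables:   # oldest first, so newer versions overwrite
            merged.update(run)
        self.sstables = [sorted(merged.items())]

db = TinyLSM(memtable_limit=2)
db.put("a", 1); db.put("b", 2)   # triggers flush of run #1
db.put("a", 9); db.put("c", 3)   # triggers flush of run #2 (new "a")
db.compact()
print(db.get("a"))  # 9: compaction kept only the newest version
```

Real engines add write-ahead logging, tombstones for deletes, and per-SSTable Bloom filters so that `get` can skip runs that cannot contain the key, which is exactly where the previous section's filter earns its keep.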