Database systems are high-performance engines designed to retrieve vast amounts of information in milliseconds. To understand how they achieve this speed, we must look past the SQL queries and delve into how data is physically laid out on disk hardware.
At the lowest level, a database does not read or write individual rows; it operates on pages (or blocks), fixed-size chunks of storage typically ranging from 4 KB to 16 KB (PostgreSQL defaults to 8 KB pages, MySQL's InnoDB to 16 KB). When the database engine needs to read a specific record, it fetches the entire page containing that record from disk into memory.
A page is laid out as a header, a slot directory, and the data records themselves. The header stores metadata such as the page ID, checksums for integrity, and pointers to the previous and next pages. The slot directory is the secret sauce for variable-length records: instead of scanning for a record, the system looks up its slot in the directory, which provides an offset pointing to the exact byte where the record begins within the page. This indirection allows the database to defragment records inside a page without changing their logical identifiers (the slot numbers).
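To make the indirection concrete, here is a minimal sketch of a slotted page in Python. The layout is a toy assumption (a 16-byte header followed by a directory of 2-byte offset/length pairs, records packed from the end of the page); real engines differ in the details, but the lookup-through-the-directory step is the same.

```python
import struct

PAGE_SIZE = 8192  # hypothetical 8 KB page

def read_record(page: bytes, slot: int) -> bytes:
    """Fetch a record via the slot directory of a slotted page.

    Assumed toy layout: a 16-byte header, then a slot directory of
    (offset, length) pairs, each encoded as two 2-byte unsigned ints.
    """
    header_size = 16
    offset, length = struct.unpack_from("<HH", page, header_size + slot * 4)
    return page[offset:offset + length]

# Build a toy page: record bytes packed at the end, directory at the front.
page = bytearray(PAGE_SIZE)
record = b"hello-row"
offset = PAGE_SIZE - len(record)
page[offset:] = record
struct.pack_into("<HH", page, 16, offset, len(record))  # slot 0 entry

print(read_record(bytes(page), 0))  # b'hello-row'
```

Because readers only ever hold slot numbers, the engine is free to slide `hello-row` elsewhere in the page and rewrite only the directory entry.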
Note: Because disk input/output is the most expensive operation in a database, the goal of physical design is always to minimize the number of page reads required to satisfy a query.
Since moving data from a physical disk (HDD or SSD) to main memory (RAM) is glacially slow compared to CPU cycles, databases use a buffer pool. This is a dedicated portion of memory that acts as a cache for disk pages.
When a query requests data, the buffer manager first checks if the page is already in the buffer pool (a "cache hit"). If it is not (a "cache miss"), it reads the page from disk and places it into the pool. To keep the pool efficient, the manager must decide which pages to evict when the memory is full. A common policy is LRU (Least Recently Used), which discards the page that hasn't been accessed for the longest time. However, modern databases often use variations like Clock Sweep or LRU-K to handle "sequential scan" scenarios where a single large read would otherwise flush all frequently used pages from the cache.
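The hit/miss/evict cycle above can be sketched as a tiny LRU buffer pool. This is an illustrative toy, not any particular engine's implementation; `read_page_from_disk` is a stand-in callback for a real disk read.

```python
from collections import OrderedDict

class BufferPool:
    """Minimal LRU buffer pool sketch."""
    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk
        self.pages = OrderedDict()  # page_id -> page, oldest first
        self.hits = self.misses = 0

    def get_page(self, page_id):
        if page_id in self.pages:           # cache hit
            self.hits += 1
            self.pages.move_to_end(page_id) # mark as most recently used
            return self.pages[page_id]
        self.misses += 1                    # cache miss: go to disk
        page = self.read_page_from_disk(page_id)
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used
        return page

pool = BufferPool(capacity=2, read_page_from_disk=lambda pid: f"page-{pid}")
pool.get_page(1); pool.get_page(2)
pool.get_page(1)            # hit: page 1 becomes most recently used
pool.get_page(3)            # miss: evicts page 2, the LRU victim
print(pool.hits, pool.misses)  # 1 3
```

Note how the hit on page 1 saved it from eviction: that recency signal is exactly what a sequential scan pollutes, which is why production systems reach for Clock Sweep or LRU-K variants.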
While a file might store data in an unordered heap, we need a way to find records quickly without scanning the entire file. The B+ Tree is the standard data structure for this.
A B+ Tree is a balanced, multi-level tree structure where all data is stored in the leaf nodes, while internal nodes only contain keys and child pointers. This structure ensures that no matter which record you look for, the height of the tree remains shallow—usually 3 or 4 levels—meaning that finding any record in a table of millions of rows takes only a few disk fetches.
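The "shallow tree" claim is just arithmetic: if each internal page holds roughly f keys (the fanout), a tree of height h can index about f^h records. A quick sketch, assuming a hypothetical fanout of 300 keys per page:

```python
import math

def btree_height(num_records: int, fanout: int) -> int:
    """Smallest height h with fanout**h >= num_records (illustrative)."""
    return math.ceil(math.log(num_records, fanout))

# With ~300 keys per internal page:
print(btree_height(10_000_000, 300))     # 3 levels for ten million rows
print(btree_height(1_000_000_000, 300))  # 4 levels for a billion rows
```

And since the root and upper levels are almost always cached in the buffer pool, a point lookup often costs only one or two actual disk fetches.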
Sometimes, we don't know if a record exists in a file, and checking the B+ Tree might still require a disk read. A Bloom Filter is a space-efficient probabilistic data structure that answers the question: "Is this key definitely NOT in the set?"
It works by using multiple hash functions. When a key is inserted, the filter sets specific bits in a bit array to 1 based on the hash outputs. To check for a key, the system hashes it again and checks the corresponding bits. If any bit is 0, the record definitely doesn't exist. If all are 1, it might exist. This allows the database to "skip" disk reads entirely if the Bloom Filter returns a "no."
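A minimal sketch of that mechanism, using SHA-256 with a per-function salt purely for convenience (real systems use faster non-cryptographic hashes such as MurmurHash, and the sizes here are arbitrary):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k salted hashes over a fixed-size bit array."""
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per "bit", for clarity

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # Any zero bit -> definitely absent; all ones -> possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True: key was inserted
print(bf.might_contain("user:99"))  # almost certainly False -> skip the disk read
```

The one-sided guarantee is the point: a "no" is always trustworthy, while a "yes" merely sends you on to the B+ Tree or SSTable, where an occasional false positive costs only a wasted read.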
Traditional B+ Trees struggle with heavy write workloads because they require frequent updates to the tree structure, which can cause random I/O. LSM Trees (Log-Structured Merge-Trees) optimize for writes by buffering them in memory (MemTable) and flushing them to disk as sorted, immutable files called SSTables (Sorted String Tables).
When you update data, you don't overwrite the old version; you simply write the new version into the MemTable, and the stale copies linger in older SSTables. Periodically, a background process called compaction merges these files, removing duplicates and outdated versions. This turns random write patterns into sequential writes, which are significantly faster on both HDDs and modern SSD storage.
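The whole write path can be condensed into a toy sketch: a dict serves as the MemTable, sorted lists stand in for on-disk SSTables, and `compact` merges runs while keeping only the newest version of each key. Flushing thresholds and file formats here are made up for illustration.

```python
class TinyLSM:
    """Minimal LSM-tree sketch: memtable, sorted immutable runs, compaction."""
    def __init__(self, memtable_limit: int = 4):
        self.memtable = {}       # recent writes, mutable, in memory
        self.sstables = []       # sorted (key, value) runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value       # updates never touch old files
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())  # one sequential, sorted write
        self.sstables.append(run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        merged = {}
        for run in self.sstables:   # oldest first, so newer versions overwrite
            merged.update(run)
        self.sstables = [sorted(merged.items())]

db = TinyLSM(memtable_limit=2)
db.put("a", 1); db.put("b", 2)   # triggers flush of run #1
db.put("a", 9); db.put("c", 3)   # triggers flush of run #2 (new "a")
db.compact()
print(db.get("a"))  # 9: compaction kept only the newest version
```

Real engines add write-ahead logging, tombstones for deletes, and per-SSTable Bloom filters so that `get` can skip runs that cannot contain the key, which is exactly where the previous section's filter earns its keep.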