In the world of database management, reading data from a hard drive is significantly slower than processing data in memory. This lesson explores B+ Trees, the structural backbone that allows databases to retrieve specific records in logarithmic time without scanning every single row in a table.
To understand why we use B+ Trees, we must first look at why simple Binary Search Trees (BST) fail in database environments. In a traditional BST, each node contains one key and two child pointers. If you have millions of records, the tree becomes incredibly deep. Because reading from disk is expensive, every time you "jump" to a child node to traverse the tree, you potentially trigger a new disk I/O operation.
If a tree has a height of , performing a search requires disk accesses. For a balanced BST with elements, . If , you might visit 20 nodes, meaning 20 separate disk reads. To minimize these expensive reads, we need a tree with a high fan-outβmeaning each node should hold as many pointers as possible to flatten the tree structure. This is where B-Trees and subsequently B+ Trees come into play.
A B+ Tree differs from a standard B-Tree in two critical ways: all real data (or pointers to data) is stored exclusively in the leaf nodes, and all leaf nodes are linked together in a doubly-linked list. Internal nodes only store keys to act as "signposts" or navigation guides to help the search algorithm steer the path correctly.
By keeping internal nodes small and containing only keys, we can pack hundreds of keys into a single disk block. If a node holds 100 keys, the tree's fan-out is 101. Even with millions of records, a B+ Tree rarely exceeds a height of 3 or 4. This means almost any record can be found in just 3 or 4 disk reads.
The genius of the B+ Tree design shines during range queries, such as SELECT * FROM Orders WHERE Date BETWEEN '2023-01-01' AND '2023-01-31'. Because all leaf nodes are linked in a sorted linked list, the database performs a standard search to find the entry for '2023-01-01'. Once it arrives at that leaf node, it doesn't need to head back up the tree or perform new searches; it simply follows the leaf-level pointers horizontally until it reaches the record for '2023-01-31'.
Note: This is why professional database administrators often recommend clustered indexes on columns frequently used for ranges. It keeps the physical data ordered on disk in alignment with the leaf node order.
To keep the retrieval time at , the B+ Tree must remain balanced. When you insert a record and a leaf node becomes full, the system performs a node split. The middle key is promoted to the parent node. If the parent is also full, the split propagates upward.
Similarly, if a deletion causes a node to fall below a minimum occupancy threshold (usually 50%), the tree performs a rebalance by borrowing keys from a sibling or merging two nodes. While this makes insertion and deletion slightly slower than search-only structures, it guarantees that the tree never grows unevenly, ensuring the predictable performance that databases demand.