In high-performance database systems, the structure of your data index determines how fast you can ingest records versus how fast you can query them. Today, we will explore the Log-Structured Merge-Tree (LSM-Tree), a data structure designed to transform random write operations into sequential ones, solving the bottleneck inherent in traditional disk-based structures.
Traditional databases often rely on B+ trees, which are excellent for range queries because they keep data in a sorted, balanced tree structure. However, B+ trees require frequent in-place updates. When you update a record in a B+ tree, the database must locate the specific disk page, load it, modify it, and write it back, a process that creates high random I/O overhead.
Conversely, an LSM-tree splits data management into two main components: an in-memory structure called a memtable and a series of immutable, sorted files on disk known as SSTables (Sorted String Tables). When a write occurs, the database simply appends the entry to the memtable and writes it to a write-ahead log (WAL) on disk for durability. Because memory is fast and the log is sequential, write latency is far lower than the random seeks required by a B+ tree. Once the memtable reaches a size threshold, it is flushed to disk as a new, immutable SSTable.
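The write path above can be sketched in a few lines. This is a minimal illustration, not a production engine: the class and method names are hypothetical, and real systems keep the memtable in a skip list or balanced tree rather than a plain dict.

```python
import json

class Memtable:
    """Toy in-memory write buffer with a write-ahead log (illustrative only)."""

    def __init__(self, wal_path, flush_threshold=4):
        self.entries = {}                  # key -> value, held in memory
        self.flush_threshold = flush_threshold
        self.wal = open(wal_path, "a")     # append-only log: sequential disk I/O

    def put(self, key, value):
        # 1. Append to the WAL first so the write survives a crash.
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        # 2. Update the in-memory table (fast, no random disk seeks).
        self.entries[key] = value
        # Signal the caller when the memtable is full and should be flushed.
        return len(self.entries) >= self.flush_threshold

    def flush_to_sstable(self, path):
        # Emit keys in sorted order: the resulting file is an immutable SSTable.
        with open(path, "w") as f:
            for key in sorted(self.entries):
                f.write(json.dumps({"k": key, "v": self.entries[key]}) + "\n")
        self.entries.clear()
```

Note that every disk operation here, both the WAL append and the SSTable flush, is sequential, which is precisely the property that makes LSM writes cheap.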
Since LSM-trees create multiple independent SSTables during the flush process, the system can eventually contain many files containing overlapping keys. To prevent searches from scanning every file, the system performs a background process called compaction. During compaction, the database merges multiple sorted SSTables into a single, newer file, removing obsolete data (like deleted entries or older versions of updated keys).
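Because every input SSTable is already sorted, compaction is essentially a merge step in which newer tables win conflicts and deletions are dropped. A minimal sketch, assuming each SSTable is a sorted list of `(key, value)` pairs and a `TOMBSTONE` marker records deletions (both names are illustrative):

```python
TOMBSTONE = None  # hypothetical marker written when a key is deleted

def compact(sstables):
    """Merge SSTables (ordered newest first) into one sorted, de-duplicated table.

    Obsolete versions of a key and tombstoned (deleted) keys are discarded,
    mirroring what a background compaction pass does on disk.
    """
    merged = {}
    # Apply oldest tables first so newer writes overwrite older versions.
    for table in reversed(sstables):
        for key, value in table:
            merged[key] = value
    # Drop tombstones and emit keys in sorted order for the new SSTable.
    return [(k, merged[k]) for k in sorted(merged) if merged[k] is not TOMBSTONE]
```

A real compactor streams the merge with a heap instead of materializing a dict, but the key-resolution rules are the same.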
While compaction keeps the total number of files manageable and ensures the database stays performant, it introduces a challenge known as read amplification. Because a single key could exist in the memtable or any number of SSTables, a read operation might need to check multiple files before finding the most recent version of a key. To mitigate this, databases use Bloom filters: probabilistic data structures that tell the system when a key definitely does not exist in a specific file, saving us from useless disk reads.
Choosing between these architectures is a classic engineering trade-off. B+ trees are optimized for read-heavy workloads because they provide a static, single location for every key, leading to predictable, logarithmic lookup times. They excel when you need to perform complex range scans across large datasets.
LSM-trees, however, treat the entire database like an append-only log. This makes them the industry standard for modern distributed databases like Apache Cassandra, RocksDB, and LevelDB. They trade some read efficiency and latency predictability (background compaction can cause periodic stalls) for the ability to sustain very high write throughput without the "stuttering" often seen in B+ tree structures due to page splits and balancing operations.
While LSM-trees handle the physical storage layer, normalization addresses the logical structure of your data. A common pitfall is assuming that modern storage structures fix poor data design. A highly normalized database (e.g., 3NF) requires multiple joins to reconstruct an object. In an LSM-based system (often found in NoSQL databases), we frequently practice denormalization.
Because LSM-trees are optimized for single-key lookups or sequential scans, we often store entire entities as a single blob or row to avoid the cost of joining multiple files during a read. By starting from a well-normalized relational model and then deliberately denormalizing it into these storage-friendly tables, we can balance the integrity benefits of normalization with the practical speed of log-structured storage.
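As a concrete illustration of that denormalization step, consider packing a user and their orders, which a 3NF schema would keep in separate tables, into one key-value row. The `user:{id}` key scheme and the function name are hypothetical:

```python
import json

def denormalize_user(user, orders):
    """Pack a user and their orders into a single (key, value) row.

    In a normalized (3NF) layout, reconstructing this object requires a join
    between a users table and an orders table; here, one point lookup on the
    LSM store returns the whole entity.
    """
    key = f"user:{user['id']}"                      # hypothetical key scheme
    value = json.dumps({**user, "orders": orders})  # entire entity as one blob
    return key, value
```

The trade-off is familiar: updates to a shared fact must now touch every embedded copy, which is why this pattern suits read paths dominated by whole-entity lookups.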
Note: Always profile your workload before choosing an architecture. If your application sees 99% reads and 1% writes, the overhead of managing LSM compaction may actually perform worse than a classic, tuned B+ tree.