How do B-Trees stay balanced during frequent record injections?

Explore this question in depth in our interactive lesson on Welcome to Database Storage and Normalization.

What is the primary difference between HDD and SSD storage performance?

Explore this question in depth in our interactive lesson on Welcome to Database Storage and Normalization.

What happens to system performance during a Full Table Scan?

Explore this question in depth in our interactive lesson on Welcome to Database Storage and Normalization.

What are the common side effects of database data redundancy?

Explore this question in depth in our interactive lesson on Welcome to Database Storage and Normalization.

How does page size impact overall query retrieval speed?

Explore this question in depth in our interactive lesson on Welcome to Database Storage and Normalization.

Lesson 1

Welcome to Database Storage and Normalization

~5 min50 XP

Introduction

Welcome to the foundational course on database engineering. Today, we will bridge the gap between abstract data models and the physical reality of how a Database Management System (DBMS) stores, retrieves, and organizes information to ensure data integrity and query efficiency.

The Architecture of Data Storage

At the heart of any DBMS lies the challenge of mapping logical data (tables and rows) to physical hardware (HDD or SSD storage). Computers do not store rows as humans see them; they read data in fixed-size blocks known as Pages. When you execute a query, the engine must perform I/O Operations to pull these pages from disk into memory.

To optimize this, databases utilize Indices, such as the B-Tree (Balanced Tree). A B-Tree keeps data sorted and allows for search, insertion, and deletion in logarithmic time complexity, represented as $O(\log n)$ . Without indexing, the engine would perform a Full Table Scan, reading every single block on the disk to find one record, which is prohibitively slow for large datasets.

Why are indices based on B-Trees preferred in database system storage?

The Goal of Data Integrity: Normalization

Once we understand how to store data, we must address how to structure it. Normalization is the systematic process of organizing data in a database to reduce Redundancy and improve Data Dependency. The goal is simple: isolate data so that additions, deletions, and modifications can be made in just one table.

When data is not normalized, we encounter Update Anomalies. Imagine a table storing a student’s address alongside their course enrollment. If the student moves, you must update every single row where the student is enrolled in a class. If you change a row but miss another, you end up with inconsistent data—a state known as a Data Anomaly.

First Normal Form (1NF)

The journey toward a perfect database begins with First Normal Form (1NF). A table is in 1NF if it satisfies two conditions: every column contains only Atomic Values (indivisible units), and there are no repeating groups of columns.

For example, if you have a column named PhoneNumbers containing 555-0101, 555-0102, that is NOT 1NF because the data is not atomic. You must split these into individual rows or separate columns. Achieving 1NF ensures that the database engine can easily filter and sort data using standard operators.

Note: While 1NF seems basic, it is the bedrock of relational algebra. Without atomicity, you cannot effectively perform Joins or complex calculations.

A table storing multiple email addresses in a single comma-separated cell is considered to be in 1NF.

Second Normal Form (2NF) and Full Functional Dependency

After reaching 1NF, we move to Second Normal Form (2NF). A table is in 2NF if it is already in 1NF and every Non-Prime Attribute is "fully functionally dependent" on the Primary Key.

This is particularly relevant for tables with a Composite Primary Key (a key made of two or more columns). If a column only depends on part of the primary key, it creates a Partial Dependency. We must move those columns to a different table to eliminate redundancy.

A table that is in 1NF and contains no ___ dependencies is in 2NF.

Third Normal Form (3NF)

Finally, we arrive at Third Normal Form (3NF). A table is in 3NF if it is in 2NF and contains no Transitive Dependencies. A transitive dependency occurs when a non-prime column depends on another non-prime column, rather than directly on the primary key.

Think of it this way: the non-key columns should provide info about the "key, the whole key, and nothing but the key." If you have a ZipCode column and a City column in a Users table, the City depends on the ZipCode. Since ZipCode is not the primary key, this is a transitive dependency. You should move City to a separate Locations table.

What does 3NF primarily aim to eliminate?

Key Takeaways

Pages are the fundamental unit of storage; indexing with B-Trees makes accessing these pages efficient.
Normalization is the process of reducing redundancy to prevent Update, Insertion, and Deletion Anomalies.
1NF requires Atomic Values, ensuring data is not bundled inside single cells.
2NF and 3NF require that data depends solely on prime keys, ensuring a clean, logical separation of entities.

Finding tutorial videos...

Go deeper

How do B-Trees stay balanced during frequent record injections?🔒
What is the primary difference between HDD and SSD storage performance?🔒
What happens to system performance during a Full Table Scan?🔒
What are the common side effects of database data redundancy?🔒
How does page size impact overall query retrieval speed?🔒