Welcome to the foundational course on database engineering. Today, we will bridge the gap between abstract data models and the physical reality of how a Database Management System (DBMS) stores, retrieves, and organizes information to ensure data integrity and query efficiency.
At the heart of any DBMS lies the challenge of mapping logical data (tables and rows) to physical hardware (HDD or SSD storage). Computers do not store rows as humans see them; they read data in fixed-size blocks known as Pages. When you execute a query, the engine must perform I/O Operations to pull these pages from disk into memory.
To optimize this, databases utilize Indices, such as the B-Tree (Balanced Tree). A B-Tree keeps data sorted and allows for search, insertion, and deletion in logarithmic time complexity, represented as . Without indexing, the engine would perform a Full Table Scan, reading every single block on the disk to find one record, which is prohibitively slow for large datasets.
Once we understand how to store data, we must address how to structure it. Normalization is the systematic process of organizing data in a database to reduce Redundancy and improve Data Dependency. The goal is simple: isolate data so that additions, deletions, and modifications can be made in just one table.
When data is not normalized, we encounter Update Anomalies. Imagine a table storing a studentβs address alongside their course enrollment. If the student moves, you must update every single row where the student is enrolled in a class. If you change a row but miss another, you end up with inconsistent dataβa state known as a Data Anomaly.
The journey toward a perfect database begins with First Normal Form (1NF). A table is in 1NF if it satisfies two conditions: every column contains only Atomic Values (indivisible units), and there are no repeating groups of columns.
For example, if you have a column named PhoneNumbers containing 555-0101, 555-0102, that is NOT 1NF because the data is not atomic. You must split these into individual rows or separate columns. Achieving 1NF ensures that the database engine can easily filter and sort data using standard operators.
Note: While 1NF seems basic, it is the bedrock of relational algebra. Without atomicity, you cannot effectively perform Joins or complex calculations.
After reaching 1NF, we move to Second Normal Form (2NF). A table is in 2NF if it is already in 1NF and every Non-Prime Attribute is "fully functionally dependent" on the Primary Key.
This is particularly relevant for tables with a Composite Primary Key (a key made of two or more columns). If a column only depends on part of the primary key, it creates a Partial Dependency. We must move those columns to a different table to eliminate redundancy.
Finally, we arrive at Third Normal Form (3NF). A table is in 3NF if it is in 2NF and contains no Transitive Dependencies. A transitive dependency occurs when a non-prime column depends on another non-prime column, rather than directly on the primary key.
Think of it this way: the non-key columns should provide info about the "key, the whole key, and nothing but the key." If you have a ZipCode column and a City column in a Users table, the City depends on the ZipCode. Since ZipCode is not the primary key, this is a transitive dependency. You should move City to a separate Locations table.