What is normalization?
Database normalization is the process of organizing the columns and tables of a relational database to minimize data redundancy.
And another definition:
Normalization splits a large table into smaller tables and define relationships between them to increases the clarity in organizing data.
In general I like to see tables that have one purpose. Having the table serve many purposes introduces many of the challenges like: data duplication, data update issues, and increased effort to query data.
Single purpose and relations instead of mixed up table
Why normalize?
If you have big database that is not normalized, people working with it most likely will face lots of problems. Performance will be bad, there are high chances of having multiple rows for one object, then, some day one row will change, and then you have a conflict.
Long story short – successfully normalized databases will increase performance and reduce storage.
So this is our table, let’s say we need some kind of database for an agency.
EmployeeId | Person | Phone |
1 | John Jones | 27716602, 26153215, |
2 | Peter Janes | 47716602, 36153215, 52321262 |
3 | Michael Juines | 17716602 |
We are storing our employee id, the name and their phone numbers.
Common misconception: Our database structure heavily depends on the required functionality. For example. If we always use a full name of the person, it is not necessary to split the “Person” column into first name and last name columns.
About phone numbers, if we know, that there is a limit, for example: work phone, personal mobile phone and fax these three columns wouldn’t count as duplicative columns and would still pass first normal form requirements.
To clarify, let’s assume that we need this functionality for the finished product:
- We are going to send personalized letters. (e.g. Hello John, how are you doing?)
- People can only add their mobile phone number.
- There is no limit of the mobile phone numbers they can add.
1NF
Firstnormal formsets the very basic rules for an organized database:
- Each column is unique
- Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).
Requirements for 1NF
- Table must have at least one candidate key (that all attributes depend on)
- Every column stores atomic values (example, if you use full name all the time it makes sense to store full name, if you will use first name and last name separately – you should store name and last name separately)
- Not a repeating group (multiple phone numbers in one row) – row that stores multiple similar values instead of one
EmployeeId | Person | Phone |
1 | John Jones | 27716602 |
1 | John Jones | 26153215 |
2 | Peter Janes | 47716602 |
2 | Peter Janes | 36153215 |
2 | Peter Janes | 52321262 |
3 | Michael Juines | 17716602 |
Now first normal form is satisfied, as the columns in this table hold just one value.
2 Normal Form
Second normal form (2NF) further addresses the concept of removing duplicative data:
- Meet all the requirements of the first normal form.
- Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
- Create relationships between these new tables and their predecessors through the use of foreign keys
Requirements for 2NF
- Must meet all 1NF requirements
- Non-key attributes must not depend on subset of any candidate key.
As we can see in this example, we have a new table to store all the phones, and another table to determine who owns which phone number.
Employees
EmployeeId | Name | Last |
1 | John | Jones |
2 | Peter | Janes |
3 | Michael | Juines |
Phones
PhoneId | Phone |
1 | 27716602 |
2 | 26153215 |
3 | 47716602 |
4 | 36153215 |
5 | 52321262 |
6 | 17716602 |
This means EmployeeId and PhoneId are both Primary keys of the table? Not exactly, they are composite primary keys referring to entries in other two tables.
SQL code to create that table would look like this.
Transact-SQL
Create table EmployeesPhones ( CustomerID int references Employees(EmployeeId) not null, ProductID int references Phones(PhoneId) not null Primary key (CustomerID, ProductID) )
1 2 3 4 5 6 | Create table EmployeesPhones ( CustomerID int references Employees(EmployeeId) not null, ProductID int references Phones(PhoneId) not null Primary key (CustomerID, ProductID) ) |
EmployeesPhones
EmployeeId | PhoneId |
1 | 1 |
1 | 2 |
2 | 3 |
2 | 4 |
2 | 5 |
3 | 6 |
Third Normal Form (3NF)
Third normal form (3NF) goes one large step further:
- Meet all the requirements of the second normal form.
- Remove columns that are not upon the primary key.
For third normal form, I adjusted the table a little bit. I added Zip code and city columns.
We can still save a little bit of space, ZIP code determines a city, meaning that many people can live in the same city and have the same ZIP code.
We find that in the above Employees table, EmployeeId is the key and only prime key attribute. We find that City can be identified by EmployeeId as well as Zip itself. Neither Zip is a superkey nor is City a prime attribute. In this case – transitive dependency exists.
Employees
EmployeeId | Name | Last | ZIP | City |
1 | John | Jones | 8000 | Aarhus C |
2 | Peter | Janes | 8200 | Aarhus N |
3 | Michael | Juines | 8520 | Lystrup |
To bring this table into third normal form, we break the table into two tables.
Employees
EmployeeId | Name | Last | ZIP |
1 | John | Jones | 8000 |
2 | Peter | Janes | 8200 |
3 | Michael | Juines | 8520 |
Zip
Name? | City? | Zip? |
John | Aarhus C | 8000 |
Rick | Aarhus N | 8200 |
Coral | Lystrup | 8520 |
Carl | Aarhus C | 8000 |
Maggie | Lystrup | 8520 |
Daryl | Aarhus N | 8200 |
Michonne | Aarhus N | 8200 |