This post is lecture notes I created for a private tutorial. It is meant for non-technical readers to follow.
Summary of Database Structure
Databases consist of tables, which in turn consist of columns. One table is for a same type of data: employee, product, etc. One column is for an attribute of the data: employee table may have id, title, hired date columns, whereas product table may have id, product name, category columns.
Each row is sometimes called a record.
For BA, table is just a slightly more restricted version of Excel spreadsheet.
Primary key
Primary Key uniquely identifies each row in table, meaning each row contain a unique PK value. Some PK candidates are employee id and product id, as they are required and unique for each row.
It is a common practice to add a xx_id column just to serve as PK when such attribute does not naturally come with the data. This is sometimes called a surrogate key. For example, it is possible that more than one employees have the same name. Therefore, it’s better to use an employee id than name for the PK.
There is also composite key, which is just primary key that is a combination of more than one columns.
Relations and Foreign Key
Relational database is built upon the idea of relations between tables.
There are three types of relations: 1-1, 1-many, many-many.
1-1 Relationship
1-1 relationship is rarely used in reality, as they can just be put into one table. For example, if one employee only holds one designated car parking lot, instead of having one employee table and one car parking lot table, we can just use car parking lot id as a column of the employee table. But of course 1-1 it has its advantage as well. In our example, if car parking lot has a bunch of its own information: geography coordinate, size, description, etc. Then it’s worth the effort of putting them into a separate table and relate the table to employee.
1-many Relationship
1-many, or many-1, is pretty straight forward. If each employee is in charge of multiple products in the company, then it is a 1-many relation. The question is: how do we store such relation? In our example, we can add a column called employee id to the product table. This is nothing but the employee’s PK. In this way, each product will have one responsible employee stored with it. Note that multiple products can have the same employee in charge. In this case, the employee id in the product table is called the Foreign Key, it is just the copy of the PK in another table.
Some formal terms: the employee table in our example is called the parent table in a 1-many relationship (the 1), whereas the products table in our example is called the child table (the many). The relationship is stored in the child table with the FK; the parent table contains no information at all regarding this relationship.
Many-many Relationship
What if we have multiple employees in charge of each product, and each employee may be in charge of multiple products as well. In this case we cannot figure out a good way to store the information but to create a table in the middle. This table’s job is to model the relation between the two other tables, hence it’s often called relation table or link table.
By creating the link table, a many-many relationship becomes two 1-many relationships: each record on the link table points has two FKs, one pointing to a record on employee table and one pointing to a record on product table.
Database Design
Normalization
The world normalization may sound esoteric but it’s actually something you do intuitively when designing tables for a database: Where do I put this information? Should I duplicate this information in both tables or is there a better way to store it? Am I creating too many sparse/boasted tables?
At the risk of over simplifying, you can understand normalization as separating information into more tables and use relations to bundle the together. Normalization done right leads to less duplication and inconsistency. But over normalization will negatively affect your performance as retrieving data requires too much table jumping. In general, always consider the following norm forms in your design, as they would make your database design much clearer.
A note: there seems to be various definition with regard to each normal form’s definition. Don’t be surprised when you see a different one in future. In the end, they are serving the same purpose.
1NF
A table is in first normal form (1NF) if every cell contains no lists or sets. So you if you have a table like the following, that’s NOT 1NF, as the offices contains a list of values:
Company | Offices |
---|---|
ABP | Auckland, Dunedin, Wellington |
But this one is:
Company | Office |
---|---|
ABP | Auckland |
ABP | Dunedin |
ABP | Wellington |
Simple, right?
2NF
A table is in 2NF if:
-
It’s already in 1NF, and
-
Every non-PK attribute is fully dependent on the PK.
What’s functionally dependent? Simple: if you can retrieve the value of attribute Y knowing the value of attribute X, then Y is dependent on X. We mark it as X -> Y. For example, “Can you tell me the name of the employee with id 123?” Given id is the PK, we can easily find the row, and then retrieve the value of the name cell. In this case we say id -> name, the name is dependent on the id.
What’s fully dependent? It’s relevant when the left side of dependency has more than one attribute. For example, if X, Y -> Z, and removing X or Y breaks this dependency, we say Z is fully dependent on X, Y.
If you think about it, tables violating 2NF only exists when there is a composite PK. In the following example, we only need company’s value to retrieve the value of company employee number, without the information of the office. Therefore the company employee number does not fully dependent on the composite PK, hence it violates 2NF.
Company (PK) | Offices (PK) | Company Employee Number | Office Employee Number |
---|---|---|---|
ABP | Auckland | 5000 | 100 |
ABP | Wellington | 5000 | 80 |
TDY | Auckland | 200 | 80 |
If we separate this table into two:
Company (PK) | Company Employee Number |
---|---|
ABP | 5000 |
TDY | 200 |
Company (PK) | Offices (PK) | Office Employee Number |
---|---|---|
ABP | Auckland | 100 |
ABP | Wellington | 80 |
TDY | Auckland | 80 |
Now both tables are in 2NF. Specifically, in the second table, we can retrieve the employee number of an office only knowing both the company and the office.
3NF
A table is in 3NF if:
- It is already in 2NF (hence 1NF too), and
- Non-key attribute cannot depend on other non-key attribute
In the following table, we find that office employee number can be retrieved just knowing the office phone number. (This is assuming one office uses one unique phone number) Therefore, office employee number is dependent on a non-key attribute. So we say this is violating 3NF.
Company (PK) | Offices (PK) | Office Phone Number | Office Employee Number |
---|---|---|---|
ABP | Auckland | 1234567 | 100 |
ABP | Wellington | 2345678 | 80 |
TDY | Auckland | 3456789 | 80 |
Data Types
We now have the tables and the relations connecting them together. We have also normalized the tables to an appropriate form. It’s time we consider the meat of the database: the data.
Data types depends on the actual RDBMS implementation. But almost all implementations have type designed to store the following categories of data:
- numerics
- small intiger
- big intiger
- decimal
- float
- text (and sometimes their corresponding unicode version)
- char
- short string
- long string
- binary/blob
- others
Here’s a summary of Sql Server (Microsoft RDBMS) basic data types:
Data Type | Description | Example |
---|---|---|
bigint | 8 bytes storing big number | 9,223,372,036,854,775,807; -9,223,372,036,854,775,808 |
int | 4 bytes storing number, most commonly used numeric type | 2,147,483,647; -2,147,483,648 |
smallint | 2 bytes storing small number | 32,767; -32,768 |
tinyint | 1 byte storing even smaller non-negative number | 0, 255 |
bit | 1 bit, storing boolean value | 1, 0 |
decimal | For exact decimals. Storage varies on required precision, up to 17 bytes | DECIMAL(5,2) (behind the scene this defines a new type with 3 digits on the left of the decimal point and 2 digits on the right) |
float | Non-accurate decimal number. Storage varies on required precision. float(24) uses 4 bytes and float(53) uses 8 bytes. | float(24), float(53) (if you use other precision, it’ll round up to either 24 or 53 anyway) |
datetime2 | A sql compliant data type to replace the old datetime. Storage varies on required precision, up to 8 bytes | datetime2(7) (7 stands for the second fraction, so we can store value like “2025-05-21 10:15:30.5555555”) |
char/nchar | Fixed-size string. Storage varies on length. The prefix “n” is meant for storing unicode string. | char(4)/nchar(4) (4 stands for the 4 bytes allocated memory, not string length. In single byte encoding this is the same thing, but in other scenarios not so) |
varchar | Variable-size string. Storage varies on length. The prefix “n” is meant for storing unicode string. | varchar(4)/nvarchar(4) (4 stands for the 4 bytes allocated memory, not string length. In single byte encoding this is the same thing, but in other scenarios not so) |
binary/varbinary | similar to char/varchar, only for binary data |
Quick note on the difference between char and varchar:
When you make a column char(n), usually the DBMS allocate n bytes of memory for each row. Even if the length of value (m) is different between rows, or if the length of value is smaller than n, it always allocate n bytes for each row.
In varchar(n), things work slightly differently. For each row, there will be a byte or two storing m (again, m is the length of the actual value for this row), and then the m bytes storing actual value. In other words, the memory allocated for varchar differs for each row (m+ 1 or m + 2 bytes), in contrast to char.
The take away is that if your strings are of fixed size n, make the column of type char(n), as it provides better query performance and in this case better storage efficiency. Otherwise, use varchar(n), as it may save you a lot of memory when many of the rows’ value don’t need that big fixed-size chunk of memory.
Index
The last section on data type has revealed a basic yet often ignored fact: in the end, data are still stored physically on the disk, and we need to think about how to organize the data to achieve the optimal storage and query efficiency.
DBMSes are pretty smart and want to help us as much as they can, but often they need some clue from DB designers. Index is one of the most important instruction we give the DBMS.
Think DBMS as a personal assistant, who helps us recording and querying contacts information on a big thick phonebook. The assistant doesn’t know anything about how we want to use the phonebook, so the only way to do their job is to simply write down the new contact we add to the next blank space on paper, and scan through the unorganized contact items every time we need to find some contact’s information. If we want to know John Smith’s phone number, the assistant can do nothing but scanning from page 1 to the last page finding all the contacts naming John Smith and reports us their phone number. This is obviously very inefficient if we have millions of contacts stored. This unstructured data storing mechanism in DB is called a heap.
Luckily we tell the assistant that most often we would use a person’s name to find the person’s information. Knowing this vital information, the assistant spend the whole night reorganizing the whole phonebook in alphabetical order according to the contacts’ name. Next time we want information about John Smith, the assistant straightly jump to letter J, and skip over Ja, Jb… to find Jo, and so on. In this case, the data are physically ordered according to one attribute (name), and the order should be maintained during adding/updating/deleting data. For example, if John Smith changes his name to Ron Smith, his item would be physically moved to the page where contacts whose name starts with R are stored. The assistant takes care of the reorganizing when we tell them to update John’s name to Ron. In this example, we use name as the attribute to determine the physical storing order of the information. We call the name attribute clustered index.
It’s obvious that physically we can store the information in just one order: the assistant cannot order contacts on name AND on street number at the same time. Hence, in DB we can have only one attribute as the clustered index.
But, what if we do want to look up contacts using street address for some reason? There is currently no easy way for the assistant to find the answer to “What’s the phone number of the guy who lives at 16 Liverpool Street?”. A thorough scan is the only option.
You might be thinking: Hey, the diligent assistant can just reorganize the phonebook based on the street address, right? The thing is that we still want to keep our phonebook organized, or clustered indexed, with contacts’ name. Remember in real world that’s still our primary usage scenario. What about making another copy of the phonebook using address as the clustered index? This is also unacceptable for storage efficiency. Soon you’d have a room full of phonebooks all storing essentially the same data.
What do we do now? The solution is to add a non-clustered index. We just tell the assistant that we often have this need to look up people with street address. The assistant spends another night adding an index to our phonebook with items like this: 1 Liverpool Street, Page 220-3; 2 Liverpool Street, Page 5-11; …; 16 Liverpool Street, Page 89-76. Now problem is solved. Next time the query comes in, the assistant just use the index to find out that 16 Liverpool Street points to page 89 line 76. Then they turn the page to 89, and there it is sting on line 76 the data we want.
Non-clustered index is just like the Index you see in many textbooks. It is a separate area storing frequently used lookup attribute with corresponding data’s location in the actual table.
Our assistant, the DBMS, knows the most efficient way to get the data we want given a query and the indices at hand. You can view the execution plan, provided by many DB utilities, to see how it’s carrying out the query task.