I like this. Let’s stop talking theory and build this out a few different ways.
Terrible, horrible, no good design:
t_AllTheThings
- Generic_ID_Field
- Customer_Name
- Customer_Address
- Delivery_Address
- Order_Info
Super terrible for many reasons.
- The Customer_Name field crams first name, optional middle name, and last name into one field, forcing you to use a LIKE clause whenever you look up customer information
- A new row is required for every new order, even for returning customers (needlessly duplicating data)
- You need to SELECT DISTINCT to list addresses (customer or delivery) in the front end, or force the customer to re-enter the addresses each time they order something.
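To make the pain concrete, here’s a quick sketch of the one-table design using Python’s built-in sqlite3 (the table and column names follow the sketch above; the data is invented):

```python
import sqlite3

# In-memory database, just for demonstration
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t_AllTheThings (
        Generic_ID_Field INTEGER PRIMARY KEY,
        Customer_Name    TEXT,  -- first, middle, and last crammed together
        Customer_Address TEXT,
        Delivery_Address TEXT,
        Order_Info       TEXT
    )
""")

# A returning customer means a whole new row, addresses and all
rows = [
    ("John Q Smith", "123 Main St", "123 Main St", "order #1"),
    ("John Q Smith", "123 Main St", "123 Main St", "order #2"),
]
conn.executemany(
    "INSERT INTO t_AllTheThings "
    "(Customer_Name, Customer_Address, Delivery_Address, Order_Info) "
    "VALUES (?, ?, ?, ?)", rows)

# The only way to find a customer by last name is a LIKE scan
hits = conn.execute(
    "SELECT Order_Info FROM t_AllTheThings "
    "WHERE Customer_Name LIKE ?", ("%Smith%",)).fetchall()
print(hits)  # both rows match, and the address is stored twice
```

Two orders from one customer, and we already have two copies of the name and address on disk.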
Less Terrible:
t_Customer_Info
- Customer_ID <- PK
- First_Name
- Last_Name
- Billing_Address
- Delivery_Address
t_Order_Info
- Order_ID <- PK
- Customer_ID <- FK
- Billing_Address
- Delivery_Address
- … Additional fields related to the order
So this is better. Now a customer doesn’t have to re-enter their address information, and we don’t have to jump through hoops to pull up their address. We do lose some flexibility from the previous design: a customer can only have one billing address and one delivery address associated with them at a time. But since we also keep billing and delivery addresses in the order info table, the front end can be configured to use the addresses in customer info as defaults, and then allow the customer to manually enter different addresses.
Where we fail the most here is the duplication of addresses. Assuming most people want multiple orders going to the same address over a given period of time (until they move or something), the same address is going to show up in order info several times. Can we make this better? Yes we can.
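Just to put numbers on that duplication, here’s a sketch of the order table from this design in sqlite3 (made-up data, column names from the tables above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t_Order_Info (
        Order_ID         INTEGER PRIMARY KEY,
        Customer_ID      INTEGER,
        Billing_Address  TEXT,
        Delivery_Address TEXT
    )
""")

# One customer, three orders, the same address copied every time
conn.executemany(
    "INSERT INTO t_Order_Info (Customer_ID, Billing_Address, Delivery_Address) "
    "VALUES (?, ?, ?)",
    [(1, "123 Main St", "123 Main St")] * 3)

total, distinct = conn.execute(
    "SELECT COUNT(Delivery_Address), COUNT(DISTINCT Delivery_Address) "
    "FROM t_Order_Info").fetchone()
print(total, distinct)  # three stored copies, one distinct value
```

Three rows, one real address. Scale that up and the waste adds up fast.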
Now we’re talkin’:
t_Customer_Info
- Customer_ID <- PK
- First_Name
- Last_Name
- Default_Billing_Address
- Default_Delivery_Address
t_Addresses
- Address_ID <- PK
- Customer_ID <- FK
- Address1
- Address2
- City
- State
- Zip
t_Order_Info
- Order_ID <- PK
- Customer_ID <- FK
- Delivery_Address <- FK
- Billing_Address <- FK
- … More fields related to the order
Now this is looking better. We can have recurring customers without having to duplicate their information. Each customer can have multiple addresses to work with. We could probably even add a field to differentiate between a billing address and a delivery address if we cared that much. Addresses can remain in the addresses table indefinitely for reporting purposes. We might add an active-address flag, a simple true/false our application could use to decide whether to show the address in a typical listing of the addresses associated with the customer.
The order info table references the information it needs, rather than keeping a copy all to itself. This is great for reporting: since the addresses exist indefinitely, you can pull an order history report and see where each order went and what address it was billed to.
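Here’s the whole final design wired up in sqlite3, ending with that order history report (the FK wiring is the point; the sample data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE t_Customer_Info (
        Customer_ID INTEGER PRIMARY KEY,
        First_Name  TEXT,
        Last_Name   TEXT,
        Default_Billing_Address  INTEGER,
        Default_Delivery_Address INTEGER
    );
    CREATE TABLE t_Addresses (
        Address_ID  INTEGER PRIMARY KEY,
        Customer_ID INTEGER REFERENCES t_Customer_Info (Customer_ID),
        Address1 TEXT, Address2 TEXT,
        City TEXT, State TEXT, Zip TEXT
    );
    CREATE TABLE t_Order_Info (
        Order_ID    INTEGER PRIMARY KEY,
        Customer_ID INTEGER REFERENCES t_Customer_Info (Customer_ID),
        Delivery_Address INTEGER REFERENCES t_Addresses (Address_ID),
        Billing_Address  INTEGER REFERENCES t_Addresses (Address_ID)
    );
""")

conn.execute("INSERT INTO t_Customer_Info (Customer_ID, First_Name, Last_Name) "
             "VALUES (1, 'John', 'Smith')")
conn.execute("INSERT INTO t_Addresses (Address_ID, Customer_ID, Address1, City, State, Zip) "
             "VALUES (10, 1, '123 Main St', 'Springfield', 'IL', '62701')")
# Two orders, but only one stored copy of the address
conn.executemany(
    "INSERT INTO t_Order_Info (Customer_ID, Delivery_Address, Billing_Address) "
    "VALUES (?, ?, ?)", [(1, 10, 10), (1, 10, 10)])

# Order history: where each order shipped and where it was billed
report = conn.execute("""
    SELECT o.Order_ID, c.Last_Name, d.Address1, b.Address1
    FROM t_Order_Info o
    JOIN t_Customer_Info c ON c.Customer_ID = o.Customer_ID
    JOIN t_Addresses d ON d.Address_ID = o.Delivery_Address
    JOIN t_Addresses b ON b.Address_ID = o.Billing_Address
""").fetchall()
print(report)
```

Note the double join against t_Addresses, once for the delivery address and once for the billing address; that’s the price of storing each address exactly once.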
This design will eventually run into problems when the tables become incredibly full. When you get to the size of Amazon, you’ll probably be using something different.
So, what makes this a decent design?
First, we’re limiting the duplication of data. Sure, some data will be duplicated. The state field, for example. Hell, even the entire set of address fields might be duplicated if one customer at Address A moves away and a second customer moves into Address A. That technically is duplication, but duplicated addresses between different customers is a level of duplication I’d be comfortable with.
As for the duplication of the state, I agree with risk, I don’t think it would make a whole lot of sense to create a lookup table for states. Assuming the US (because is there really a world outside of our borders? I think it’s fake news) we have 50 states, so the key would be a TINYINT. That’s one byte instead of the two a two-letter abbreviation takes, and one byte per row isn’t worth a join. Now if you were going to have a state lookup table that included state name, abbreviation, potential zip codes to help sanitize address input, and whatever else, now you might have something.
Now we’ve got more tables, more fields, and our queries are more complex. The complexity of the queries should only be a concern for the queries that are run on a regular basis (i.e. the queries in your application). If those queries are taking too long, they should be looked at. Given the keys in the tables, I think it’ll be a while before the tables get large enough to start negatively affecting query performance. When that does happen, you can look into getting more memory for the server, and/or start designing an archiving solution.
For ad-hoc queries, you can do things like create views…y’know, if you absolutely have to. Or they can build out their queries and save them for use later, like normal human beings. Personally, I’m loath to create database objects that make life easier for people who manually query the database. Chances are if you want the data now, you’ll probably want it again in the future. On first request, you’ll build your ad-hoc query, and seriously consider putting it in a secondary reporting application. On second request, you’ll actually do it.
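If you do cave and create a view for the manual-query crowd, it’s a thin layer over the design above (a sketch; the view name is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_Addresses (
        Address_ID INTEGER PRIMARY KEY,
        Customer_ID INTEGER, Address1 TEXT, City TEXT, State TEXT, Zip TEXT
    );
    CREATE TABLE t_Order_Info (
        Order_ID INTEGER PRIMARY KEY,
        Customer_ID INTEGER, Delivery_Address INTEGER, Billing_Address INTEGER
    );
    -- The ad-hoc crowd queries the view; the joins stay out of their hands
    CREATE VIEW v_Order_Delivery AS
        SELECT o.Order_ID, a.Address1, a.City, a.State, a.Zip
        FROM t_Order_Info o
        JOIN t_Addresses a ON a.Address_ID = o.Delivery_Address;
""")
conn.execute("INSERT INTO t_Addresses VALUES (10, 1, '123 Main St', 'Springfield', 'IL', '62701')")
conn.execute("INSERT INTO t_Order_Info VALUES (1, 1, 10, 10)")

rows = conn.execute("SELECT * FROM v_Order_Delivery").fetchall()
print(rows)
```

The view costs nothing at rest; it’s just a saved query the database runs on demand.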
Two more things before I go. First, these ID fields. I don’t know what it is about management, but they absolutely hate seeing vanilla numbers in ID fields. They want to see GUIDs. It’s a buzzword, and it touches the hearts and minds of people who don’t know any better. Don’t use GUIDs for ID fields. Use plain, vanilla ints. They’re faster for both humans and computers to read.
Second, if anyone reading this wants to play with larger sets of data, the StackExchange databases are available for download.