I like this. Let’s stop talking theory and build this out a few different ways.
Terrible, horrible, no good design:
t_AllTheThings
- Generic_ID_Field
- Customer_Name
- Customer_Address
- Delivery_Address
- Order_Info
Super terrible for many reasons.
- The Customer_Name field crams first name, optional middle name, and last name into one field, forcing you to use a LIKE clause whenever you look up customer information
- A new row is required for every new order, even for returning customers (needlessly duplicating data)
- You need to SELECT DISTINCT to list addresses (customer or delivery) in the front end, or force the customer to re-enter the addresses each time they order something.
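To make the pain concrete, here’s a quick sketch of the one-table design using Python’s built-in sqlite3 (the table and column names follow the sketch above; the data is invented):

```python
import sqlite3

# In-memory database, just for demonstration
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t_AllTheThings (
        Generic_ID_Field INTEGER PRIMARY KEY,
        Customer_Name    TEXT,  -- first, middle, and last crammed together
        Customer_Address TEXT,
        Delivery_Address TEXT,
        Order_Info       TEXT
    )
""")

# A returning customer means a whole new row, addresses and all
rows = [
    ("John Q Smith", "123 Main St", "123 Main St", "order #1"),
    ("John Q Smith", "123 Main St", "123 Main St", "order #2"),
]
conn.executemany(
    "INSERT INTO t_AllTheThings "
    "(Customer_Name, Customer_Address, Delivery_Address, Order_Info) "
    "VALUES (?, ?, ?, ?)", rows)

# The only way to find a customer by last name is a LIKE scan
hits = conn.execute(
    "SELECT Order_Info FROM t_AllTheThings "
    "WHERE Customer_Name LIKE ?", ("%Smith%",)).fetchall()
print(hits)  # both rows match, and the address is stored twice
```

Two orders from one customer, and we already have two copies of the name and address on disk.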
Less Terrible:
t_Customer_Info
- Customer_ID <- PK
- First_Name
- Last_Name
- Billing_Address
- Delivery_Address
t_Order_Info
- Order_ID <- PK
- Customer_ID <- FK
- Billing_Address
- Delivery_Address
- … Additional fields related to the order
So this is better. Now a customer doesn’t have to re-enter their address information, and we don’t have to jump through hoops to pull up their address. We do lose some flexibility from the previous design: a customer can only have one billing address and one delivery address associated with them at a time. But since we also keep billing and delivery addresses in the order info table, the front end can be configured to use the addresses in customer info as defaults, and then allow the customer to manually enter different addresses.
Where we fail the most here is the duplication of addresses. Assuming most people want multiple orders going to the same address over a given period of time (until they move or something), the same address is going to show up in order info several times. Can we make this better? Yes we can.
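Just to put numbers on that duplication, here’s a sketch of the order table from this design in sqlite3 (made-up data, column names from the tables above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t_Order_Info (
        Order_ID         INTEGER PRIMARY KEY,
        Customer_ID      INTEGER,
        Billing_Address  TEXT,
        Delivery_Address TEXT
    )
""")

# One customer, three orders, the same address copied every time
conn.executemany(
    "INSERT INTO t_Order_Info (Customer_ID, Billing_Address, Delivery_Address) "
    "VALUES (?, ?, ?)",
    [(1, "123 Main St", "123 Main St")] * 3)

total, distinct = conn.execute(
    "SELECT COUNT(Delivery_Address), COUNT(DISTINCT Delivery_Address) "
    "FROM t_Order_Info").fetchone()
print(total, distinct)  # three stored copies, one distinct value
```

Three rows, one real address. Scale that up and the waste adds up fast.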
Now we’re talkin’:
t_Customer_Info
- Customer_ID <- PK
- First_Name
- Last_Name
- Default_Billing_Address
- Default_Delivery_Address
t_Addresses
- Address_ID <- PK
- Customer_ID <- FK
- Address1
- Address2
- City
- State
- Zip
t_Order_Info
- Order_ID <- PK
- Customer_ID <- FK
- Delivery_Address <- FK
- Billing_Address <- FK
- … More fields related to the order
Now this is looking better. We can have recurring customers without having to duplicate their information. Each customer can have multiple addresses to work with. We could probably even add a field to differentiate between a billing address and a delivery address if we cared that much. Addresses can remain in the addresses table indefinitely for reporting purposes. We might add an active-address flag, a simple true/false our application could use to decide whether to show the address in a typical listing of the addresses associated with the customer.
The order info table references the information it needs, rather than keeping a copy all to itself. This is great for reporting: since the addresses exist indefinitely, you can pull an order history report and see where each order went and what address it was billed to.
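Here’s the whole final design wired up in sqlite3, ending with that order history report (the FK wiring is the point; the sample data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE t_Customer_Info (
        Customer_ID INTEGER PRIMARY KEY,
        First_Name  TEXT,
        Last_Name   TEXT,
        Default_Billing_Address  INTEGER,
        Default_Delivery_Address INTEGER
    );
    CREATE TABLE t_Addresses (
        Address_ID  INTEGER PRIMARY KEY,
        Customer_ID INTEGER REFERENCES t_Customer_Info (Customer_ID),
        Address1 TEXT, Address2 TEXT,
        City TEXT, State TEXT, Zip TEXT
    );
    CREATE TABLE t_Order_Info (
        Order_ID    INTEGER PRIMARY KEY,
        Customer_ID INTEGER REFERENCES t_Customer_Info (Customer_ID),
        Delivery_Address INTEGER REFERENCES t_Addresses (Address_ID),
        Billing_Address  INTEGER REFERENCES t_Addresses (Address_ID)
    );
""")

conn.execute("INSERT INTO t_Customer_Info (Customer_ID, First_Name, Last_Name) "
             "VALUES (1, 'John', 'Smith')")
conn.execute("INSERT INTO t_Addresses (Address_ID, Customer_ID, Address1, City, State, Zip) "
             "VALUES (10, 1, '123 Main St', 'Springfield', 'IL', '62701')")
# Two orders, but only one stored copy of the address
conn.executemany(
    "INSERT INTO t_Order_Info (Customer_ID, Delivery_Address, Billing_Address) "
    "VALUES (?, ?, ?)", [(1, 10, 10), (1, 10, 10)])

# Order history: where each order shipped and where it was billed
report = conn.execute("""
    SELECT o.Order_ID, c.Last_Name, d.Address1, b.Address1
    FROM t_Order_Info o
    JOIN t_Customer_Info c ON c.Customer_ID = o.Customer_ID
    JOIN t_Addresses d ON d.Address_ID = o.Delivery_Address
    JOIN t_Addresses b ON b.Address_ID = o.Billing_Address
""").fetchall()
print(report)
```

Note the double join against t_Addresses, once for the delivery address and once for the billing address; that’s the price of storing each address exactly once.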
This design will eventually run into problems when the tables become incredibly full. When you get to the size of Amazon, you’ll probably be using something different.
So, what makes this a decent design?
First, we’re limiting the duplication of data. Sure, some data will be duplicated. The state field, for example. Hell, even the entire set of address fields might be duplicated if one customer at Address A moves away and a second customer moves into Address A. That technically is duplication, but duplicated addresses between different customers is a level of duplication I’d be comfortable with.
As for the duplication of the state, I agree with risk, I don’t think it would make a whole lot of sense to create a lookup table for states. Assuming the US (because is there really a world outside of our borders? I think it’s fake news) we have 50 states, so the key would be a TINYINT. That’s one byte instead of the two a two-letter abbreviation takes, and one byte per row isn’t worth a join. Now if you were going to have a state lookup table that included state name, abbreviation, potential zip codes to help sanitize address input, and whatever else, now you might have something.
Now we’ve got more tables, more fields, and our queries are more complex. The complexity of the queries should only be a concern for the queries that are run on a regular basis (i.e. the queries in your application). If those queries are taking too long, they should be looked at. Given the keys in the tables, I think it’ll be a while before the tables get large enough to start negatively affecting query performance. When that does happen, you can look into getting more memory for the server, and/or start designing an archiving solution.
For ad-hoc queries, you can do things like create views…y’know, if you absolutely have to. Or they can build out their queries and save them for use later, like normal human beings. Personally, I’m loath to create database objects that make life easier for people who manually query the database. Chances are if you want the data now, you’ll probably want it again in the future. On first request, you’ll build your ad-hoc query, and seriously consider putting it in a secondary reporting application. On second request, you’ll actually do it.
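If you do cave and create a view for the manual-query crowd, it’s a thin layer over the design above (a sketch; the view name is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_Addresses (
        Address_ID INTEGER PRIMARY KEY,
        Customer_ID INTEGER, Address1 TEXT, City TEXT, State TEXT, Zip TEXT
    );
    CREATE TABLE t_Order_Info (
        Order_ID INTEGER PRIMARY KEY,
        Customer_ID INTEGER, Delivery_Address INTEGER, Billing_Address INTEGER
    );
    -- The ad-hoc crowd queries the view; the joins stay out of their hands
    CREATE VIEW v_Order_Delivery AS
        SELECT o.Order_ID, a.Address1, a.City, a.State, a.Zip
        FROM t_Order_Info o
        JOIN t_Addresses a ON a.Address_ID = o.Delivery_Address;
""")
conn.execute("INSERT INTO t_Addresses VALUES (10, 1, '123 Main St', 'Springfield', 'IL', '62701')")
conn.execute("INSERT INTO t_Order_Info VALUES (1, 1, 10, 10)")

rows = conn.execute("SELECT * FROM v_Order_Delivery").fetchall()
print(rows)
```

The view costs nothing at rest; it’s just a saved query the database runs on demand.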
Two more things before I go. First, these ID fields. I don’t know what it is about management, but they absolutely hate seeing vanilla numbers in ID fields. They want to see GUIDs. It’s a buzzword, and it touches the hearts and minds of people who don’t know any better. Don’t use GUIDs for ID fields. Use plain, vanilla ints. They’re faster for both humans and computers to read.
Second, if anyone reading this wants to play with larger sets of data, the StackExchange databases are available for download.