If you’ve taken that Database Design 101 class in college, you probably have learned that containing meaning in primary keys is not a good idea. The example that’s often given is a situation where a bank keeps track of its customers by associating them with a branch where they opened the account. This of course causes problems if the user moves to a different branch or if a branch closes altogether. These are pretty drastic scenarios and as developers, we are quickly scared away from putting meaning in keys. Or are we?
It is my personal opinion that SQL is one of the most human friendly languages. I’ve often fixed or helped fix problems with complicated queries by stepping back and explaining in human words what it is I want to retrieve from the database. My sociology knowledge narrows down to one class in college and my psychology experience is nonexistent, so these statements are pure conjecture: SQL, just as much as Venn diagrams simply make sense to people. As such, many times developers will make the leap from what they know to what they suppose is true (after all, we’re not Vulcans).
One big example of such behavior is MySQL’s AUTO_INCREMENT feature. AUTO_INCREMENT is usually used on a table field which becomes the primary key. Whenever a new record is entered, the field is assigned a new value, higher than the previous highest value in that field. Now, there is absolutely no reason a primary key should contain any notion of where in the series of records it lives.
A primary key only has one purpose: to uniquely identify a record in a table.
While using AUTO_INCREMENT gives you really nice features (like LAST_INSERT_ID() function or helping to avoid errors in a replicated system), for the purposes of building a primary key, it’s kind of overkill. All a primary key requires is to be unique.
Taking a step back: SQL was always meant to be very strict and have a very strict set of standards. Vendors who implemented SQL (such as MySQL, Oracle) are supposed to comply with these standards and certify that they comply with them. The standard SQL has become known as ANSI SQL in the late 1980’s as it was then when the American National Standards Institute adopted SQL as a standard. Unfortunately, this standard is not very well documented and/or the documentation of the standard is not easy to find. Given that there is a lot of dough to be made in software development and the enterpreneur nature of software companies, each SQL implementation slowly veered away from the standard, adding functions, finding new ways of solving problems, modifying things here and there. Depending on who you ask, these changes are portrayed as useful features added by the vendors or traps to lock clients to a specific RDBMS.
So how do all this politicking affect our primary keys? We simply now have pretty ways of creating them. In MySQL you have the already mentioned AUTO_INCREMENT. In Oracle, you can create a SEQUENCE which is a construct that holds ordered integers. Using it is very similar to AUTO_INCREMENT. You can also use the SYS_GUID() function which generates 16-byte long globally unique identifier (”GUID” is really a Microsoft word, what this function generates is really a UUID, but we can talk about the politics of this part another time). These values may look like this:
LOCATION_ID UID_COL
----------- ----------------------------------------
1000 7CD5B7769DF75CEFE034080020825436
1100 7CD5B7769DF85CEFE034080020825436
1200 7CD5B7769DF95CEFE034080020825436
1300 7CD5B7769DFA5CEFE034080020825436
On principle, I love using SYS_GUID() for primary keys. It is a true primary key with no meaning. But in practice, they’re a giant pain in the ass when you’re debugging queries or just want to quickly see what’s in a table. I think the PITA-ness of the UUIDs here is that they’re so hard for humans to read and parse. They add a layer of translation you have to do before you understand the data – this kind of goes against the idea that SQL is so easy to understand because it reads like a natural language.
Another side effect is the mixing of data retrieval with the creating of database objects. Often I stress the following:
- SQL: Structured Query Language – retrieval of data. Think SELECT.
- DML: Data Modification Language – modifying data. Think INSERT, UPDATE, DELETE.
- DDL: Data Definition Language – structure of data. Think CREATE, DROP, ALTER.
The ANSI SQL contains these three distinctions and standard SQL follows them. The reason for making a distinction is that mixing these three in any combination are mostly data integrity. You shouldn’t change the table structure at the same time as retrieving data. You shouldn’t modify data in a table when you’re also retrieving from it. The RDBMS will manage for you the distinction, so don’t worry about them, until of course, the RDBMS itself bypasses the rules.
Before databases offered fancy ways to increment values, you had to roll your own. You’d basically create a counter table that would be updated every time a new incremented number was required. Remember rollback and commit? Those only apply to DML – whenever you change data (but not data structure). Keeping all your data, including the primary key generation, as DML allows for keeping all your transactions atomic. Whenever you insert a new row into the main table, the counter table gets updated. You can rollback or commit the change in one step. It also means you’re less likely to encounter gaps in your counter and you can easily retrieve the maximum counter values.
MySQL takes this a little further with AUTO_INCREMENT. If you try to create a field in a table that is incremented automatically, but you don’t define it as a key, you’ll get the following error:
MySQL: ERROR 1075 (42000): Incorrect table definition; there can be only one auto column and it must be defined as a key
Oracle has sequences, which are separate constructs. Values are accessed as:
MY_SEQUENCE.next_val()
Every time you get a new value, it’s used and has weird behavior in terms of rollbacks and commits that may result in gaps in the sequence.
SYS_GUID(), as I mentioned, is totally awesome (if it weren’t such a pain). But it returns a new unique value every time. This is great for some scenarios, but not useful for others.
Over past few months, there’s been blog posts popping up here and there with ominous titles such as Is the Relational Database Doomed? and others indicating an end of an era. I think what most of these posts are really a knee-jerk reaction to is that high levels of normalization are kind of unnatural to human beings. Great for computers, but unnatural (quick, someone make a joke about natural joins!) Perhaps the next generation of database courses will teach that high levels of normalization are not good at all – data integrity may be important, but being able to retrieve that data really fast is far more rewarded. Think of it this way: keeping a list of all articles on your blog is really important, but when’s the last time someone looked at your blog post from 3 years ago?
While “premature optimization is the root of all evil”, sometimes optimizing in the very early stages is good as well. This especially rings true for systems that expect a lot of data, the kind that knows that they need to be horizontally scalable without single points of failure. Heavily normalized databases are often viwed as an obstacle that needs to be overcome (slow) instead of a benefit (you can ensure the integrity of data).
Most current database optimization techniques such as table partitioning, table sharding (withing 1 or many databases) and database replication depend on the meaning of keys or some other field to make the decision where the data goes. Perhaps all users whose usernames start with A live in db01. Or maybe all users in a certain region are spread evenly over four database machines, while all other users, not being the target market, get their own wimpy machine? And the example that’s often used to scared developers: perhaps it is a good idea to keep all customers of a branch separate from other branches – yes, you’ll take a hit if a branch closes or a customer moves to a new branch, but perhaps we know how to handle it outside of the database?
As you can see: keys with meaning aren’t good or bad – they simply serve a purpose. It is up to you, as a developer, to decide the best tool for the job. Good luck!