Maggie Nelson

databases and code goodness

  • Author: maggie
  • Published: Feb 22nd, 2009
  • Category: entry
  • Comments: 2

The Meaning of Keys

Tags: , , , ,

If you’ve taken that Database Design 101 class in college, you probably have learned that containing meaning in primary keys is not a good idea. The example that’s often given is a situation where a bank keeps track of its customers by associating them with a branch where they opened the account. This of course causes problems if the user moves to a different branch or if a branch closes altogether. These are pretty drastic scenarios and as developers, we are quickly scared away from putting meaning in keys. Or are we?

It is my personal opinion that SQL is one of the most human friendly languages. I’ve often fixed or helped fix problems with complicated queries by stepping back and explaining in human words what it is I want to retrieve from the database. My sociology knowledge narrows down to one class in college and my psychology experience is nonexistent, so these statements are pure conjecture: SQL, just as much as Venn diagrams simply make sense to people. As such, many times developers will make the leap from what they know to what they suppose is true (after all, we’re not Vulcans).

One big example of such behavior is MySQL’s AUTO_INCREMENT feature. AUTO_INCREMENT is usually used on a table field which becomes the primary key. Whenever a new record is entered, the field is assigned a new value, higher than the previous highest value in that field. Now, there is absolutely no reason a primary key should contain any notion of where in the series of records it lives.

A primary key only has one purpose: to uniquely identify a record in a table.

While using AUTO_INCREMENT gives you really nice features (like LAST_INSERT_ID() function or helping to avoid errors in a replicated system), for the purposes of building a primary key, it’s kind of overkill. All a primary key requires is to be unique.

Taking a step back: SQL was always meant to be very strict and have a very strict set of standards. Vendors who implemented SQL (such as MySQL, Oracle) are supposed to comply with these standards and certify that they comply with them. The standard SQL has become known as ANSI SQL in the late 1980’s as it was then when the American National Standards Institute adopted SQL as a standard. Unfortunately, this standard is not very well documented and/or the documentation of the standard is not easy to find. Given that there is a lot of dough to be made in software development and the enterpreneur nature of software companies, each SQL implementation slowly veered away from the standard, adding functions, finding new ways of solving problems, modifying things here and there. Depending on who you ask, these changes are portrayed as useful features added by the vendors or traps to lock clients to a specific RDBMS.

So how do all this politicking affect our primary keys? We simply now have pretty ways of creating them. In MySQL you have the already mentioned AUTO_INCREMENT. In Oracle, you can create a SEQUENCE which is a construct that holds ordered integers. Using it is very similar to AUTO_INCREMENT. You can also use the SYS_GUID() function which generates 16-byte long globally unique identifier (”GUID” is really a Microsoft word, what this function generates is really a UUID, but we can talk about the politics of this part another time). These values may look like this:

LOCATION_ID UID_COL
----------- ----------------------------------------
       1000 7CD5B7769DF75CEFE034080020825436
       1100 7CD5B7769DF85CEFE034080020825436
       1200 7CD5B7769DF95CEFE034080020825436
       1300 7CD5B7769DFA5CEFE034080020825436

On principle, I love using SYS_GUID() for primary keys. It is a true primary key with no meaning. But in practice, they’re a giant pain in the ass when you’re debugging queries or just want to quickly see what’s in a table. I think the PITA-ness of the UUIDs here is that they’re so hard for humans to read and parse. They add a layer of translation you have to do before you understand the data – this kind of goes against the idea that SQL is so easy to understand because it reads like a natural language.

Another side effect is the mixing of data retrieval with the creating of database objects. Often I stress the following:

  • SQL: Structured Query Language – retrieval of data. Think SELECT.
  • DML: Data Modification Language – modifying data. Think INSERT, UPDATE, DELETE.
  • DDL: Data Definition Language – structure of data. Think CREATE, DROP, ALTER.

The ANSI SQL contains these three distinctions and standard SQL follows them. The reason for making a distinction is that mixing these three in any combination are mostly data integrity. You shouldn’t change the table structure at the same time as retrieving data. You shouldn’t modify data in a table when you’re also retrieving from it. The RDBMS will manage for you the distinction, so don’t worry about them, until of course, the RDBMS itself bypasses the rules.

Before databases offered fancy ways to increment values, you had to roll your own. You’d basically create a counter table that would be updated every time a new incremented number was required. Remember rollback and commit? Those only apply to DML – whenever you change data (but not data structure). Keeping all your data, including the primary key generation, as DML allows for keeping all your transactions atomic. Whenever you insert a new row into the main table, the counter table gets updated. You can rollback or commit the change in one step. It also means you’re less likely to encounter gaps in your counter and you can easily retrieve the maximum counter values.

MySQL takes this a little further with AUTO_INCREMENT. If you try to create a field in a table that is incremented automatically, but you don’t define it as a key, you’ll get the following error:

MySQL: ERROR 1075 (42000): Incorrect table definition; there can be only one auto column and it must be defined as a key

Oracle has sequences, which are separate constructs. Values are accessed as:

MY_SEQUENCE.next_val()

Every time you get a new value, it’s used and has weird behavior in terms of rollbacks and commits that may result in gaps in the sequence.

SYS_GUID(), as I mentioned, is totally awesome (if it weren’t such a pain). But it returns a new unique value every time. This is great for some scenarios, but not useful for others.

Over past few months, there’s been blog posts popping up here and there with ominous titles such as Is the Relational Database Doomed? and others indicating an end of an era. I think what most of these posts are really a knee-jerk reaction to is that high levels of normalization are kind of unnatural to human beings. Great for computers, but unnatural (quick, someone make a joke about natural joins!) Perhaps the next generation of database courses will teach that high levels of normalization are not good at all – data integrity may be important, but being able to retrieve that data really fast is far more rewarded. Think of it this way: keeping a list of all articles on your blog is really important, but when’s the last time someone looked at your blog post from 3 years ago?

While “premature optimization is the root of all evil”, sometimes optimizing in the very early stages is good as well. This especially rings true for systems that expect a lot of data, the kind that knows that they need to be horizontally scalable without single points of failure. Heavily normalized databases are often viwed as an obstacle that needs to be overcome (slow) instead of a benefit (you can ensure the integrity of data).

Most current database optimization techniques such as table partitioning, table sharding (withing 1 or many databases) and database replication depend on the meaning of keys or some other field to make the decision where the data goes. Perhaps all users whose usernames start with A live in db01. Or maybe all users in a certain region are spread evenly over four database machines, while all other users, not being the target market, get their own wimpy machine? And the example that’s often used to scared developers: perhaps it is a good idea to keep all customers of a branch separate from other branches – yes, you’ll take a hit if a branch closes or a customer moves to a new branch, but perhaps we know how to handle it outside of the database?

As you can see: keys with meaning aren’t good or bad – they simply serve a purpose. It is up to you, as a developer, to decide the best tool for the job. Good luck!

  • Author: maggie
  • Published: Feb 7th, 2009
  • Category: entry
  • Comments: 21

Denormalization with Bitmasks

Tags: , , , , ,

This is an oldie, but goodie.

A lot of times you’ll find yourself retrieving lists of records, e.g. a list of users on your site, only to find out that each of those records, i.e. each user requires a retrieval of another list of records: roles and permissions, hobbies, pets, preferred languages etc.

This will either cause queries in loops (do a SELECT for every row returned from the first SELECT). Or you might find yourself writing overly complicated pivot queries. (My personal choice has always been writing my own global Oracle functions for data aggregation – talk about overkill!)

Even if you have great hardware, your database will inevitably become your bottleneck, so take care of your database: design and build to avoid unnecessary database access and once you’re in the database, get your data as fast as possible and get the heck out!

Back to basics is your solution and there’s really not many things more basic than bitmasks. Let’s say you have a social networking site (who doesn’t these days?). When users register (or at some point later on), they can specify what kind of pets they own. You can use this data to perhaps match those users – cat lovers like other cat lovers, right? (*shifty eyes*)

This is how the relationship between users and pets might look like in your database:

user_pet_erd

Let’s say you have the following data in those tables:

user:

+----+----------+
| id | username |
+----+----------+
|  1 | Maggie   |
|  2 | Sully    |
+----+----------+

pet:

+----+------------+
| id | name       |
+----+------------+
|  1 | cat        |
|  2 | dog        |
|  3 | lizard     |
|  4 | parakeet   |
|  5 | guinea pig |
|  6 | snake      |
|  7 | unicorn    |
|  8 | ferret     |
+----+------------+

user_pet:

+---------+--------+
| user_id | pet_id |
+---------+--------+
|       1 |      1 |
|       1 |      4 |
|       1 |      5 |
|       2 |      7 |
|       2 |      8 |
+---------+--------+

How do I find out easily what kinds of pets my users have? Oh, it’s easy:

Maggie has:

select p.name
  from user_pet up,
       pet p,
       user u
 where u.username = 'Maggie'
   and up.user_id = u.id
   and p.id = up.pet_id;

+------------+
| name       |
+------------+
| cat        |
| parakeet   |
| guinea pig |
+------------+

And Sully has*:

select p.name
  from user_pet up,
       pet p,
       user u
 where u.username = 'Sully'
   and up.user_id = u.id
   and p.id = up.pet_id;

+---------+
| name    |
+---------+
| unicorn |
| ferret  |
+---------+

* Of course Sully has a unicorn! (He also poops nanchucks.)

It’s easy to find out what kind of pets users have one at a time. Let’s do it in one big swoop though:

select u.username,
       p.name
  from user_pet up,
       pet p,
       user u
 where up.user_id = u.id
   and p.id = up.pet_id;

+----------+------------+
| username | name       |
+----------+------------+
| Maggie   | cat        |
| Maggie   | parakeet   |
| Maggie   | guinea pig |
| Sully    | unicorn    |
| Sully    | ferret     |
+----------+------------+

You get all the data you wanted, however, it’s all spread over many rows; some data aggregation is required! You’re probably thinking to yourself: “Oh, man, if only there were a function that works just like SUM() but for strings!”. If you’re using MySQL, you’re in luck thanks to the awesome GROUP_CONCAT() function. If you use it, you’ll get this:

select u.username, group_concat(p.name)
  from user_pet up,
       pet p,
       user u
 where up.user_id = u.id
   and p.id = up.pet_id
 group by u.username;

+----------+-------------------------+
| username | group_concat(p.name)    |
+----------+-------------------------+
| Maggie   | cat,guinea pig,parakeet |
| Sully    | ferret,unicorn          |
+----------+-------------------------+

Pretty swanky, eh? However, while group_concat() is totally awesome (oh, I love it so!), it’s only available in MySQL. (Although comments on this group_concat() praising blog post have an example of how to accomplish group_concat() in postgress, which you could also achieve in Oracle.) But I digress. Also, once you have the string representing all the pets, you’ll need to parse it in your application to get the pet names. And most importantly, this data is denormalized in a way that makes it a little difficult to enforce data integrity once you modify the list.

If group_concat() didn’t exist, what else can you do? Thinking of group_concat() as a sum() but for strings instead of numbers is an interesting approach. Next step: what do you have available out there that would yield a sum for a combination of values that could then be reverse engineered to get that unique combination of values again? That’s right, bitmasks!

How do you implement this in your user-pet scenario? It’s easy: first, represent bits in your bitmask as decimals on which you can later do math. Use a table to keep track of the bit-to-decimal translation for additional data integrity. Use those decimals as keys for your pets.

First, I create a table named power_of_two with the following values:

+----------+-------+
| exponent | power |
+----------+-------+
|        0 |     1 |
|        1 |     2 |
|        2 |     4 |
|        3 |     8 |
|        4 |    16 |
|        5 |    32 |
|        6 |    64 |
|        7 |   128 |
+----------+-------+

Then I use those values as IDs in the pet table:

+-----+------------+
| id  | name       |
+-----+------------+
|   1 | iguana     |
|   2 | cat        |
|   4 | dog        |
|   8 | lizard     |
|  16 | parakeet   |
|  32 | guinea pig |
|  64 | snake      |
| 128 | unicorn    |
| 256 | ferret     |
+-----+------------+

Note that I added an iguana as pet with id of 1 – this is 2 to the 0th power. I needed to add a pet at this spot to account for the rightmost place in my bitmasks (otherwise I have to shift everything by 1, which can be confusing).

We have 9 possible pets. If I have no pets, my bitmask will be 000000000 – or 0. If I have the first and the third pet, my bitmask will be 000000101 – or 5.

Assuming these changes to my database, let’s see what kinds of pets Maggie and Sully have:

select u.username, sum(p.id)
  from user_pet up,
       pet p,
       user u
 where up.user_id = u.id
   and p.id = up.pet_id
 group by u.username;

+----------+-----------+
| username | sum(p.id) |
+----------+-----------+
| Maggie   |        50 |
| Sully    |       384 |
+----------+-----------+

50 = 0b110010 – so Maggie has the 2nd, 4th and 5th pet. 384 = 0b110000000, so Sully has the 8th and the 9th pets.

Let’s assume your application heavily caches data that doesn’t change often – such as the pet table (not user_pet). Let’s say you have this cached:

$pets = array(
    1 => 'iguana',
    2 => 'cat',
    4 => 'dog',
    8 => 'lizard',
    16 => 'parakeet',
    32 => 'guinea pig',
    64 => 'snake',
    128 => 'unicorn',
    256 => 'ferret'
);

If your application already knows this:

Maggie: 50
Sully: 384

Then with a clever use of PHP’s decbin(), it should be really easy to display the right pet names.

This approach is great for when you have a very limited number of times you can connect to the database and once you’re there, you don’t have the luxury of running expensive aggregate queries. Also, web servers are easier to scale than coming up with error-proof database scaling solutions. I’ve also found this approach extremely useful when migrating lots of data about users (think millions of rows) from one system to another. Additionally, by having the power_of_two table, you’re able to be somewhat strict about the possible values for the user_pet table – this gives you slightly more data integrity than just using the concatenated string of names.

But the best part about using bitmasks is that you can do math on them! What’s easier to compare:

iguana,cat,dog,guinea pig,unicorn,ferret
to
iguana,cat,guinea pig,snake,unicorn,ferret

Here you have to possibly split the string on the comma into an array, then do array comparisons.

OR

110100111
111100011

Here you can use a bitwise AND to figure out the overlap!

Better yet, you can easily give people more pets!

110100111
OR
111111111

And now that person has all pets. And you didn’t even have to check if they had them before! Hooray for bitmasks!

Remember, as with many other optimization techniques, this one will not be appropriate for every scenario, but when you do need it, I promise it will work very very well!

P.S. I found it ironic to talk about a function that MySQL has but not Oracle – it’s usually the other way around. The str_agg (string aggregate) function I’ve had to write is so clunky – Oracle, please implement group_concat()!

  • Author: maggie
  • Published: Aug 21st, 2008
  • Category: entry
  • Comments: 4

To persist or not to persist?

Tags:

Recently, in a new PHP + MySQL application I’m building, I was faced with a choice of using persistent connections. The benefit of persistent connections is that you don’t have to keep opening new connections every time you need to hit the database for something. There’s a connection already waiting for you. Yay, right? Well, with MySQL, connecting is actually really really cheap, and frankly, if you are using persistent connections, you might encounter some issues with Apache going zombie on processes that use a connection, effectively taking that connection out of use. Grrr.
“To Google!”, I said. “The Interwebs must have a solution for this!” I’ve searched and searched trying to find a definite answer about whether to use or not to use persistent connections in a LAMP application. There seem to be lots of differing opinions, namely this famous presentation by Rasmus which basically “proves” that persistent connections are awesome. However, the PHP community seemed to be pretty against them, despite that evidence.
At this point, it was just annoying. Do I use persistent connections or not? Having this conversation internally with Ben Ramsey and Robert Swarthout, it seemed a good idea to have some hard numbers to back up our claims, so we wouldn’t have to say “well, guy x said so” as part of the argument.
Here are the results of Robert’s tests: Benchmarking of MySQL Persistent Connections vs Non-Persistent Connections. As Robert says:

Basically what the numbers above shows us is that in an isolated environment it makes no difference which connection type you are going to use.

Jay Pipes says that “if you use Apache, Apache can zombie a PHP process and cause the mysql connection to be held until the mysql server restarts…” Given the risk, the possible management overhead and the results of the test, we’re not going to use persistent connections on this upcoming project.

  • Author: maggie
  • Published: May 23rd, 2008
  • Category: entry
  • Comments: 2

Angering Database Gods

Tags: ,

Slides for Angering Database Gods from php|tek 2008.
I was really surprised how many people also have issues with ORM implementations in PHP frameworks! Also, give me coffee for 9am presentations…
For some follow-up reading, Sebastian recommends Why Active Record Sucks.

  • Author: maggie
  • Published: May 22nd, 2008
  • Category: entry
  • Comments: 4

Keeping Your Database and PHP in Sync

Tags: , ,

Slides for Keeping Your Database and PHP in Sync from php|tek 2008.
For those of you interested whether I’m writing this tool to release, check out dbMorph on SourceForge.

© 2010 Maggie Nelson. All Rights Reserved.

This blog is powered by the Wordpress platform and beach rentals.