Denormalization with Bitmasks
This is an oldie, but goodie.
A lot of times you'll find yourself retrieving lists of records, e.g. a list of users on your site, only to find out that each of those records, i.e. each user requires a retrieval of another list of records: roles and permissions, hobbies, pets, preferred languages etc.
This will either cause queries in loops (do a SELECT for every row returned from the first SELECT). Or you might find yourself writing overly complicated pivot queries. (My personal choice has always been writing my own global Oracle functions for data aggregation - talk about overkill!)
Even if you have great hardware, your database will inevitably become your bottleneck, so take care of your database: design and build to avoid unnecessary database access and once you're in the database, get your data as fast as possible and get the heck out!
Back to basics is your solution and there's really not many things more basic than bitmasks. Let's say you have a social networking site (who doesn't these days?). When users register (or at some point later on), they can specify what kind of pets they own. You can use this data to perhaps match those users - cat lovers like other cat lovers, right? (*shifty eyes*)
Let's say you have the following data in those tables:
user:
+----+----------+
| id | username |
+----+----------+
| 1 | Maggie |
| 2 | Sully |
+----+----------+
pet:
+----+------------+
| id | name |
+----+------------+
| 1 | cat |
| 2 | dog |
| 3 | lizard |
| 4 | parakeet |
| 5 | guinea pig |
| 6 | snake |
| 7 | unicorn |
| 8 | ferret |
+----+------------+
user_pet:
+---------+--------+
| user_id | pet_id |
+---------+--------+
| 1 | 1 |
| 1 | 4 |
| 1 | 5 |
| 2 | 7 |
| 2 | 8 |
+---------+--------+
How do I find out easily what kinds of pets my users have? Oh, it's easy:
Maggie has:
select p.name
from user_pet up,
pet p,
user u
where u.username = 'Maggie'
and up.user_id = u.id
and p.id = up.pet_id;
+------------+
| name |
+------------+
| cat |
| parakeet |
| guinea pig |
+------------+
And Sully has*:
select p.name
from user_pet up,
pet p,
user u
where u.username = 'Sully'
and up.user_id = u.id
and p.id = up.pet_id;
+---------+
| name |
+---------+
| unicorn |
| ferret |
+---------+
* Of course Sully has a unicorn! (He also poops nanchucks.)
It's easy to find out what kind of pets users have one at a time. Let's do it in one big swoop though:
select u.username, p.name from user_pet up, pet p, user u where up.user_id = u.id and p.id = up.pet_id; +----------+------------+ | username | name | +----------+------------+ | Maggie | cat | | Maggie | parakeet | | Maggie | guinea pig | | Sully | unicorn | | Sully | ferret | +----------+------------+
You get all the data you wanted, however, it's all spread over many rows; some data aggregation is required! You're probably thinking to yourself: "Oh, man, if only there were a function that works just like SUM() but for strings!". If you're using MySQL, you're in luck thanks to the awesome GROUP_CONCAT() function. If you use it, you'll get this:
select u.username, group_concat(p.name) from user_pet up, pet p, user u where up.user_id = u.id and p.id = up.pet_id group by u.username; +----------+-------------------------+ | username | group_concat(p.name) | +----------+-------------------------+ | Maggie | cat,guinea pig,parakeet | | Sully | ferret,unicorn | +----------+-------------------------+
Pretty swanky, eh? However, while group_concat() is totally awesome (oh, I love it so!), it's only available in MySQL. (Although comments on this group_concat() praising blog post have an example of how to accomplish group_concat() in postgress, which you could also achieve in Oracle.) But I digress. Also, once you have the string representing all the pets, you'll need to parse it in your application to get the pet names. And most importantly, this data is denormalized in a way that makes it a little difficult to enforce data integrity once you modify the list.
If group_concat() didn't exist, what else can you do? Thinking of group_concat() as a sum() but for strings instead of numbers is an interesting approach. Next step: what do you have available out there that would yield a sum for a combination of values that could then be reverse engineered to get that unique combination of values again? That's right, bitmasks!
How do you implement this in your user-pet scenario? It's easy: first, represent bits in your bitmask as decimals on which you can later do math. Use a table to keep track of the bit-to-decimal translation for additional data integrity. Use those decimals as keys for your pets.
First, I create a table named power_of_two with the following values:
+----------+-------+ | exponent | power | +----------+-------+ | 0 | 1 | | 1 | 2 | | 2 | 4 | | 3 | 8 | | 4 | 16 | | 5 | 32 | | 6 | 64 | | 7 | 128 | +----------+-------+
Then I use those values as IDs in the pet table:
+-----+------------+ | id | name | +-----+------------+ | 1 | iguana | | 2 | cat | | 4 | dog | | 8 | lizard | | 16 | parakeet | | 32 | guinea pig | | 64 | snake | | 128 | unicorn | | 256 | ferret | +-----+------------+
Note that I added an iguana as pet with id of 1 - this is 2 to the 0th power. I needed to add a pet at this spot to account for the rightmost place in my bitmasks (otherwise I have to shift everything by 1, which can be confusing).
We have 9 possible pets. If I have no pets, my bitmask will be 000000000 - or 0. If I have the first and the third pet, my bitmask will be 000000101 - or 5.
Assuming these changes to my database, let's see what kinds of pets Maggie and Sully have:
select u.username, sum(p.id) from user_pet up, pet p, user u where up.user_id = u.id and p.id = up.pet_id group by u.username; +----------+-----------+ | username | sum(p.id) | +----------+-----------+ | Maggie | 50 | | Sully | 384 | +----------+-----------+
50 = 0b110010 - so Maggie has the 2nd, 4th and 5th pet. 384 = 0b110000000, so Sully has the 8th and the 9th pets.
Let's assume your application heavily caches data that doesn't change often - such as the pet table (not user_pet). Let's say you have this cached:
$pets = array( 1 => 'iguana', 2 => 'cat', 4 => 'dog', 8 => 'lizard', 16 => 'parakeet', 32 => 'guinea pig', 64 => 'snake', 128 => 'unicorn', 256 => 'ferret' );
If your application already knows this:
Maggie: 50
Sully: 384
Then with a clever use of PHP's decbin(), it should be really easy to display the right pet names.
This approach is great for when you have a very limited number of times you can connect to the database and once you're there, you don't have the luxury of running expensive aggregate queries. Also, web servers are easier to scale than coming up with error-proof database scaling solutions. I've also found this approach extremely useful when migrating lots of data about users (think millions of rows) from one system to another. Additionally, by having the power_of_two table, you're able to be somewhat strict about the possible values for the user_pet table - this gives you slightly more data integrity than just using the concatenated string of names.
But the best part about using bitmasks is that you can do math on them! What's easier to compare:
iguana,cat,dog,guinea pig,unicorn,ferret
to
iguana,cat,guinea pig,snake,unicorn,ferret
Here you have to possibly split the string on the comma into an array, then do array comparisons.
OR
110100111
111100011
Here you can use a bitwise AND to figure out the overlap!
Better yet, you can easily give people more pets!
110100111
OR
111111111
And now that person has all pets. And you didn't even have to check if they had them before! Hooray for bitmasks!
Remember, as with many other optimization techniques, this one will not be appropriate for every scenario, but when you do need it, I promise it will work very very well!
P.S. I found it ironic to talk about a function that MySQL has but not Oracle - it's usually the other way around. The str_agg (string aggregate) function I've had to write is so clunky - Oracle, please implement group_concat()!
Comments (21)
2009-02-08 18:14:55 maggie said:
Raul - for one, I'd have to change my column names and I'm lazy :) Seriously though, I didn't use a natural join here because I don't usually use them at all. They tend to sometimes have issues with ambiguous results (see an example in Oracle documentation here). Since I don't use them often at all, I didn't think of using them here. The result set is very small, so any performance considerations go out the window (although now I'm curious about the performance benchmarks on inner join vs. natural join implementations across the common databases!) I'll see if WordPress enables comment preview!
2009-02-08 18:44:39 Raul Pedro Santos said:
I have to admit I don't usually use anything other than "id" for my primary keys, which forces me to use the USING or ON clauses along with JOINs ;) As for the problem you mention, some time ago I bumped into a strange when I upgraded an application to MySQL 5 and if I'm not mistaken, from 5.x on MySQL forces you to use parentheses precisely to prevent that ambiguity (the error I was seing was due to a JOIN that didn't have the parentheses, if I recall correctly). But yes, it's a perfectly valid concern. I thought about the size of the result set, too, but I guess that even with a small result set even a small performance gain can be advantageous when you're dealing with an application that has a high volume of traffic from the database - in other words, if the query gets executed a lot in a small amout of time.
2009-02-08 18:57:16 Michiel Hakvoort said:
Hi Maggie. Interesting read! I was thinking you could actually do this without denormalizing (if Oracle supports shifting :) your database by using select u.username, sum(1 << p.id) from user_pet up, pet p, user u where up.user_id = u.id and p.id = up.pet_id group by u.username; It is however somewhat less transparent and a (tiny) little more computationally intensive, but then again, using exclusively powers of two as keys in order to be able to generate bit masks isn't that transparent either :-p. If you really want your keys to remain under 32, you can always use a checks, triggers, or a relation containing the numbers 0 till 31 for that of course. Also, I'm wondering about one thing, as you mentioned "... And now that person has all pets. And you didn’t even have to check if they had them before! Hooray for bitmasks!" As this post is about database design, I'm wondering whether this is about assigning pets to a user in the database or, temporarily, in the application. (Of course, storing the user-pet relation in a bit mask itself and dropping the user_pet relation would yield poor performance when searching for other cat lovers :-p.) Finally, I must say I have to agree with SchizoDuckie on the particular use. Although it can be beneficial in some cases (small sets)... If you're going to have to cache values anyways I'd rather go for offline aggregating (and even offline joining) to get some performance increase, rather than adapting the database design to a solution which isn't really suited for further growth (in case of adding more animals, that is :-). But for the "some cases" I think the ability of using bit masks with the combination of normalization (perhaps a little de-normalization) is pretty elegant.
2009-02-08 23:01:45 Dave Steinberg said:
Great post Magda!
2009-02-09 06:22:46 Simon said:
I'd like to see it stressed that this approach is something of a last resort, and should only be taken if you already have real performance problems. Otherwise one would be merely obfuscating one's database for no gain - premature optimisation, really. I wonder if you could explain the value of the 'power_of_two' table? That's somewhat lost on me as, well, computers know that sort of thing already!
2009-02-09 10:15:20 maggie said:
Simon - I definitely agree. This is not a solution I would use for everything - as with many other optimization techniques, this one works great, but only under very specific circumstances. The power_of_two tables allows you quick lookups of the values instead of computing pow(2, x). You could also possibly add a check constraint on pet.id where the id has to be in the set of values available in power_of_two. It might be useful when calculating results for millions of rows - you wouldn't want to calculate the power of 2 for each of those. But again, this works for very specific circumstances.
2009-02-09 20:04:47 David said:
Alternatively, you can also use this syntax if you want to specify which columns are used in a natural join: SELECT * FROM user JOIN user_pet USING (user_id) JOIN pet USING (pet_id) This works great for the most part, and also removes duplicate columns, as the columns used in the join are collapsed. So you can use the column name in a WHERE clause without having to specify the table: SELECT * FROM user JOIN user_pet USING (user_id) JOIN pet USING (pet_id) WHERE user_id = 1 It's a handy syntax for small simple queries. Bitmasking is way cool though. As for SchizoDuckie's comment. A real programmer should know a bitmask when he sees one, so I wouldn't see it as a major issue. The problem is that many web developers aren't real programmers, and have no formal training in programming (myself included). Sure, we should keep things simple, but in some ways, this is the simpler solution. Or maybe using SET is?
2009-02-13 15:07:59 Log Buffer said:
Maggie Nelson elucidates what she deems an oldie but a goodie, denormalization with bitmasks. Very slick. Log Buffer #135
2009-02-08 03:15:51 Josh Davis said:
Forgive me for asking but isn't that a reimplementation of MySQL's SET type? http://dev.mysql.com/doc/refman/5.1/en/set.html
2009-02-08 06:14:06 PaulG said:
Awesome, thank you. Thank the lord - a decent example to explain the power of bitmasks on the forums. ps If you start from the top of the page your first user_pet table seems to have the power-of-two value as pet_id (FK) refs and does not match the table above, which confused me for a bit (Ahhh a pun!).
2009-02-08 07:35:48 SchizoDuckie said:
Okay, the use of group_concat i like, that's kind of cool. But why would you stuff away a normal SQL Join operation within bitmasked data? That's just machochism. I'm a huge fan of keeping the database just that. No weird constructions, make sure that a completely different app / programmer could just read your data. Before resorting to these kind of optimisations, you should already have explored serializing and putting data that's not being changed so often to file, memcache ánd mysql replication.
2009-02-08 08:25:13 Raul Pedro Santos said:
Great post, very interesting idea! Looks like you mixed some stuff up there in the beginning, by using the powers of two for the "pet_id"s in the "user_pet" table, while the "pet" table has sequencial ids. You might want to fix that :) I also have one question: why not use joins instead of manually matching the user_id?
2009-02-08 09:59:08 Brian DeShong said:
Just remember that, if you're dealing with a lot of values, you'll have to worry about the maximum integer that PHP can handle on your system (see the PHP_INT_MAX constant). The max is 2147483647 on 32-bit systems, which will allow for 31 values (er, 30?). So, if you've got 30+ values and are on a 32-bit system, this may not work out nicely for you. Though surely there is some way around that...it's only an integer. I guess you could use scientific notation or something along those lines.
2009-02-08 11:44:35 maggie said:
That's a very good point that I forgot to mention. This technique works best when dealing with values that you specify ahead of time in your application: pets users might have, hobbies they like - think of them as things you can put in in a drop-down menu.
2009-02-08 12:47:40 maggie said:
Josh - I'm not that familiar with SET in MySQL, but after reading about it a little, yes, it does seem quite similar. Thank you for pointing it out!
2009-02-08 12:50:14 maggie said:
PaulG - thanks! Just fixed it to use decimals instead of bits! (Nice pun there!)
2009-02-08 12:53:53 maggie said:
SchizoDuckie - the problem being solved here is not a simple join but being able to retrieve a result of a group by expression which results in many rows as one row per the group by field. The data in table PET and USER can be easily cached for a long time, but assuming a system where users often change their pet preferences, data in USER_PET will change often, hence caching might not be the best solution for retrieving this data.
2009-02-08 12:55:20 maggie said:
Raul - fixed the bugs. Thanks for pointing them out :) As for your question, could you specify which example you are referring to?
2009-02-08 16:04:34 Raul Pedro Santos said:
maggie, I mean any of the examples where you do [code]where up.user_id = u.id and p.id = up.pet_id[/code] in order to get the matching user/pet pairs. Why not do a natural inner join between the user and user_pet tables and a natural inner join between the resulting relation and the pet table? Something like this, which I'm writing completely from the top of my head, so I'm not sure it would work: [code]SELECT username, name FROM user INNER JOIN user_pet INNER JOIN pet[/code]
2009-02-08 16:05:46 Raul Pedro Santos said:
Oops, looks like the comments don't allow bbcode. Sorry. This brings up another suggestion: enable comment preview, so your visitors can make sure their comments will show up as they intended them to :)
2009-02-08 16:27:53 Raul Pedro Santos said:
By the way, I meant NATURAL JOIN and not INNER JOIN. Also, for this to work, column names had to be consistent across tables. That is, the "id" column in both the "user" and the "pet" tables should be named "user_id" and "pet_id", like the columns in the "user_pet" table. Specifically, the last query you wrote before introducing the bitmask change, would be this: SELECT username, group_concat(name) FROM user NATURAL JOIN user_pet NATURAL JOIN pet GROUP BY username;