<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Maggie Nelson &#187; denormalization</title>
	<atom:link href="http://maggienelson.com/tag/denormalization/feed/" rel="self" type="application/rss+xml" />
	<link>http://maggienelson.com</link>
	<description>databases and code goodness</description>
	<lastBuildDate>Tue, 06 Apr 2010 17:24:02 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Denormalization with Bitmasks</title>
		<link>http://maggienelson.com/2009/02/denormalization-with-bitmasks/</link>
		<comments>http://maggienelson.com/2009/02/denormalization-with-bitmasks/#comments</comments>
		<pubDate>Sun, 08 Feb 2009 03:48:43 +0000</pubDate>
		<dc:creator>maggie</dc:creator>
				<category><![CDATA[entry]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[denormalization]]></category>
		<category><![CDATA[optimization]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[tricks]]></category>

		<guid isPermaLink="false">http://maggienelson.com/?p=168</guid>
		<description><![CDATA[This is an oldie, but goodie.
A lot of times you&#8217;ll find yourself retrieving lists of records, e.g. a list of users on your site, only to find out that each of those records, i.e. each user, requires retrieving another list of records: roles and permissions, hobbies, pets, preferred languages, etc.
This will either cause [...]]]></description>
			<content:encoded><![CDATA[<p>This is an oldie, but goodie.</p>
<p>A lot of times you&#8217;ll find yourself retrieving lists of records, e.g. a list of users on your site, only to find out that each of those records, i.e. each user, requires retrieving another list of records: roles and permissions, hobbies, pets, preferred languages, etc.</p>
<p>This will either cause queries in loops (a SELECT for every row returned by the first SELECT), or you might find yourself writing overly complicated <a href="http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:31263576751669">pivot queries</a>.  (My personal choice has always been writing my own global Oracle functions for data aggregation &#8211; talk about overkill!)</p>
<p>Even if you have great hardware, your database will inevitably become your bottleneck, so take care of your database: design and build to avoid unnecessary database access, and once you&#8217;re in the database, get your data as fast as possible and get the heck out!</p>
<p>Going back to basics is your solution, and there really aren&#8217;t many things more basic than <a href="http://en.wikipedia.org/wiki/Mask_(computing)">bitmasks</a>.  Let&#8217;s say you have a social networking site (who doesn&#8217;t these days?).  When users register (or at some point later on), they can specify what kind of pets they own.  You can use this data to perhaps match those users &#8211; cat lovers like other cat lovers, right?  (*shifty eyes*)</p>
<p>This is what the relationship between users and pets might look like in your database:</p>
<p style="text-align: center;"><img class="size-full wp-image-172" title="user_pet_erd" src="http://maggienelson.com/blog/wp-content/uploads/2009/02/user_pet.png" alt="user_pet_erd" width="366" height="141" /></p>
<p>Let&#8217;s say you have the following data in those tables:</p>
<p>user:</p>
<pre>+----+----------+
| id | username |
+----+----------+
|  1 | Maggie   |
|  2 | Sully    |
+----+----------+</pre>
<p>pet:</p>
<pre>+----+------------+
| id | name       |
+----+------------+
|  1 | cat        |
|  2 | dog        |
|  3 | lizard     |
|  4 | parakeet   |
|  5 | guinea pig |
|  6 | snake      |
|  7 | unicorn    |
|  8 | ferret     |
+----+------------+</pre>
<p>user_pet:</p>
<pre>+---------+--------+
| user_id | pet_id |
+---------+--------+
|       1 |      1 |
|       1 |      4 |
|       1 |      5 |
|       2 |      7 |
|       2 |      8 |
+---------+--------+</pre>
<p>How do I find out easily what kinds of pets my users have?  Oh, it&#8217;s easy:</p>
<p>Maggie has:</p>
<pre>select p.name
  from user_pet up,
       pet p,
       user u
 where u.username = 'Maggie'
   and up.user_id = u.id
   and p.id = up.pet_id;

+------------+
| name       |
+------------+
| cat        |
| parakeet   |
| guinea pig |
+------------+</pre>
<p>And Sully has*:</p>
<pre>select p.name
  from user_pet up,
       pet p,
       user u
 where u.username = 'Sully'
   and up.user_id = u.id
   and p.id = up.pet_id;

+---------+
| name    |
+---------+
| unicorn |
| ferret  |
+---------+</pre>
<p>* Of course <a href="http://en.wikipedia.org/wiki/Chesley_Sullenberger">Sully</a> has a unicorn!  (He also poops nunchucks.)</p>
<p>It&#8217;s easy to find out what kind of pets users have one at a time.  Let&#8217;s do it in one fell swoop though:</p>
<pre>
select u.username,
       p.name
  from user_pet up,
       pet p,
       user u
 where up.user_id = u.id
   and p.id = up.pet_id;

+----------+------------+
| username | name       |
+----------+------------+
| Maggie   | cat        |
| Maggie   | parakeet   |
| Maggie   | guinea pig |
| Sully    | unicorn    |
| Sully    | ferret     |
+----------+------------+
</pre>
<p>You get all the data you wanted, but it&#8217;s spread over many rows; some data aggregation is required!  You&#8217;re probably thinking to yourself: &#8220;Oh, man, if only there were a function that works just like SUM() but for strings!&#8221;  If you&#8217;re using MySQL, you&#8217;re in luck thanks to the awesome <a href="http://dev.mysql.com/doc/refman/4.1/en/group-by-functions.html#function_group-concat">GROUP_CONCAT()</a> function.  If you use it, you&#8217;ll get this:</p>
<pre>
select u.username, group_concat(p.name)
  from user_pet up,
       pet p,
       user u
 where up.user_id = u.id
   and p.id = up.pet_id
 group by u.username;

+----------+-------------------------+
| username | group_concat(p.name)    |
+----------+-------------------------+
| Maggie   | cat,guinea pig,parakeet |
| Sully    | ferret,unicorn          |
+----------+-------------------------+
</pre>
<p>Pretty swanky, eh?  However, while group_concat() is totally awesome (oh, I love it so!), it&#8217;s a MySQL extension rather than standard SQL, so you can&#8217;t count on it in other databases.  (Although comments on <a href="http://db4free.blogspot.com/2006/01/hail-to-groupconcat.html">this group_concat() praising blog post</a> have an example of how to accomplish group_concat() in PostgreSQL, which you could also achieve in Oracle.)  But I digress.  Also, once you have the string representing all the pets, you&#8217;ll need to parse it in your application to get the pet names.  And most importantly, this data is denormalized in a way that makes it a little difficult to enforce data integrity once you modify the list.</p>
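<p>To make the parsing step concrete, here is a minimal sketch in Python (not the PHP used elsewhere in this post). The helper name parse_pet_list is made up for illustration, and it assumes no pet name ever contains a comma, which is exactly the kind of assumption that makes this format fragile.</p>

```python
def parse_pet_list(concatenated):
    """Split a GROUP_CONCAT-style string into a list of pet names.

    Assumption: names never contain the separator (a comma).
    """
    if not concatenated:
        return []
    return concatenated.split(",")


maggie_pets = parse_pet_list("cat,guinea pig,parakeet")
sully_pets = parse_pet_list("ferret,unicorn")
print(maggie_pets)
print(sully_pets)
```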
<p>If group_concat() didn&#8217;t exist, what else could you do?  Thinking of group_concat() as a sum() but for strings instead of numbers is an interesting approach.  Next step: what is out there that yields a sum for a combination of values and can then be reverse-engineered to recover that unique combination of values?  That&#8217;s right, bitmasks!</p>
<p>How do you implement this in your user-pet scenario?  It&#8217;s easy: first, represent each bit in your bitmask as a decimal number (a power of two) on which you can later do math.  Use a table to keep track of the bit-to-decimal translation for additional data integrity.  Use those decimals as keys for your pets.</p>
<p>First, I create a table named power_of_two with the following values:</p>
<pre>
+----------+-------+
| exponent | power |
+----------+-------+
|        0 |     1 |
|        1 |     2 |
|        2 |     4 |
|        3 |     8 |
|        4 |    16 |
|        5 |    32 |
|        6 |    64 |
|        7 |   128 |
|        8 |   256 |
+----------+-------+
</pre>
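<p>If you would rather generate those rows than type them, something like the following would do. This is only a sketch in Python; the function name power_of_two_rows is hypothetical, and nine rows are produced because the pet table in this example ends up with nine pets (ids 1 through 256).</p>

```python
def power_of_two_rows(count):
    """Return [(exponent, 2**exponent), ...] for exponents 0..count-1."""
    return [(exponent, 1 << exponent) for exponent in range(count)]


# Nine pets need exponents 0 through 8.
rows = power_of_two_rows(9)
for exponent, power in rows:
    print(f"{exponent:>8} | {power:>5}")
```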
<p>Then I use those values as IDs in the pet table:</p>
<pre>
+-----+------------+
| id  | name       |
+-----+------------+
|   1 | iguana     |
|   2 | cat        |
|   4 | dog        |
|   8 | lizard     |
|  16 | parakeet   |
|  32 | guinea pig |
|  64 | snake      |
| 128 | unicorn    |
| 256 | ferret     |
+-----+------------+
</pre>
<p>Note that I added an iguana as a pet with an id of 1 &#8211; this is 2 to the 0th power.  I needed a pet in this spot to account for the rightmost place in my bitmasks (otherwise I&#8217;d have to shift everything by 1, which can be confusing).</p>
<p>We have 9 possible pets.  If I have no pets, my bitmask will be 000000000 &#8211; or 0.  If I have the first and the third pet, my bitmask will be 000000101 &#8211; or 5.</p>
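<p>Here is a small sketch of how such a bitmask could be put together from pet ids. It is in Python rather than PHP, and build_mask is an illustrative name, not something from my application.</p>

```python
from functools import reduce
from operator import or_


def build_mask(pet_ids):
    """Combine power-of-two pet ids into a single bitmask with bitwise OR."""
    return reduce(or_, pet_ids, 0)


no_pets = build_mask([])
first_and_third = build_mask([1, 4])  # iguana (1) and dog (4)

# Show each mask as a 9-digit binary string next to its decimal value.
print(format(no_pets, "09b"), no_pets)
print(format(first_and_third, "09b"), first_and_third)
```

<p>Using OR instead of plain addition means a pet listed twice by accident does not corrupt the mask.</p>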
<p>Assuming these changes to my database (with the pet_id values in user_pet updated to the new IDs), let&#8217;s see what kinds of pets Maggie and Sully have:</p>
<pre>
select u.username, sum(p.id)
  from user_pet up,
       pet p,
       user u
 where up.user_id = u.id
   and p.id = up.pet_id
 group by u.username;

+----------+-----------+
| username | sum(p.id) |
+----------+-----------+
| Maggie   |        50 |
| Sully    |       384 |
+----------+-----------+
</pre>
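<p>If you want to try this without a MySQL server handy, the same query can be reproduced roughly like this. The sketch uses Python with an in-memory SQLite database instead of MySQL, writes the joins with explicit JOIN syntax, and assumes the user_pet rows have been re-keyed to the power-of-two pet ids.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "user" (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE pet (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_pet (user_id INTEGER, pet_id INTEGER);

    INSERT INTO "user" VALUES (1, 'Maggie'), (2, 'Sully');
    INSERT INTO pet VALUES
        (1, 'iguana'), (2, 'cat'), (4, 'dog'), (8, 'lizard'),
        (16, 'parakeet'), (32, 'guinea pig'), (64, 'snake'),
        (128, 'unicorn'), (256, 'ferret');
    -- Assumed re-keyed rows: Maggie has cat, parakeet, guinea pig;
    -- Sully has unicorn and ferret.
    INSERT INTO user_pet VALUES (1, 2), (1, 16), (1, 32), (2, 128), (2, 256);
""")

# Sum the power-of-two pet ids per user to get one bitmask per user.
rows = conn.execute("""
    SELECT u.username, SUM(p.id)
      FROM user_pet up
      JOIN pet p ON p.id = up.pet_id
      JOIN "user" u ON u.id = up.user_id
     GROUP BY u.username
     ORDER BY u.username
""").fetchall()
conn.close()

print(rows)
```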
<p>50 = 0b110010 &#8211; so, counting from the right, Maggie has the 2nd, 5th and 6th pets (cat, parakeet and guinea pig). 384 = 0b110000000, so Sully has the 8th and 9th pets (unicorn and ferret).</p>
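<p>To double-check which places are set in a mask, a quick sketch like this helps. It is in Python, set_positions is a made-up helper name, and positions are counted from the right starting at 1, so position 2 is cat, 5 is parakeet, 6 is guinea pig, 8 is unicorn and 9 is ferret.</p>

```python
def set_positions(mask):
    """Return 1-based positions (counted from the right) of the set bits."""
    positions = []
    position = 1
    while mask:
        if mask & 1:          # is the rightmost bit set?
            positions.append(position)
        mask >>= 1            # drop the rightmost bit
        position += 1
    return positions


print(bin(50), set_positions(50))
print(bin(384), set_positions(384))
```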
<p>Let&#8217;s assume your application heavily caches data that doesn&#8217;t change often &#8211; such as the pet table (not user_pet).  Let&#8217;s say you have this cached:</p>
<pre>
$pets = array(
    1 => 'iguana',
    2 => 'cat',
    4 => 'dog',
    8 => 'lizard',
    16 => 'parakeet',
    32 => 'guinea pig',
    64 => 'snake',
    128 => 'unicorn',
    256 => 'ferret'
);
</pre>
<p>If your application already knows this:</p>
<p>Maggie: 50<br />
Sully: 384</p>
<p>Then with a clever use of PHP&#8217;s <a href="http://us.php.net/decbin">decbin()</a>, it should be really easy to display the right pet names.</p>
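<p>For what it&#8217;s worth, here is a minimal sketch of that decoding step. It is written in Python rather than PHP, and instead of converting to a binary string the way decbin() would, it tests each key of the cached array against the mask with a bitwise AND. The names PETS and pet_names are made up for illustration.</p>

```python
# Cached copy of the pet table, keyed by power-of-two id.
PETS = {
    1: "iguana",
    2: "cat",
    4: "dog",
    8: "lizard",
    16: "parakeet",
    32: "guinea pig",
    64: "snake",
    128: "unicorn",
    256: "ferret",
}


def pet_names(mask, pets=PETS):
    """Return the names of all pets whose bit is set in mask, lowest bit first."""
    return [name for bit, name in sorted(pets.items()) if mask & bit]


print(pet_names(50))
print(pet_names(384))
```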
<p>This approach is great when you can only connect to the database a very limited number of times and, once you&#8217;re there, you don&#8217;t have the luxury of running expensive aggregate queries.  Also, web servers are easier to scale than databases, where error-proof scaling solutions are hard to come up with.  I&#8217;ve also found this approach extremely useful when migrating lots of data about users (think millions of rows) from one system to another.  Additionally, by having the power_of_two table, you can be somewhat strict about the possible values for the user_pet table &#8211; this gives you slightly more data integrity than just using the concatenated string of names.</p>
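<p>The same kind of strictness can be mirrored in application code. The following is only a sketch in Python under the assumption that there are exactly nine known pets; is_power_of_two, is_valid_mask and KNOWN_BITS are illustrative names, not part of any real schema.</p>

```python
KNOWN_BITS = 0b111111111  # assumption: nine pets, ids 1 through 256


def is_power_of_two(value):
    """True if value could be a key in a power_of_two table."""
    return value > 0 and (value & (value - 1)) == 0


def is_valid_mask(mask):
    """True if mask only uses bits that belong to a known pet."""
    return mask >= 0 and (mask & ~KNOWN_BITS) == 0


print(is_power_of_two(16), is_power_of_two(50))
print(is_valid_mask(50), is_valid_mask(512))
```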
<p>But the best part about using bitmasks is that you can do math on them!  What&#8217;s easier to compare:</p>
<p>iguana,cat,dog,guinea pig,unicorn,ferret<br />
to<br />
iguana,cat,guinea pig,snake,unicorn,ferret</p>
<p>Here you&#8217;d probably have to split each string on the comma into an array, then do array comparisons.</p>
<p>OR</p>
<p>110100111<br />
111100011</p>
<p>Here you can use a bitwise AND to figure out the overlap!</p>
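<p>As a rough sketch (Python again, with the cached pet array assumed to be available as PETS and a made-up pet_names helper), the overlap between those two masks works out like this:</p>

```python
PETS = {
    1: "iguana", 2: "cat", 4: "dog", 8: "lizard", 16: "parakeet",
    32: "guinea pig", 64: "snake", 128: "unicorn", 256: "ferret",
}


def pet_names(mask):
    """Return the names whose bit is set in mask, lowest bit first."""
    return [name for bit, name in sorted(PETS.items()) if mask & bit]


first = 0b110100111   # iguana, cat, dog, guinea pig, unicorn, ferret
second = 0b111100011  # iguana, cat, guinea pig, snake, unicorn, ferret

# Bitwise AND keeps only the bits set in both masks.
shared = first & second
print(format(shared, "09b"))
print(pet_names(shared))
```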
<p>Better yet, you can easily give people more pets!</p>
<p>110100111<br />
OR<br />
111111111</p>
<p>And now that person has all pets.  And you didn&#8217;t even have to check if they had them before!  Hooray for bitmasks!</p>
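<p>A minimal sketch of that, in Python with an illustrative give_pets helper: OR-ing is idempotent, so giving someone a pet they already own changes nothing.</p>

```python
ALL_PETS = 0b111111111  # every one of the nine pet bits set


def give_pets(mask, new_pets):
    """Return mask with the bits in new_pets switched on."""
    return mask | new_pets


owner = 0b110100111
everything = give_pets(owner, ALL_PETS)

maggie = give_pets(50, 4)            # Maggie (50) gets a dog (4)
maggie_again = give_pets(maggie, 4)  # giving the dog twice is harmless

print(format(everything, "09b"))
print(maggie, maggie_again)
```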
<p>Remember, as with many other optimization techniques, this one will not be appropriate for every scenario, but when you do need it, I promise it will work very very well!</p>
<p>P.S. I found it ironic to talk about a function that MySQL has but not Oracle &#8211; it&#8217;s usually the other way around.  The str_agg (string aggregate) function I&#8217;ve had to write is so clunky &#8211; Oracle, please implement group_concat()!</p>
]]></content:encoded>
			<wfw:commentRss>http://maggienelson.com/2009/02/denormalization-with-bitmasks/feed/</wfw:commentRss>
		<slash:comments>21</slash:comments>
		</item>
	</channel>
</rss>
