Maggie Nelson

databases and code goodness

  • Author: maggie
  • Published: Feb 26th, 2009
  • Category: entry
  • Comments: 5

Finally, pretty syntax highlighting for blog posts!

Tags: , , , , ,

A coworker, Craig Campbell just launched a new blog recently. One of the really neat things about his blog is how he handles syntax highlighting for code samples – check out examples in his interesting post about Cool Object Building with PHP. In fact, he got so many good comments about this approach that he even wrote a post explaining exactly how he does it: Syntax Highlighting for Your Blog Using TextMate. (It even works in Internet Exploder!)

I’ve been looking for a good solution for syntax highlighting for code samples I post on this blog since my move to WordPress, so his solution certainly looks appealing. I use vim as my main editor and I’ve been staying away from TextMate for quite some time. However, there seems to be a lot of interest in creating vi plugins for TextMate that would emulate vi behavior, such as TextMate vi Plugin. If you’ve ever used and loved vi/vim, you know that it’s hard to move to an editor that requires the use of a mouse for text editing. I finally caved in 2 years ago and moved from Linux to Mac as my primary development environment, so perhaps checking out TextMate is a good idea.

I’m still keeping pine/alpine for reading e-mail though!

  • Author: maggie
  • Published: Feb 22nd, 2009
  • Category: entry
  • Comments: 2

The Meaning of Keys

Tags: , , , ,

If you’ve taken that Database Design 101 class in college, you probably have learned that containing meaning in primary keys is not a good idea. The example that’s often given is a situation where a bank keeps track of its customers by associating them with a branch where they opened the account. This of course causes problems if the user moves to a different branch or if a branch closes altogether. These are pretty drastic scenarios and as developers, we are quickly scared away from putting meaning in keys. Or are we?

It is my personal opinion that SQL is one of the most human friendly languages. I’ve often fixed or helped fix problems with complicated queries by stepping back and explaining in human words what it is I want to retrieve from the database. My sociology knowledge narrows down to one class in college and my psychology experience is nonexistent, so these statements are pure conjecture: SQL, just as much as Venn diagrams simply make sense to people. As such, many times developers will make the leap from what they know to what they suppose is true (after all, we’re not Vulcans).

One big example of such behavior is MySQL’s AUTO_INCREMENT feature. AUTO_INCREMENT is usually used on a table field which becomes the primary key. Whenever a new record is entered, the field is assigned a new value, higher than the previous highest value in that field. Now, there is absolutely no reason a primary key should contain any notion of where in the series of records it lives.

A primary key only has one purpose: to uniquely identify a record in a table.

While using AUTO_INCREMENT gives you really nice features (like LAST_INSERT_ID() function or helping to avoid errors in a replicated system), for the purposes of building a primary key, it’s kind of overkill. All a primary key requires is to be unique.

Taking a step back: SQL was always meant to be very strict and have a very strict set of standards. Vendors who implemented SQL (such as MySQL, Oracle) are supposed to comply with these standards and certify that they comply with them. The standard SQL has become known as ANSI SQL in the late 1980’s as it was then when the American National Standards Institute adopted SQL as a standard. Unfortunately, this standard is not very well documented and/or the documentation of the standard is not easy to find. Given that there is a lot of dough to be made in software development and the enterpreneur nature of software companies, each SQL implementation slowly veered away from the standard, adding functions, finding new ways of solving problems, modifying things here and there. Depending on who you ask, these changes are portrayed as useful features added by the vendors or traps to lock clients to a specific RDBMS.

So how do all this politicking affect our primary keys? We simply now have pretty ways of creating them. In MySQL you have the already mentioned AUTO_INCREMENT. In Oracle, you can create a SEQUENCE which is a construct that holds ordered integers. Using it is very similar to AUTO_INCREMENT. You can also use the SYS_GUID() function which generates 16-byte long globally unique identifier (”GUID” is really a Microsoft word, what this function generates is really a UUID, but we can talk about the politics of this part another time). These values may look like this:

LOCATION_ID UID_COL
----------- ----------------------------------------
       1000 7CD5B7769DF75CEFE034080020825436
       1100 7CD5B7769DF85CEFE034080020825436
       1200 7CD5B7769DF95CEFE034080020825436
       1300 7CD5B7769DFA5CEFE034080020825436

On principle, I love using SYS_GUID() for primary keys. It is a true primary key with no meaning. But in practice, they’re a giant pain in the ass when you’re debugging queries or just want to quickly see what’s in a table. I think the PITA-ness of the UUIDs here is that they’re so hard for humans to read and parse. They add a layer of translation you have to do before you understand the data – this kind of goes against the idea that SQL is so easy to understand because it reads like a natural language.

Another side effect is the mixing of data retrieval with the creating of database objects. Often I stress the following:

  • SQL: Structured Query Language – retrieval of data. Think SELECT.
  • DML: Data Modification Language – modifying data. Think INSERT, UPDATE, DELETE.
  • DDL: Data Definition Language – structure of data. Think CREATE, DROP, ALTER.

The ANSI SQL contains these three distinctions and standard SQL follows them. The reason for making a distinction is that mixing these three in any combination are mostly data integrity. You shouldn’t change the table structure at the same time as retrieving data. You shouldn’t modify data in a table when you’re also retrieving from it. The RDBMS will manage for you the distinction, so don’t worry about them, until of course, the RDBMS itself bypasses the rules.

Before databases offered fancy ways to increment values, you had to roll your own. You’d basically create a counter table that would be updated every time a new incremented number was required. Remember rollback and commit? Those only apply to DML – whenever you change data (but not data structure). Keeping all your data, including the primary key generation, as DML allows for keeping all your transactions atomic. Whenever you insert a new row into the main table, the counter table gets updated. You can rollback or commit the change in one step. It also means you’re less likely to encounter gaps in your counter and you can easily retrieve the maximum counter values.

MySQL takes this a little further with AUTO_INCREMENT. If you try to create a field in a table that is incremented automatically, but you don’t define it as a key, you’ll get the following error:

MySQL: ERROR 1075 (42000): Incorrect table definition; there can be only one auto column and it must be defined as a key

Oracle has sequences, which are separate constructs. Values are accessed as:

MY_SEQUENCE.next_val()

Every time you get a new value, it’s used and has weird behavior in terms of rollbacks and commits that may result in gaps in the sequence.

SYS_GUID(), as I mentioned, is totally awesome (if it weren’t such a pain). But it returns a new unique value every time. This is great for some scenarios, but not useful for others.

Over past few months, there’s been blog posts popping up here and there with ominous titles such as Is the Relational Database Doomed? and others indicating an end of an era. I think what most of these posts are really a knee-jerk reaction to is that high levels of normalization are kind of unnatural to human beings. Great for computers, but unnatural (quick, someone make a joke about natural joins!) Perhaps the next generation of database courses will teach that high levels of normalization are not good at all – data integrity may be important, but being able to retrieve that data really fast is far more rewarded. Think of it this way: keeping a list of all articles on your blog is really important, but when’s the last time someone looked at your blog post from 3 years ago?

While “premature optimization is the root of all evil”, sometimes optimizing in the very early stages is good as well. This especially rings true for systems that expect a lot of data, the kind that knows that they need to be horizontally scalable without single points of failure. Heavily normalized databases are often viwed as an obstacle that needs to be overcome (slow) instead of a benefit (you can ensure the integrity of data).

Most current database optimization techniques such as table partitioning, table sharding (withing 1 or many databases) and database replication depend on the meaning of keys or some other field to make the decision where the data goes. Perhaps all users whose usernames start with A live in db01. Or maybe all users in a certain region are spread evenly over four database machines, while all other users, not being the target market, get their own wimpy machine? And the example that’s often used to scared developers: perhaps it is a good idea to keep all customers of a branch separate from other branches – yes, you’ll take a hit if a branch closes or a customer moves to a new branch, but perhaps we know how to handle it outside of the database?

As you can see: keys with meaning aren’t good or bad – they simply serve a purpose. It is up to you, as a developer, to decide the best tool for the job. Good luck!

  • Author: maggie
  • Published: Feb 7th, 2009
  • Category: entry
  • Comments: 21

Denormalization with Bitmasks

Tags: , , , , ,

This is an oldie, but goodie.

A lot of times you’ll find yourself retrieving lists of records, e.g. a list of users on your site, only to find out that each of those records, i.e. each user requires a retrieval of another list of records: roles and permissions, hobbies, pets, preferred languages etc.

This will either cause queries in loops (do a SELECT for every row returned from the first SELECT). Or you might find yourself writing overly complicated pivot queries. (My personal choice has always been writing my own global Oracle functions for data aggregation – talk about overkill!)

Even if you have great hardware, your database will inevitably become your bottleneck, so take care of your database: design and build to avoid unnecessary database access and once you’re in the database, get your data as fast as possible and get the heck out!

Back to basics is your solution and there’s really not many things more basic than bitmasks. Let’s say you have a social networking site (who doesn’t these days?). When users register (or at some point later on), they can specify what kind of pets they own. You can use this data to perhaps match those users – cat lovers like other cat lovers, right? (*shifty eyes*)

This is how the relationship between users and pets might look like in your database:

user_pet_erd

Let’s say you have the following data in those tables:

user:

+----+----------+
| id | username |
+----+----------+
|  1 | Maggie   |
|  2 | Sully    |
+----+----------+

pet:

+----+------------+
| id | name       |
+----+------------+
|  1 | cat        |
|  2 | dog        |
|  3 | lizard     |
|  4 | parakeet   |
|  5 | guinea pig |
|  6 | snake      |
|  7 | unicorn    |
|  8 | ferret     |
+----+------------+

user_pet:

+---------+--------+
| user_id | pet_id |
+---------+--------+
|       1 |      1 |
|       1 |      4 |
|       1 |      5 |
|       2 |      7 |
|       2 |      8 |
+---------+--------+

How do I find out easily what kinds of pets my users have? Oh, it’s easy:

Maggie has:

select p.name
  from user_pet up,
       pet p,
       user u
 where u.username = 'Maggie'
   and up.user_id = u.id
   and p.id = up.pet_id;

+------------+
| name       |
+------------+
| cat        |
| parakeet   |
| guinea pig |
+------------+

And Sully has*:

select p.name
  from user_pet up,
       pet p,
       user u
 where u.username = 'Sully'
   and up.user_id = u.id
   and p.id = up.pet_id;

+---------+
| name    |
+---------+
| unicorn |
| ferret  |
+---------+

* Of course Sully has a unicorn! (He also poops nanchucks.)

It’s easy to find out what kind of pets users have one at a time. Let’s do it in one big swoop though:

select u.username,
       p.name
  from user_pet up,
       pet p,
       user u
 where up.user_id = u.id
   and p.id = up.pet_id;

+----------+------------+
| username | name       |
+----------+------------+
| Maggie   | cat        |
| Maggie   | parakeet   |
| Maggie   | guinea pig |
| Sully    | unicorn    |
| Sully    | ferret     |
+----------+------------+

You get all the data you wanted, however, it’s all spread over many rows; some data aggregation is required! You’re probably thinking to yourself: “Oh, man, if only there were a function that works just like SUM() but for strings!”. If you’re using MySQL, you’re in luck thanks to the awesome GROUP_CONCAT() function. If you use it, you’ll get this:

select u.username, group_concat(p.name)
  from user_pet up,
       pet p,
       user u
 where up.user_id = u.id
   and p.id = up.pet_id
 group by u.username;

+----------+-------------------------+
| username | group_concat(p.name)    |
+----------+-------------------------+
| Maggie   | cat,guinea pig,parakeet |
| Sully    | ferret,unicorn          |
+----------+-------------------------+

Pretty swanky, eh? However, while group_concat() is totally awesome (oh, I love it so!), it’s only available in MySQL. (Although comments on this group_concat() praising blog post have an example of how to accomplish group_concat() in postgress, which you could also achieve in Oracle.) But I digress. Also, once you have the string representing all the pets, you’ll need to parse it in your application to get the pet names. And most importantly, this data is denormalized in a way that makes it a little difficult to enforce data integrity once you modify the list.

If group_concat() didn’t exist, what else can you do? Thinking of group_concat() as a sum() but for strings instead of numbers is an interesting approach. Next step: what do you have available out there that would yield a sum for a combination of values that could then be reverse engineered to get that unique combination of values again? That’s right, bitmasks!

How do you implement this in your user-pet scenario? It’s easy: first, represent bits in your bitmask as decimals on which you can later do math. Use a table to keep track of the bit-to-decimal translation for additional data integrity. Use those decimals as keys for your pets.

First, I create a table named power_of_two with the following values:

+----------+-------+
| exponent | power |
+----------+-------+
|        0 |     1 |
|        1 |     2 |
|        2 |     4 |
|        3 |     8 |
|        4 |    16 |
|        5 |    32 |
|        6 |    64 |
|        7 |   128 |
+----------+-------+

Then I use those values as IDs in the pet table:

+-----+------------+
| id  | name       |
+-----+------------+
|   1 | iguana     |
|   2 | cat        |
|   4 | dog        |
|   8 | lizard     |
|  16 | parakeet   |
|  32 | guinea pig |
|  64 | snake      |
| 128 | unicorn    |
| 256 | ferret     |
+-----+------------+

Note that I added an iguana as pet with id of 1 – this is 2 to the 0th power. I needed to add a pet at this spot to account for the rightmost place in my bitmasks (otherwise I have to shift everything by 1, which can be confusing).

We have 9 possible pets. If I have no pets, my bitmask will be 000000000 – or 0. If I have the first and the third pet, my bitmask will be 000000101 – or 5.

Assuming these changes to my database, let’s see what kinds of pets Maggie and Sully have:

select u.username, sum(p.id)
  from user_pet up,
       pet p,
       user u
 where up.user_id = u.id
   and p.id = up.pet_id
 group by u.username;

+----------+-----------+
| username | sum(p.id) |
+----------+-----------+
| Maggie   |        50 |
| Sully    |       384 |
+----------+-----------+

50 = 0b110010 – so Maggie has the 2nd, 4th and 5th pet. 384 = 0b110000000, so Sully has the 8th and the 9th pets.

Let’s assume your application heavily caches data that doesn’t change often – such as the pet table (not user_pet). Let’s say you have this cached:

$pets = array(
    1 => 'iguana',
    2 => 'cat',
    4 => 'dog',
    8 => 'lizard',
    16 => 'parakeet',
    32 => 'guinea pig',
    64 => 'snake',
    128 => 'unicorn',
    256 => 'ferret'
);

If your application already knows this:

Maggie: 50
Sully: 384

Then with a clever use of PHP’s decbin(), it should be really easy to display the right pet names.

This approach is great for when you have a very limited number of times you can connect to the database and once you’re there, you don’t have the luxury of running expensive aggregate queries. Also, web servers are easier to scale than coming up with error-proof database scaling solutions. I’ve also found this approach extremely useful when migrating lots of data about users (think millions of rows) from one system to another. Additionally, by having the power_of_two table, you’re able to be somewhat strict about the possible values for the user_pet table – this gives you slightly more data integrity than just using the concatenated string of names.

But the best part about using bitmasks is that you can do math on them! What’s easier to compare:

iguana,cat,dog,guinea pig,unicorn,ferret
to
iguana,cat,guinea pig,snake,unicorn,ferret

Here you have to possibly split the string on the comma into an array, then do array comparisons.

OR

110100111
111100011

Here you can use a bitwise AND to figure out the overlap!

Better yet, you can easily give people more pets!

110100111
OR
111111111

And now that person has all pets. And you didn’t even have to check if they had them before! Hooray for bitmasks!

Remember, as with many other optimization techniques, this one will not be appropriate for every scenario, but when you do need it, I promise it will work very very well!

P.S. I found it ironic to talk about a function that MySQL has but not Oracle – it’s usually the other way around. The str_agg (string aggregate) function I’ve had to write is so clunky – Oracle, please implement group_concat()!

  • Author: maggie
  • Published: Jan 25th, 2009
  • Category: entry
  • Comments: 6

From MovableType to WordPress in 301 Easy Steps

Tags:

I’m finally caving in and switching to WordPress.  I tried it long long time ago and was annoyed by some of its code (it’s not always pretty) and the fact that pages are never cached – there are a lot of on-demand database requests.  But it’s been a while and things seem to have improved – some within WordPress itself, but others due to the help of the community which has provided tons of plugins that can help WordPress get around some of its problems.

The move from MovableType to WordPress was easy.  WordPress has import functionality that plays very nicely with MovableType’s exported files.  Yay!

What got me, though, is how to handle making sure that visitors to my old site would know to come to the new site.  Long time ago, I’d think to just take everything down and have a page saying “stuff has moved, deal with it, here’s the new URL”, but that’s not really very user friendly.  Michael Bloch points out here the not-so-good approaches for moving your site and recommends using the HTTP status code 301: Moved Permanently.  This will make everything magially work: users will be redirected properly and search engines will know that the resources have moved and start looking for them in the new places.

Read the rest of this entry »

Seven Things

Tags:

Everyone likes an Interweb meme! I was tagged by Elizabeth Naramore (thanks! and my jokes are totally awesome!) with Seven Things, so I’m supposed to tell you seven things about me. So here we go!

  • I still drive my first car: an awesome ‘92 Ford Crown Victoria, silver with stylish red vinyl on the inside. It caught on fire once but we put it out with a cup of coffee (the amazing part being convincing a Seattle native to surrender his coffee for any cause).
  • I originally wanted to study math in college, but then I found a job at an Interwebs startup called ScreamingMedia, where I learned PHP (yay, .php3 files!). Eventually, I switched my major to computer science.
  • I once worked at Amazon.com, on the Books/Music/Videos/DVD parts of the site where the QA department featured a real rocket scientist (she used to work for NASA).
  • I know how to construct a trebuchet, use a blow torch and have built a small remote-controlled solar-powered car once.
  • My first interaction with a computer was on a green screen, typing dir *.*
  • I had a story published in The New York Times once (in the “Metropolitan Diary” column, but still!)
  • I miss the tiled website backgrounds of the late 1990’s.

I am hereby tagging these individuals:

  • Matt Purdon for always staying on top of new technology and never settling for comfort when coding.
  • Cal Evans for always starting his day with a cheerful Twitter message.
  • Paul Reinheimer for bringing the meat from America’s Hat.
  • Ed Finkler for nicer web design than the rest of us combined!
  • Brandon Savage for pointing out that I can’t count to 7!

These are the rules apparently:

  • Link your original tagger(s), and list these rules on your blog.
  • Share seven facts about yourself in the post – some random, some wierd.
  • Tag seven people at the end of your post by leaving their names and the links to their blogs.
  • Let them know they’ve been tagged by leaving a comment on their blogs and/or Twitter.

© 2010 Maggie Nelson. All Rights Reserved.

This blog is powered by the Wordpress platform and beach rentals.