Maggie Nelson

databases and code goodness

  • Author: maggie
  • Published: Jul 14th, 2009
  • Category: entry
  • Comments: 5

More distributed key/value storage options

Tags: , , , , , , ,

CouchDB has infected me and I’ve been reading a lot about alternative ways to store data AND organize it. In the midst of options for alternatives to relational databases, these two stand out:

Cassandra – “Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google’s BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.”

The huge appeal of Cassandra seems to be the approach to make it highly fault-tolerant. Writes never fail. Data is always available. No single point of failure. If you’re making a Twitter-like app, you should consider it.

Tokyo Cabinet – “Tokyo Cabinet uses hash algorithm to retrieve records. If a bucket array has sufficient number of elements, the time complexity of retrieval is “O(1)”. That is, time required for retrieving a record is constant, regardless of the scale of a database. It is also the same about storing and deleting. Collision of hash values is managed by separate chaining. Data structure of the chains is binary search tree. Even if a bucket array has unusually scarce elements, the time complexity of retrieval is “O(log n)”.”

Tokyo Cabinet is slightly newer and is apparently stupidly fast, faster than any other storage solutions out there (at least for now). It’s written in C and provided as API of C, Perl, Ruby, Java, and Lua.

Do you know anyone who has used these two already? Care to share your experiences?

  • Author: maggie
  • Published: Apr 23rd, 2009
  • Category: entry
  • Comments: 1

Weekend Reading: Fun with Data and Statistics

Tags: , , ,

I know, I know, it’s only Thursday, but a girl can dream, right?

At work, I design a lot of database systems that manage a lot of data. Most of these systems are put in front of real human beings who are expected to find meaningful data in a big big pile of it. The two main approaches are to use either a harsh, editorial-driven, curated system such as a category hierarchy (Rock falls under Music falls under Entertainment) or have a completely free-flowing, user-generated system such as tagging or description search. But in either case, there’s always something missing – you pick tagging, you wish people didn’t tag things with “boobies” so much. You pick a strict category structure and it just feels too restrictive. So what can you do? The March/April 2009 issue of IEEE Intelligent Systems Magazine has an article Unreasonable Effectiveness of Data.

We should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

The article broke my brain a little bit, but go read it, it’s interesting nevertheless.

While we’re talking about representations of data, go read about the Semantic Web – how can we tell computers and teh internets what we humans want?

If you want a little bit lighter reading, go read Bill Bryson’s books about language, specifically Mother Tongue and Made In America. Reading anything by Bill Bryson will make you a better person (or your money back).

Once you have your data, someone will inevitably ask to tell them what’s “popular”. I’m putting it in quotes, because it means so many things to so many people. Before you answer, learn a little bit about statistics. I recommend Statistics in a Nutshell from O’Reilly. Hint: “most popular” does not always mean “has most views”.

For some real-life scenarios of statistics, misuse of statistics, problems with polling plus a nice dose of politics, read Nate Silver’s FiveThirtyEight.com blog. He’s also a partner and analyst for Baseball Prospectus – you might fight baseball boring, but boy, does it lend itself toward awesome stats gathering and mangling. Reading the two might not be immediately applicable to software developers, but it’ll put your mind in a right context when trying to get meaning out of your giant pile of data.

I will expect your book reports by Monday.

  • Author: maggie
  • Published: Feb 22nd, 2009
  • Category: entry
  • Comments: 2

The Meaning of Keys

Tags: , , , ,

If you’ve taken that Database Design 101 class in college, you probably have learned that containing meaning in primary keys is not a good idea. The example that’s often given is a situation where a bank keeps track of its customers by associating them with a branch where they opened the account. This of course causes problems if the user moves to a different branch or if a branch closes altogether. These are pretty drastic scenarios and as developers, we are quickly scared away from putting meaning in keys. Or are we?

It is my personal opinion that SQL is one of the most human friendly languages. I’ve often fixed or helped fix problems with complicated queries by stepping back and explaining in human words what it is I want to retrieve from the database. My sociology knowledge narrows down to one class in college and my psychology experience is nonexistent, so these statements are pure conjecture: SQL, just as much as Venn diagrams simply make sense to people. As such, many times developers will make the leap from what they know to what they suppose is true (after all, we’re not Vulcans).

One big example of such behavior is MySQL’s AUTO_INCREMENT feature. AUTO_INCREMENT is usually used on a table field which becomes the primary key. Whenever a new record is entered, the field is assigned a new value, higher than the previous highest value in that field. Now, there is absolutely no reason a primary key should contain any notion of where in the series of records it lives.

A primary key only has one purpose: to uniquely identify a record in a table.

While using AUTO_INCREMENT gives you really nice features (like LAST_INSERT_ID() function or helping to avoid errors in a replicated system), for the purposes of building a primary key, it’s kind of overkill. All a primary key requires is to be unique.

Taking a step back: SQL was always meant to be very strict and have a very strict set of standards. Vendors who implemented SQL (such as MySQL, Oracle) are supposed to comply with these standards and certify that they comply with them. The standard SQL has become known as ANSI SQL in the late 1980’s as it was then when the American National Standards Institute adopted SQL as a standard. Unfortunately, this standard is not very well documented and/or the documentation of the standard is not easy to find. Given that there is a lot of dough to be made in software development and the enterpreneur nature of software companies, each SQL implementation slowly veered away from the standard, adding functions, finding new ways of solving problems, modifying things here and there. Depending on who you ask, these changes are portrayed as useful features added by the vendors or traps to lock clients to a specific RDBMS.

So how do all this politicking affect our primary keys? We simply now have pretty ways of creating them. In MySQL you have the already mentioned AUTO_INCREMENT. In Oracle, you can create a SEQUENCE which is a construct that holds ordered integers. Using it is very similar to AUTO_INCREMENT. You can also use the SYS_GUID() function which generates 16-byte long globally unique identifier (”GUID” is really a Microsoft word, what this function generates is really a UUID, but we can talk about the politics of this part another time). These values may look like this:

LOCATION_ID UID_COL
----------- ----------------------------------------
       1000 7CD5B7769DF75CEFE034080020825436
       1100 7CD5B7769DF85CEFE034080020825436
       1200 7CD5B7769DF95CEFE034080020825436
       1300 7CD5B7769DFA5CEFE034080020825436

On principle, I love using SYS_GUID() for primary keys. It is a true primary key with no meaning. But in practice, they’re a giant pain in the ass when you’re debugging queries or just want to quickly see what’s in a table. I think the PITA-ness of the UUIDs here is that they’re so hard for humans to read and parse. They add a layer of translation you have to do before you understand the data – this kind of goes against the idea that SQL is so easy to understand because it reads like a natural language.

Another side effect is the mixing of data retrieval with the creating of database objects. Often I stress the following:

  • SQL: Structured Query Language – retrieval of data. Think SELECT.
  • DML: Data Modification Language – modifying data. Think INSERT, UPDATE, DELETE.
  • DDL: Data Definition Language – structure of data. Think CREATE, DROP, ALTER.

The ANSI SQL contains these three distinctions and standard SQL follows them. The reason for making a distinction is that mixing these three in any combination are mostly data integrity. You shouldn’t change the table structure at the same time as retrieving data. You shouldn’t modify data in a table when you’re also retrieving from it. The RDBMS will manage for you the distinction, so don’t worry about them, until of course, the RDBMS itself bypasses the rules.

Before databases offered fancy ways to increment values, you had to roll your own. You’d basically create a counter table that would be updated every time a new incremented number was required. Remember rollback and commit? Those only apply to DML – whenever you change data (but not data structure). Keeping all your data, including the primary key generation, as DML allows for keeping all your transactions atomic. Whenever you insert a new row into the main table, the counter table gets updated. You can rollback or commit the change in one step. It also means you’re less likely to encounter gaps in your counter and you can easily retrieve the maximum counter values.

MySQL takes this a little further with AUTO_INCREMENT. If you try to create a field in a table that is incremented automatically, but you don’t define it as a key, you’ll get the following error:

MySQL: ERROR 1075 (42000): Incorrect table definition; there can be only one auto column and it must be defined as a key

Oracle has sequences, which are separate constructs. Values are accessed as:

MY_SEQUENCE.next_val()

Every time you get a new value, it’s used and has weird behavior in terms of rollbacks and commits that may result in gaps in the sequence.

SYS_GUID(), as I mentioned, is totally awesome (if it weren’t such a pain). But it returns a new unique value every time. This is great for some scenarios, but not useful for others.

Over past few months, there’s been blog posts popping up here and there with ominous titles such as Is the Relational Database Doomed? and others indicating an end of an era. I think what most of these posts are really a knee-jerk reaction to is that high levels of normalization are kind of unnatural to human beings. Great for computers, but unnatural (quick, someone make a joke about natural joins!) Perhaps the next generation of database courses will teach that high levels of normalization are not good at all – data integrity may be important, but being able to retrieve that data really fast is far more rewarded. Think of it this way: keeping a list of all articles on your blog is really important, but when’s the last time someone looked at your blog post from 3 years ago?

While “premature optimization is the root of all evil”, sometimes optimizing in the very early stages is good as well. This especially rings true for systems that expect a lot of data, the kind that knows that they need to be horizontally scalable without single points of failure. Heavily normalized databases are often viwed as an obstacle that needs to be overcome (slow) instead of a benefit (you can ensure the integrity of data).

Most current database optimization techniques such as table partitioning, table sharding (withing 1 or many databases) and database replication depend on the meaning of keys or some other field to make the decision where the data goes. Perhaps all users whose usernames start with A live in db01. Or maybe all users in a certain region are spread evenly over four database machines, while all other users, not being the target market, get their own wimpy machine? And the example that’s often used to scared developers: perhaps it is a good idea to keep all customers of a branch separate from other branches – yes, you’ll take a hit if a branch closes or a customer moves to a new branch, but perhaps we know how to handle it outside of the database?

As you can see: keys with meaning aren’t good or bad – they simply serve a purpose. It is up to you, as a developer, to decide the best tool for the job. Good luck!

  • Author: maggie
  • Published: Feb 7th, 2009
  • Category: entry
  • Comments: 21

Denormalization with Bitmasks

Tags: , , , , ,

This is an oldie, but goodie.

A lot of times you’ll find yourself retrieving lists of records, e.g. a list of users on your site, only to find out that each of those records, i.e. each user requires a retrieval of another list of records: roles and permissions, hobbies, pets, preferred languages etc.

This will either cause queries in loops (do a SELECT for every row returned from the first SELECT). Or you might find yourself writing overly complicated pivot queries. (My personal choice has always been writing my own global Oracle functions for data aggregation – talk about overkill!)

Even if you have great hardware, your database will inevitably become your bottleneck, so take care of your database: design and build to avoid unnecessary database access and once you’re in the database, get your data as fast as possible and get the heck out!

Back to basics is your solution and there’s really not many things more basic than bitmasks. Let’s say you have a social networking site (who doesn’t these days?). When users register (or at some point later on), they can specify what kind of pets they own. You can use this data to perhaps match those users – cat lovers like other cat lovers, right? (*shifty eyes*)

This is how the relationship between users and pets might look like in your database:

user_pet_erd

Let’s say you have the following data in those tables:

user:

+----+----------+
| id | username |
+----+----------+
|  1 | Maggie   |
|  2 | Sully    |
+----+----------+

pet:

+----+------------+
| id | name       |
+----+------------+
|  1 | cat        |
|  2 | dog        |
|  3 | lizard     |
|  4 | parakeet   |
|  5 | guinea pig |
|  6 | snake      |
|  7 | unicorn    |
|  8 | ferret     |
+----+------------+

user_pet:

+---------+--------+
| user_id | pet_id |
+---------+--------+
|       1 |      1 |
|       1 |      4 |
|       1 |      5 |
|       2 |      7 |
|       2 |      8 |
+---------+--------+

How do I find out easily what kinds of pets my users have? Oh, it’s easy:

Maggie has:

select p.name
  from user_pet up,
       pet p,
       user u
 where u.username = 'Maggie'
   and up.user_id = u.id
   and p.id = up.pet_id;

+------------+
| name       |
+------------+
| cat        |
| parakeet   |
| guinea pig |
+------------+

And Sully has*:

select p.name
  from user_pet up,
       pet p,
       user u
 where u.username = 'Sully'
   and up.user_id = u.id
   and p.id = up.pet_id;

+---------+
| name    |
+---------+
| unicorn |
| ferret  |
+---------+

* Of course Sully has a unicorn! (He also poops nanchucks.)

It’s easy to find out what kind of pets users have one at a time. Let’s do it in one big swoop though:

select u.username,
       p.name
  from user_pet up,
       pet p,
       user u
 where up.user_id = u.id
   and p.id = up.pet_id;

+----------+------------+
| username | name       |
+----------+------------+
| Maggie   | cat        |
| Maggie   | parakeet   |
| Maggie   | guinea pig |
| Sully    | unicorn    |
| Sully    | ferret     |
+----------+------------+

You get all the data you wanted, however, it’s all spread over many rows; some data aggregation is required! You’re probably thinking to yourself: “Oh, man, if only there were a function that works just like SUM() but for strings!”. If you’re using MySQL, you’re in luck thanks to the awesome GROUP_CONCAT() function. If you use it, you’ll get this:

select u.username, group_concat(p.name)
  from user_pet up,
       pet p,
       user u
 where up.user_id = u.id
   and p.id = up.pet_id
 group by u.username;

+----------+-------------------------+
| username | group_concat(p.name)    |
+----------+-------------------------+
| Maggie   | cat,guinea pig,parakeet |
| Sully    | ferret,unicorn          |
+----------+-------------------------+

Pretty swanky, eh? However, while group_concat() is totally awesome (oh, I love it so!), it’s only available in MySQL. (Although comments on this group_concat() praising blog post have an example of how to accomplish group_concat() in postgress, which you could also achieve in Oracle.) But I digress. Also, once you have the string representing all the pets, you’ll need to parse it in your application to get the pet names. And most importantly, this data is denormalized in a way that makes it a little difficult to enforce data integrity once you modify the list.

If group_concat() didn’t exist, what else can you do? Thinking of group_concat() as a sum() but for strings instead of numbers is an interesting approach. Next step: what do you have available out there that would yield a sum for a combination of values that could then be reverse engineered to get that unique combination of values again? That’s right, bitmasks!

How do you implement this in your user-pet scenario? It’s easy: first, represent bits in your bitmask as decimals on which you can later do math. Use a table to keep track of the bit-to-decimal translation for additional data integrity. Use those decimals as keys for your pets.

First, I create a table named power_of_two with the following values:

+----------+-------+
| exponent | power |
+----------+-------+
|        0 |     1 |
|        1 |     2 |
|        2 |     4 |
|        3 |     8 |
|        4 |    16 |
|        5 |    32 |
|        6 |    64 |
|        7 |   128 |
+----------+-------+

Then I use those values as IDs in the pet table:

+-----+------------+
| id  | name       |
+-----+------------+
|   1 | iguana     |
|   2 | cat        |
|   4 | dog        |
|   8 | lizard     |
|  16 | parakeet   |
|  32 | guinea pig |
|  64 | snake      |
| 128 | unicorn    |
| 256 | ferret     |
+-----+------------+

Note that I added an iguana as pet with id of 1 – this is 2 to the 0th power. I needed to add a pet at this spot to account for the rightmost place in my bitmasks (otherwise I have to shift everything by 1, which can be confusing).

We have 9 possible pets. If I have no pets, my bitmask will be 000000000 – or 0. If I have the first and the third pet, my bitmask will be 000000101 – or 5.

Assuming these changes to my database, let’s see what kinds of pets Maggie and Sully have:

select u.username, sum(p.id)
  from user_pet up,
       pet p,
       user u
 where up.user_id = u.id
   and p.id = up.pet_id
 group by u.username;

+----------+-----------+
| username | sum(p.id) |
+----------+-----------+
| Maggie   |        50 |
| Sully    |       384 |
+----------+-----------+

50 = 0b110010 – so Maggie has the 2nd, 4th and 5th pet. 384 = 0b110000000, so Sully has the 8th and the 9th pets.

Let’s assume your application heavily caches data that doesn’t change often – such as the pet table (not user_pet). Let’s say you have this cached:

$pets = array(
    1 => 'iguana',
    2 => 'cat',
    4 => 'dog',
    8 => 'lizard',
    16 => 'parakeet',
    32 => 'guinea pig',
    64 => 'snake',
    128 => 'unicorn',
    256 => 'ferret'
);

If your application already knows this:

Maggie: 50
Sully: 384

Then with a clever use of PHP’s decbin(), it should be really easy to display the right pet names.

This approach is great for when you have a very limited number of times you can connect to the database and once you’re there, you don’t have the luxury of running expensive aggregate queries. Also, web servers are easier to scale than coming up with error-proof database scaling solutions. I’ve also found this approach extremely useful when migrating lots of data about users (think millions of rows) from one system to another. Additionally, by having the power_of_two table, you’re able to be somewhat strict about the possible values for the user_pet table – this gives you slightly more data integrity than just using the concatenated string of names.

But the best part about using bitmasks is that you can do math on them! What’s easier to compare:

iguana,cat,dog,guinea pig,unicorn,ferret
to
iguana,cat,guinea pig,snake,unicorn,ferret

Here you have to possibly split the string on the comma into an array, then do array comparisons.

OR

110100111
111100011

Here you can use a bitwise AND to figure out the overlap!

Better yet, you can easily give people more pets!

110100111
OR
111111111

And now that person has all pets. And you didn’t even have to check if they had them before! Hooray for bitmasks!

Remember, as with many other optimization techniques, this one will not be appropriate for every scenario, but when you do need it, I promise it will work very very well!

P.S. I found it ironic to talk about a function that MySQL has but not Oracle – it’s usually the other way around. The str_agg (string aggregate) function I’ve had to write is so clunky – Oracle, please implement group_concat()!

© 2010 Maggie Nelson. All Rights Reserved.

This blog is powered by the Wordpress platform and beach rentals.