Thursday, December 17, 2009

The cloudy future of relational databases

Cloud computing is the big new thing. The fate of the formerly ubiquitous relational database, on the other hand, is uncertain. There are many ways for a developer to deploy his application to the "cloud", whatever that might mean. On the other hand, the various cloud vendors do not necessarily have a database story to tell, or if they do that story isn't necessarily a good one. As a result, deploying a traditional database-backed application to the cloud may be difficult.

One story I have heard is to simply opt out of using a relational database. You would use some "NoSQL" persistence technology, such as SimpleDB on Amazon or Google's datastore. Such solutions have their advantages in terms of performance, scalability and ease of mapping to objects. If you just want to store dumb data objects, that may be all you need. But there are situations where the traditional RDBMS is more appropriate. If you have a data-centric view of your problem domain, where the data is ontologically independent of the application and can outlive the application, where you want the database to guarantee certain transactional and integrity rules, where you want to do lots of ad-hoc querying ... you might just want an RDBMS. My worry is that for many cloud platforms, you may not have that option.

For the platform as a service (PaaS) sort of cloud, where you may not even have filesystem access, setting up your own database server will not be an option. Even where the vendor provides a database service, your choice remains limited to that one service. For example, if you use Azure you can only use SQL Server. The infrastructure as a service (IaaS) sort of cloud will generally give you that freedom, but it may have limitations. Generally, IaaS vendors like Amazon offer:
  • a compute service: a virtual machine with local filesystem storage. Small, fast storage.
  • a storage service: an API to read/write data on the cloud. Big, slow storage. 
So you have a choice of small, fast storage (EC2) or big, slow storage (S3). Trouble is, a database server really wants big and fast storage.

At this point, I think the reader of this blog may want to slap me on the head and point out that yes, Amazon tells a perfectly good database story. In fact, it must have at least half a dozen stories to tell. You can use their MySQL service, put up an Oracle AMI or stand up your own database server: there are lots of choices both from Amazon and third parties. There are choices partly because Amazon's EC2 can use elastic block storage (EBS). This is a filesystem which is both big and fast, especially if you set up several EBS volumes in a RAID configuration. I noticed that my former colleague Chip recently blogged about his success deploying database-backed applications to Amazon's cloud.

OK, so you have lots of database choices if you go with the Amazon cloud. But only Amazon. I have been talking to another cloud IaaS vendor without a database story, being stuck with the fast/small vs slow/big dichotomy. With other vendors, it seems, you either have just one choice of RDBMS or you simply cannot get a production-ready RDBMS in the cloud. Did I miss an another option or two? You tell me.

Wednesday, December 9, 2009

Hibernate formulas do not like subselects

Hibernate formulas and the subselect fetch strategy are currently incompatible. There is already a bug associated with it: HHH-1400. I am linking the bug at the beginning because blog posts like this have a limited shelf life. If you came across this post from Google, it is possible that this note is already obsoleted by a bug fix. I have little hope of this happening very soon right now, though, seeing that HHH-1400 was opened almost 4 years ago.

What's a formula? A formula is a Hibernate feature that lets you map a property to an SQL expression rather than a table column. Hibernate would automatically compute the value of the property for you each time you load an entity. A formula in annotation form looks like the following:

@Formula("(select count(*) from foo where ...)")
public Integer getFooCount() { return fooCount; }

I think it is an elegant feature. Unfortunately,  this feature seems to interfere with the subselect fetch strategy.

Why do subselects matter? The subselect fetch mode is an important tool to deal with the n+1 select problem when loading entities with one-to-many relations. Suppose we had an entity called Parent with a collection of Child entities. Suppose also we load n Parents and eventually want to iterate over each element in Parent.children. The typical Hibernate-generated SQL with a lazy-loaded collection would look something like:

select * from Parent where ...;
select * from Child c where c.parent_id=?;
select * from Child c where c.parent_id=?;
select * from Child c where c.parent_id=?;
...

In other words, we would have n+1 select statements, which is bad if n is large. We could do a join between Parent and Child to eagerly load the collection, but this only works with one collection. If Parent has multiple collections, a join-fetch of multiple collections would lead to a Cartesian product, retrieving a prohibitively large number of rows. This is where the subselect fetch strategy comes in. If mapped to use the SUBSELECT fetch strategy, we would only need two select statements, with the first one's where clause reused in the second:

select * from Parent where ...;
select * from Child c where c.parent_id in 
   (select id from Parent where ...);


Unfortunately, if you had annotated a property with a @Formula, Hibernate will mangle the SQL, leading to an SQL parse error:

select * from Parent where ...;
select * from Child c where c.parent_id in 
   (select ...) as formula0_1_, ...;


If you really need to use subselect fetches, you'll have to find an alternative to @Formula for all its elegance. Possible solutions:
  • Compute the properties with separate queries only as needed. This is tricky if you frequently need those computed properties.
  • Sort of "denormalize" (I'm sure I'm misusing that term here) by persisting the computed values in the table. This works if the computed values rarely change, but is a problem otherwise.
  • Have the DB manage the computed values using views. Eyal Lupu has a great blog entry on how to do this. Obviously, this only works if your database supports views, and requires DB-specific SQL.

Thursday, December 3, 2009

BlackBerry development: big things happening

The buzz in mobile development seems to be on other platforms. People outside of the BlackBerry community probably haven't been paying much attention, but there has been some major announcements in the area of BlackBerry development. These announcements, coinciding with the recently concluded BlackBerry Developer Conference, show that RIM is actively responding to its competition on several fronts. RIM's BlackBerry Developer's Blog has been covering these developments faithfully, so go there if you want to read from the official source. Here, I will go over what seem the most significant announcements to me as a developer.