Thursday, April 14, 2011

NoSQL: triumph of application over data

NoSQL databases are the rage now, aren’t they? Not too long ago, NEJUG hosted a presentation by Tim Berglund called "NoSQL Smackdown." Before giving his tour of several NoSQL databases, Tim tried to give his definition of "NoSQL." After showing the difficulties of defining NoSQL as a series of negatives (no SQL, no relations, no transactions), he merely settled on a vague definition. Basically, he says, NoSQL is merely a set of different approaches in data models, querying and scaling that trade off things like performance and the 3 guarantees of the CAP Theorem.

In other words, there is no such thing as a canonical NoSQL database. Most emphasize key/value or document-oriented storage, but then you have something totally different like Neo4j, a graph database. NoSQL fans like to emphasize scalability, but some candidates don’t really scale. You can’t even say that NoSQL means no SQL, because you have something like GemFire, which does allow you to use SQL. Nevertheless, allow me to give you my own definition of NoSQL. Ready? Here goes:

NoSQL databases are databases for people who don't care about data.

While you are cleaning off the coffee that you just sprayed all over your computer screen, let me explain. Of course you care about the data. You care if it was written correctly, you care if you can read or write it fast, and you would be upset if it got corrupted. You want it to scale -- in both size and activity -- and be tolerant of hardware failure. But all these concerns have to do with doing stuff with data, working with data in motion. This is one difference between an application-oriented mindset and a data-oriented mindset: whether we are concerned with the data per se, the data at rest. Much of the discussion about the relative merits of NoSQL vs relational databases is done from an application-oriented mindset. When we talk about the CAP Theorem, ACID compliance, transactions, performance, scalability etc we are concerned primarily with the usage of the data. But even if there is no performance issue, even if there is never a hardware failure, even if only one client accesses the database at a time … is that data good?

The importance of being normal

Emphasis on data correctness, I think, is where NoSQL and conventional relational databases truly part company. In particular, one trait that seems to be common to all NoSQL databases is the absence of a fixed schema. The data could be key/value, nested hash tables or JSON documents, but they are all relatively freeform. As an application developer, I can see the attraction behind schema-less databases. I would not have to worry about how to shoehorn my objects into a rigid, tabular format, nor make complex queries to reassemble them later. I would not have to deal with DDL and SQL migration scripts. After all, my application owns the data. But what if it does not?

Maybe one test of whether you are approaching the database from an application v.s. a data centric mindset is your sense of whether the data is ontologically independent from the application. Is the data useful apart from the application that populates it? If so, wouldn't it make sense to structure that data with constraints that guarantee some level of correctness, even if that structure makes it less convenient for the application to use? That is why we have this huge area of relational theory covering database normalization.

I'm surprised at how many application developers don't seem to grasp normalization. I've interviewed people who have written tons of SQL and mention normalization, but when asked can't actually articulate what it means. Not that I can memorize the definition of every normal form, but folks can't even explain why they want normalization. Some tell me it's for performance (quite the opposite, in fact). The point of normalization is to reduce duplication of data. When you have redundancies in relations you have problems keeping them consistent (updating), or data anomalies appear when you insert or delete partial entities.

NoSQL databases don't seem to care about normalization. Granted, some of them have data models that are sort of tabular. But none of them really support a normalized schema. You generally can't do joins. Outside of graph databases there is no robust linking capability needed for lossless decompositions. Trying to build a normalized database without foreign key constraints is like building a Star Trek teleporter that regularly loses a random body part (sorry, we lost your arm near Jupiter). 

Relational databases also provide other data guarantees: typed data columns, not-null constraints and other value restrictions. These data integrity guarantees help maintain the correctness of your data even if you have bugs in your application. Users of that data apart from your application can depend on the constraints of that schema. Your data is more resistant to application bugs. To use a programming analogy, it's like choosing to use a strongly typed language.

Certainly, there are advantages to a schemaless NoSQL database. In many cases, you do not need the data guarantees that the relational model provides. But we should make that decision with open eyes, knowing what we give up: query-independent data with integrity guarantees that allow you to rely on a certain level of correctness.

1 comment:

  1. This is all correct, you are on a good path. Except for unclear phrases like "typed data columns", "shoehorn objects into tabular format". If you want better defined terms and - as a side effect be the kind of guy who is brought in to fix NoSQL/OODBMS disasters - read this: