Wednesday, December 1, 2010

OOPs: mangling your database with objects

A developer once showed me a frightening sight: a page and a half of dense SQL, consisting mostly of join clauses. What was frightening was not so much the performance of this query -- it was a utility query, so the long query time was not important -- but the fact that he was actually proud of that query. How did a production DB schema get into such a state? The problem, in a word, was Hibernate. I'm a fan of Hibernate, so I'm not saying that there's something wrong about Hibernate. Rather, Hibernate lets you too easily do things that you should have second thoughts about. To borrow a metaphor, this tool comes with live chainsaws attached.




An industry veteran once explained to me that there are basically two kinds of developers: applications people and database people. Unfortunately, they tend not to be on talking terms, and ORM is exactly the sort of place where application and database collide. Taking an excessively application-centric view of the universe can land you in the kind of trouble I opened this blog post with.

In this case, the Hibernate mapping started with a single base class with common metadata: ID, modification timestamp etc. All persistent classes inherited from this base class. The problem came from the decision to actually manifest this class inheritance in the schema using table-per-subclass mapping, and to have a  extensive inheritance hierarchy. There is a table representing the base/abstract class, a separate table for each top level entity, and another table for each subclass. So every persistent object required at least a join between this base entity table and the table representing the concrete class. Any association guaranteed at least a 4 table join, and usually many more.

Considering update performance, the situation looks even worse. All updates required writing to this base entity table, no matter what object you update. MySQL's default storage engine is MyISAM. If you use any engine apart from InnoDB, any kind of row insert/delete/update causes a table lock. This schema meant every single DB update had to wait on the same table lock. That same developer said his schema suffered from poor write performance on MySQL. Gee, I wonder why?

An aside: if you are using MySQL and are not using InnoDB as your storage engine, please ask yourself why.

I would say that the main problem here is an excessively object-centric approach to ORM. An application-oriented mindset would focus too much on the O (object) part of ORM and forget the fact that you are working with a relational (R) database. Using table-per-subclass mapping "feels" natural if you are working with objects, and Hibernate makes it so easy to do so. But unless you really need polymorphic queries, there is little need to do so. And even from an OOP perspective, inheritance overuse has long been a problem. In the single base entity scenario I describe here, there is clearly no meaningful is-a relationship.

The fact that Hibernate lets you express OOP constructs so easily should never blind you to the fact that you are ultimately working with a relational database. Inheritance, in particular, maps poorly to relational databases. ORM is a leaky abstraction, and should not be an excuse to ignore the SQL. Respect the database, and be prepared to make compromises in your object model.

11 comments:

  1. If the problem is that an OO programmer doesn't know SQL then the solution is *NOT* to employ an object-relational mapper such as Hibernate, it is to teach him SQL and relational theory. It's not that difficult. In fact the rules of data normalisation are much easier to follow than the theories of object oriented design.

    I was designing and using databases, both relational and non-relational, for many years before I moved from prodedural to OO programming, so for me the starting point of any database application has always been a properly normalised database design. Having identified all the relevant entities and their relationships I then skip OOD completely and create one class for each entity (database table). All the common code is inherited from a single abstract table class. This means that I do not "overuse" inheritance by any stretch of the imagination. I have never used an ORM, and I never intend to. I consider them to be an abomination and a sure sign of mental weakness.

    The idea that you design your database to support your objects is something which I consider to be a huge joke. It is much better, IMHO, to design your objects to support your database design. I have been using this technique very successfully for many years, and I have even built my own RAD framework around this simple concept.

    ReplyDelete
  2. One of the best writers about the intersection of object-oriented programming and database design is Mike Blaha (one of the original 4 authors of OMT which eventually became UML). He has a new book from CRC Press which I recommend highly.

    ReplyDelete
  3. thinking about this a bit more, I think the fundamental issue may be the impedance mismatch between databases as a way to externalize data from an application so that they're useful in multiple contexts, and databases as a way to persist objects in a single application. I see quite a few applications that use full relational DBMSs for object persistence even though no other application will ever read the stored data, let alone attempt to update it. The commodity pricing and availability of RDBMSs makes this an appealing choice, but true object databases or one of the new breed of "No SQL" DBs would probably be superior.

    ReplyDelete
  4. Chris, I greatly enjoyed your post.

    There is an old-fashioned idea out there that works extremely well. It goes like this. A database is a permanent record of facts. The rules of normalization give us a very well understood way to organize those facts. Once you have a schema to record your facts, you code an application to manipulate them. Most experienced developers will tell you that once the schema is sound, the app writes itself. However, if you go backwards, beginning with app code and then "persisting" it, you get the garbage you describe in your post.

    Here is a thought I've been considering lately. When E.F. Codd proposed the Relational Model in 1970 at least half of his original paper was an explanation of normalization. In other words, the Relational Model was born complete with a rational method for using it. Object Orientation, by contrast, has no theoretical basis and so we cannot derive a "correct" way to organize classes out of the theory because there is no theory. So everybody makes up their own way, and whoever shouts loudest wins. In the end you have abstraction contraptions (or if you want to be nasty, abstracturbation) like the one you describe above.

    As for the dangers of ORM, Tony Marston said it better than I could. He ought to copy that comment directly into his own blog as a post.

    ReplyDelete
  5. A major part of the application vs database conflict is whether we think of the database as ontologically independent of the application. This is something you have alluded to, and I think it is critical. If the database is merely the slave of the application -- and not a peer -- then it gets warped, mutilated and abused. The application-only mindset does not care how ugly the SQL looks coming out the other side. On the other hand if you actually think the data is valuable apart from the application, maybe you will treat it with more respect.

    But there are benefits to granting the relational database peer status, even if you cannot foresee the data being used elsewhere. Contrary to Ken M's opinion, I think you can lose something by dropping an RDBMS' framework. An RDBMS has powerful guarantees regarding data integrity, whereas application logic can be buggy, leading to corrupt data. But perhaps most importantly to me as a developer is the fact that I know how to make ad-hoc queries on any SQL database out there. This is not just important for product features like data mining, reporting and auditing, but also for developers like myself who want to extract, aggregate or analyze the data to figure out what is happening to my application.

    That said, I do not share the disdain in the comments for ORM. Hibernate is only a tool, and can be abused by its practitioners. It is a tool not only for OOPers, but also for data-oriented people. Remember, your database does not have to be a slave to the objects, not even with Hibernate. Hibernate has powerful features to map even legacy databases that were never designed for OOP.

    I don't believe Hibernate should be an excuse to ignore the SQL. I personally like to log the SQL that Hibernate generates to check what it is doing. If you treat the database with respect, I believe Hibernate can be a powerful and useful tool to bridge that gap between SQL and application code.

    ReplyDelete
  6. @ Ken Downs who said: "As for the dangers of ORM, Tony Marston said it better than I could. He ought to copy that comment directly into his own blog as a post."

    I created an article on this very subject in August 2007 at http://www.tonymarston.net/php-mysql/object-relational-mappers-are-evil.html

    ReplyDelete
  7. Chris, you said:

    "Hibernate has powerful features to map even legacy databases that were never designed for OOP."

    This reveals, I suspect anyway, a deep difference in assumptions between us. Databases do not need to be designed to "map" to anything, because their job is store data.

    A Relational Database operates on principles appropriate to its goal: data storage. The most important feature of all modern DBMSs is, IMHO, ACID compliance. This guarantees you can trust the Database to handle simultaneous activity in a predictable fashion. The other major principle, normalization, guides the developer in his area of responsibility: organization.

    So: you do not design a database to cater to an app, you design it to accurately store facts. There is no such thing as a "legacy" database that does not know about OOP.

    The job of application code is to grab data and present it to a user, to accept user requests to change data. It is also usually responsible for large amounts of massaging that can technically be handled in either tier, but that is a debate for another day.

    Before OOP entered the mainstream commercial world, all developers understood this. They designed code to be good code, and databases to be good databases. But OOP brought a sea change, suddenly people were seriously suggesting that we change the principles of database usage to cater to the coding method of the app. This is, to put it mildly, extremely bad engineering.

    ReplyDelete
  8. KenDowns: somehow you read the exact opposite meaning of what I was trying to say. I think you need to understand Hibernate more before we can meaningfully disagree.

    To "map" is to tell Hibernate how to interact with a target database: how to read data into objects, how to write updated objects back to the database, what constraints are present etc. You can map a perfectly normalized, DBA-approved database to Hibernate without making any changes to the schema itself.

    As I said, Hibernate is a tool, and a tool can be abused. It is the practitioner who needs brings a degree of respect for the database.

    ReplyDelete
  9. Chris, I believe I did understand what you are saying, basically, "Hibernate is cool, does not require a database to have been created by Hibernate, but the developer should be aware of good database principles." The theme of my own blog is a generalized version of this: no matter what language or tool you are using you need to understand the database.

    If I appear to disagree or to be making tangential points, it is because, I believe, that the very vocabulary of ORM in general and Hibernate in particular works against this goal that we appear to have in common of respecting the database. I chose to pick on the term "legacy databases" precisely because it implies a database needs to be set up for mapping, and it is necessary for you to point out that it is not. The vocabulary itself is working against you and forcing you to make points that should not need to be made.

    If I were to pick apart other phrases we would be here forever, so I will leave it at that. I have a post on this topic coming up, probably on Wednesday, with a link to your OP. Which by the way, was a great post. Looking forward to more.

    ReplyDelete
  10. I don't really see the problem with ORMs. As with everything you can use them effectively or really badly. A great case in point is the pythonic web framework Django. It provides methods for total OOP like structures with inheritance hierarchy mapped onto RDBMS as Chris describes above (bad) as well as completely normalised designs (good). There is a very close link between the model/class design and the actual table output so as with everything else python - explicit is best. It's perfectly possible in Django to create really great RDMBS architectures that are totally portable beyond using Django. Ok granted you cannot specify more advanced RDMBS objects like views, triggers, stored procedures etc, but when you need these tools then it’s simple enough to get them to work with the ORM afterwards.

    What I do see a lot of is peeps beginning development without ever touching any SQL. This (in my opinion) is where the real problems are precipitated. Without a decent understanding of the underlying data model, how this is applied to RDBMS etc then ORMs just make it far too easy to forget all about D-R-Y.

    Django makes heroes of every would be web app developer, but it also allows for villains!

    ReplyDelete
  11. I think the problem is that hibernate and similar ORM tools try to map an inherently non-relational object world into a relational world. This doesn't work very well as you point out. However, I am fond of mybatis (formerly ibatis) as it allows one to map relational data onto objects; which seems to work far better. Of course this might be because I am very RDBMS-centric. I believe in getting the database correct first... then MyBatis allows one to map (oh no) DAOs onto the various sql statements (Queries and CRUDs). The bottom line is that I think it is far easier and more sensible to map relations onto objects than the other way around.

    ReplyDelete