Friday, October 29, 2010

Beware the magic flush

It often surprises me what a good Java profiler can tell me about Hibernate performance issues. Recently I was studying a performance issue that seemed straightforward. There were lots of objects, lots of writes and lots of queries. Preliminary profiling showed one particular Hibernate query dominating the operation. That's not surprising, I thought: it's a frequently made query, and it's probably slow. I'll just find a way to speed it up, maybe with an index.

As it turns out, it was a simple, quick query, and there was little hope of speeding it up much. Further profiling showed that very little time was spent actually running the DB query. So what was Query.list() doing that was taking so long? It was the implicit flush. The Hibernate session keeps track of the objects you have loaded, and when it is flushed it looks for modified objects and writes them to the database. A final flush happens upon transaction commit. Additionally, by default Hibernate flushes dirty objects to the database before making a query, to ensure that the query runs against data that reflects all updates made to that point. If you have lots of persistent objects loaded, each flush can be expensive because with each query:
  • Hibernate has to check all persistent objects in the session for changes 
  • Hibernate will write any updated objects to the database. This can be wasteful: 
    • The updates may not be complete yet, so these intermediate writes are unnecessary
    • Spreading the DB writes across the transaction reduces Hibernate's opportunities to batch SQL statements for performance compared to writing everything at the end of the transaction.
The fact that flushes can happen during queries is significant. If you notice that an operation interleaves object modifications and queries, consider the potential for a significant speedup just by preventing that interleaving. For example, you could make all the queries up-front, before adding, modifying or deleting any persistent objects, so the only flush with real work to do would be the final pre-commit flush.
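
To make the cost difference concrete, here is a rough sketch in plain Java. It is a toy model, not Hibernate code: ToySession, Entity, modify, interleaved and queriesFirst are all hypothetical names invented for this illustration. It assumes only what is described above: every query triggers a flush, and a flush checks every tracked object and writes the dirty ones. It counts how many dirty checks and writes each ordering causes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types are Hibernate classes.
final class FlushCostSketch {

    /** Hypothetical stand-in for a persistent object tracked by the session. */
    static final class Entity {
        boolean dirty = true; // a newly added object still has to be written
    }

    /** Hypothetical stand-in for a session that flushes before every query. */
    static final class ToySession {
        final List<Entity> tracked = new ArrayList<>();
        long dirtyChecks; // objects examined across all flushes
        long writes;      // simulated INSERT/UPDATE statements

        void add(Entity e) { tracked.add(e); }

        void flush() {
            for (Entity e : tracked) {
                dirtyChecks++;
                if (e.dirty) { writes++; e.dirty = false; }
            }
        }

        void query()  { flush(); } // implicit flush, then the SELECT would run
        void commit() { flush(); } // final flush at transaction commit
    }

    /** Adds perRound new objects, then touches every tracked object. */
    static void modify(ToySession s, int perRound) {
        for (int i = 0; i < perRound; i++) s.add(new Entity());
        for (Entity e : s.tracked) e.dirty = true;
    }

    /** Query, modify, query, modify, ... then commit. */
    static ToySession interleaved(int rounds, int perRound) {
        ToySession s = new ToySession();
        for (int r = 0; r < rounds; r++) { s.query(); modify(s, perRound); }
        s.commit();
        return s;
    }

    /** All queries up-front, then all modifications, then commit. */
    static ToySession queriesFirst(int rounds, int perRound) {
        ToySession s = new ToySession();
        for (int r = 0; r < rounds; r++) s.query();
        for (int r = 0; r < rounds; r++) modify(s, perRound);
        s.commit();
        return s;
    }

    public static void main(String[] args) {
        ToySession a = interleaved(10, 100);
        ToySession b = queriesFirst(10, 100);
        System.out.println("interleaved:   checks=" + a.dirtyChecks + " writes=" + a.writes);
        System.out.println("queries first: checks=" + b.dirtyChecks + " writes=" + b.writes);
    }
}
```

With 10 rounds of 100 objects, the interleaved ordering in this model performs 5,500 dirty checks and 5,500 writes, while the queries-first ordering performs 1,000 of each. In the queries-first case the pre-query flushes still run, but the session is empty so they have nothing to check. The absolute numbers mean nothing for a real application; the point is only that the interleaved cost grows with the number of queries times the number of tracked objects.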

DVCS for all the wrong reasons

A lot of virtual ink has been expended arguing the superiority of distributed version control systems (DVCS) like Git or Mercurial over Subversion, the dominant open source "traditional" VCS. Representing a substantial paradigm shift, DVCS software is hip and adoption is growing. I will argue that DVCS adoption is driven not by the essential superiority of distributed VCS (the D in DVCS) but by the temporary, accidental weaknesses of Subversion.