Monday, November 22, 2010

DVCS vs Subversion smackdown, round 3

That title is a little pretentious, since the topic of rival version control systems has probably gone a few zillion more rounds than that. The context here is that in a previous blog post I argued that the recent rise of DVCS (Git, Mercurial) is due less to their essential D (distributed) nature than to the accidental weaknesses of Subversion, the dominant centralized VCS. The weaknesses I mentioned are:
  • Poor merging
  • Lack of offline commits
Since the Subversion folks are busy rectifying these weaknesses, I questioned whether DVCS is really the wave of the future. That was round 1. Round 2 came when my colleague Pete came galloping to the defense of DVCS, and I would also include the comments in response to my blog post. This is round 3, the response to the responses.


How we got here

I'd like to step back a little and look at history once again, this time for Subversion. SVN was built to be a better CVS, and in that respect it succeeded beyond its wildest dreams. By 2007, it had become the dominant standalone SCM, among both open source and commercial products. This does not mean it had no weaknesses: for all the pain ClearCase inflicted for other reasons, SVN's merge tracking was inferior to ClearCase's even then. But by then, the "easy" stuff was done, SVN had become dominant, and I suspect the project got complacent. It had succeeded, and the motivation to sponsor developers to work on SVN probably faded.

This is not to say that the project was idle. There was bug fixing work. There was "enterprisey" stuff like security and LDAP authentication. And there was heavy infrastructural work (like the WC-NG work) going on that had deep ramifications for the back end but yielded little short-term functionality. In the meantime, DVCS products demonstrated the benefits of superior merging and offline commits. Suddenly, SVN got real competition. This woke some people up, SVN got more full-time developers thrown at it, and things got cranking again. So to those DVCS fans who asked what took SVN so long to start catching up, I'd say "thank you" to DVCS for lighting the fire under SVN. Subversion is now a full Apache project with an impressive roadmap that responds to the DVCS challenge.

The D in DVCS

Looking back at the responses to my blog post, I feel vindicated in my original position: that the attraction of DVCS has nothing to do with the D. In addition to offline commits and merging, people also mentioned performance. Fair enough. But none of these are essential to the D in DVCS. Here's my distinction: your VCS is a distributed version control system if your repository is ontologically independent of its source. That is, its existence is meaningful even if the other repository you pulled from disappeared. Git and Mercurial are true DVCS in that sense, and reflect their genesis in Linux development. E.g., Linus Torvalds's Linux kernel repository was independent from Alan Cox's: they overlapped but differed in places on which patches were accepted. While Linus was the acknowledged Linux leader, it was Alan's repository that reflected Red Hat's shipping product. SVN is not and will not be a DVCS: all local working copies derive their meaning (revision numbers, history) from the one true central repository.

This definition is independent of how much of the "remote" repository you keep locally. After all, Subversion started the trend by keeping the base revision locally alongside the modified copy. This is why some SVN operations like diffing and reverting to base are fast and do not need the central repository. But this is just a form of caching. Similarly, you could "cache" your outgoing commits: this is the promised shelving and offline commit functionality. SVN could conceivably copy the whole blooming history to your hard disk and it would still not be a DVCS, because the authoritative version is on the server and what you have is derived -- past, present or future -- from it.

This distinction of the D in DVCS is critical, as we shall see later. Being able to commit, shelve, do stuff offline and eventually push stuff to an authoritative central repository is not true D: you're just caching your outgoing commits. If people argue that DVCS systems are generally used in centralized form in practice, or if products like Fog Creek's Kiln mandate centralized DVCS (uh, oxymoron alert?) ... maybe what most people need isn't really DVCS. Maybe what they need is a central VCS with some of the benefits of these DVCS systems, something like Subversion with its weaknesses removed.

When D hurts: revisions vs changesets

In theory, Subversion's revisions and the concept of changesets are two ways of looking at the same thing. Revision 1234 simply reflects all changesets applied up to that point, with the last changeset being the difference between r1233 and r1234. The difference comes from the way DVCS handles changesets. While boosters claim that DVCS is merely a superset of VCS, this is an example where the difference leaks through painfully. The basic problem is that a DVCS changeset ID is a 40-character pile of gobbledygook that no human can remember.

There is no easy answer to this problem. Pete tries a little sleight of hand, asking me to compare r4817:4907 to bug_383. Ah, but the sleight of hand is that r4817:4907 really represents 90 changesets. If you had the foresight to add tags like:

these_ninety_changesets_that_I_know_I_will_need_to_revert_Tuesday

or

these_ninety_changesets_that_I_foresee_I_need_to_merge_to_future_branch_foo

I salute you. But in reality, if you want to arbitrarily refer to changesets, those 90 changesets that you pushed to the central repository will look something like:

8663e07862fbaf14df62d296df46c375baf14df6
82e905cf73ccec259efe6594ebb30c0bdf31125b
df31125bf70fa1529dda7d43ddb1778c07862fba
575e81d4216ac005849b53a2b1a023fc6eff61ea
(... etc).

Imagine 90 changesets like that. Though Pete claims this gunk is rarely used, when he demoed git to us he had to do a cut&paste of exactly that stuff. I also find fredrik's response to my blog weak: yeah, git lets you specify only enough characters to be unique, but trying to cherry pick and find unique characters from a few thousand changesets will make me cross-eyed. It hurts. Admit it.
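To make the abbreviation point concrete, here is a minimal Python sketch of what "enough characters to be unique" means. The helper name shortest_unique_prefixes is my own invention for illustration, not anything Git provides, and the input is just the four sample IDs listed above.

```python
def shortest_unique_prefixes(ids):
    """Map each full ID to the shortest prefix that no other ID in `ids` shares."""
    result = {}
    for full in ids:
        others = [other for other in ids if other != full]
        # Grow the prefix one character at a time until it is unambiguous.
        for length in range(1, len(full) + 1):
            prefix = full[:length]
            if not any(other.startswith(prefix) for other in others):
                result[full] = prefix
                break
        else:
            # Only reached if `full` is itself a prefix of a longer ID.
            result[full] = full
    return result


# The four sample changeset IDs from the list above.
SAMPLE_IDS = [
    "8663e07862fbaf14df62d296df46c375baf14df6",
    "82e905cf73ccec259efe6594ebb30c0bdf31125b",
    "df31125bf70fa1529dda7d43ddb1778c07862fba",
    "575e81d4216ac005849b53a2b1a023fc6eff61ea",
]

PREFIXES = shortest_unique_prefixes(SAMPLE_IDS)
for full_id, prefix in PREFIXES.items():
    print(prefix, "<-", full_id)
```

With four IDs, one or two characters are enough. The prefix you need gets longer as the number of changesets grows, and you still have to look the right one up before you can type it, which is the cross-eyed part.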

These ugly changesets are essential to the nature of DVCS, but not to VCS. You sort of need them because DVCS repositories are ontologically independent, so human-readable IDs like Subversion's are not theoretically sound. Mercurial's FAQ frowns at human-readable IDs. Here is a fundamental challenge for DVCS: must they stick to their true D nature, to the point of throwing this ugliness at us even when we don't need it?
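Here is a rough illustration of why independent repositories end up with hash IDs: a sequential number only means something relative to one counter, while an ID derived from content can be computed by anyone without coordination. This is a toy model in Python. The changeset_id function, the repository names and the commit data are all made up for illustration, and the payload layout does not reproduce the actual object format of Git or Mercurial.

```python
import hashlib


def changeset_id(parent_id, author, message, diff):
    """Derive an ID from the changeset's own content plus its parent's ID.

    Toy payload layout, assumed for illustration only.
    """
    payload = "\n".join([parent_id, author, message, diff]).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()  # always 40 hex characters


ROOT = "0" * 40  # placeholder parent for the first changeset

# Two repositories that never talk to each other each record their next change.
# With a per-repository counter, both changes would be "revision 1" and collide.
repo_a = changeset_id(ROOT, "alice", "fix scheduler", "+ scheduler patch")
repo_b = changeset_id(ROOT, "bob", "fix driver", "+ driver patch")

# The same change recorded in a third clone gets the same ID, no server needed.
repo_c = changeset_id(ROOT, "alice", "fix scheduler", "+ scheduler patch")

print(repo_a)
print(repo_b)
print(repo_a == repo_c)
```

The price of not needing a central counter is exactly the 40 characters of gunk.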

When D hurts: complexity

When Joel Spolsky leads into his Mercurial boosting by saying that Subversion users are "brain damaged" and "need a little re-education" ... maybe it is DVCS that has the complexity problem (I don't think his description of Subversion there is accurate or fair either). Pete addressed this issue simply by claiming that for one particular workflow the steps are similar (it's painfully complex for either Git or SVN, IMHO. I bet nobody but Pete uses this workflow at our office.). But it is just one specific workflow. Other operations are simpler. In real life usage, you can't beat the simplicity of update/commit, as opposed to pull/merge/commit/push. There is essential complexity in DVCS because you are working with two repositories at all times (yours and the remote one). The care and tending of both repositories is unavoidable with DVCS. Maybe the real "brain damage" is in requiring essential complexity even if you don't need it.

Living with Subversion

Subversion will take a while to get to that land of "better VCS". But even with the promised enhancements, there are a few practices that can keep us sane. During the Kiln presentation I referenced in a previous post, the guy described git as a Swiss army knife with a few chainsaws attached. In a different way, this is true also of Subversion. Subversion relies on convention rather than enforcing standard VCS practices. For example, it lets you arbitrarily "copy" willy-nilly, relying on conventions for you to decide which operation is a branch and which is a tag. It is maximum flexibility at the cost of a few chainsaws here and there, such as letting you commit to a "tag". Establishing and adhering to branch/tag conventions is important.

Another convention that would make life easier is to avoid intra-branch merging, something almost never needed. If you only branch/merge at the repository level, the merge metadata is much simpler. This is "normal" branching/merging. It is only when you do merging at sub-repository levels that the mergeinfo gets complicated. Jakub in the comments claimed that SVN's merging is impossible to get on par with Git, but that is simply due to that swinging chainsaw of intra-branch merging. Stick to top level merging and the merge model becomes identical to DVCS', and his distinction between merge tracking and tracking merged-in revisions disappears.

One thing that is totally unrelated to DVCS but is a pet peeve of mine: clobbering SVN history on refactoring. When tracing a particular source file's history, my biggest pound-head-on-screen moment is when I hit a log comment like "changed package to XYZ" and an abrupt end to the history. This happens when people commit the equivalent of a delete+add, so the newly relocated file has no revision history. When IDEs like Eclipse+Subclipse can do a proper "svn move" on refactoring, there is no excuse to do a brute force delete/add to move a file.
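The difference is easy to see in a toy model. The Python sketch below assumes a simplified log where each entry optionally records the path it was copied from. It is only meant to show why a move with history can be traced and a delete+add cannot, not how Subversion actually stores things. The trace_history function, the file paths and the log messages are hypothetical.

```python
def trace_history(log, path):
    """Walk the log newest-first, following copy-from links across renames.

    Each entry is (revision, path, message, copied_from), where copied_from
    is the old path for a move with history and None otherwise.
    """
    history = []
    for revision, entry_path, message, copied_from in sorted(
        log, key=lambda entry: entry[0], reverse=True
    ):
        if entry_path != path:
            continue
        history.append((revision, entry_path, message))
        if copied_from is not None:
            path = copied_from  # keep tracing under the old name
    return history


# A proper move: r3 remembers where the file came from.
PROPER_MOVE = [
    (1, "src/old/Foo.java", "initial version", None),
    (2, "src/old/Foo.java", "fix bug", None),
    (3, "src/xyz/Foo.java", "changed package to XYZ", "src/old/Foo.java"),
]

# A brute force delete+add: r3 is a brand new file as far as the log knows.
DELETE_ADD = [
    (1, "src/old/Foo.java", "initial version", None),
    (2, "src/old/Foo.java", "fix bug", None),
    (3, "src/xyz/Foo.java", "changed package to XYZ", None),
]

print(len(trace_history(PROPER_MOVE, "src/xyz/Foo.java")))  # all three revisions
print(len(trace_history(DELETE_ADD, "src/xyz/Foo.java")))   # history stops at r3
```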

Another word about git-svn. This note will hopefully be obsolete some day, but I do want to reiterate that git-svn does not correctly update mergeinfo, which is how Subversion tracks merges. Pete sort of defends git-svn by saying that there is inconsistent client support. IMHO, this is no defense at all. Mergeinfo is how Subversion tracks merges. This has been true since 2008. If a client fails to update mergeinfo when doing a merge, it is in effect causing data corruption in the Subversion repository. Any such client is a defective client. Use it to merge only if you know how to correctly update mergeinfo yourself (Pete does).

Did you know that you may already have local VCS? IDEs like Eclipse and IntelliJ have had "local history" functionality for some time, giving you the equivalent of local checkpointing/diff/rollback. SVN's future checkpointing functionality does not excite me too much because of this, but I realize not everyone may be using the same IDEs.

The future

I look forward to when Subversion starts to eliminate its annoyances, chief of which is poor merging. It's not so obvious from dull-sounding projects like WC-NG, but such work seems promising when you look at the details. Better conflict tracking and true rename support translate to better merge handling. The elimination of those pesky .svn directories will be a nice plus. The WC-NG work and HTTPv2 work also both promise significant performance improvements. So if we get the promised better merging, better performance and offline commits without the unnecessary pain that DVCS inflicts, perhaps Subversion can be a better solution after all. At least, I don't see much reason to prefer a DVCS to this future Subversion.
