Friday, October 29, 2010

DVCS for all the wrong reasons

A lot of virtual ink has been expended arguing the superiority of distributed version control systems (DVCS) like Git or Mercurial over Subversion, the dominant open source "traditional" VCS. Representing a substantial paradigm shift, DVCS software is hip and adoption is growing. I will argue that DVCS adoption is driven not by the essential superiority of distributed version control (the D in DVCS) but by the temporary, accidental weaknesses of Subversion.



This was driven home for me a couple of days ago when I attended the Boston session of the FogBugz and Kiln World Tour, presented by Joel Spolsky and friends. Now, Joel is a big booster of DVCS, and Kiln is built on top of Mercurial. But during the "DVCS University" presentation, what struck me about Kiln is that it requires a central, authoritative repository. In other words, they took the D out of DVCS. I asked them about the missing D, and the presenter made it clear that the centralized repository is the "right" way to do it. But if you have a centralized repository, you no longer have a distributed VCS. In other words, Fog Creek uses Mercurial as a "better Subversion". There is no D there.

Questioning the D

Like just about all DVCS advocates, Joel's crew emphasized the ease of merging as the main selling point of DVCS over centralized VCS. We know that merging with Subversion stinks, but why is good merging synonymous with using DVCS? Some also emphasize offline commits, but again, why is that essential to DVCS? I suggest that the attractiveness of Git or Mercurial in most circumstances has nothing to do with distributed version control.

Keep in mind, folks, how DVCS became popular. You may have heard of this guy named Linus Torvalds. Running the entire Linux kernel development project is a major undertaking, yet for a while Linus would not use a version control system; none suited his needs until BitKeeper came along. This DVCS worked well for him, but controversy over its commercial license forced him to give it up. In true Linus fashion, he went ahead and wrote his own, and Git was born. Mercurial, too, was written separately to replace BitKeeper. My point is that modern DVCS was born to address the needs of open source Linux development. Linux development is hierarchical: Linus gets changesets from trusted subsystem maintainers, who in turn receive code from other maintainers. A hierarchy of varying trust and authority implies a hierarchy of independent repositories. DVCS fits this use case perfectly.

But distributed, independent source repositories are not how most commercial development works. You want backups and security. You don't want team members to "go dark" with their work for extended periods only to push mega-changes onto your team's code, breaking everything in sight. Early, frequent integration means you want to keep your code unified. So in practice, your source repository will be centralized. When people long for Git or Mercurial, it's because they want better merging in their VCS, and maybe offline commits. They are not clamoring to maintain peer source repositories. There is no D there.

Is merging that bad?

As an aside, I already acknowledged that Subversion stinks at merging. Yet I frequently do merges with Subversion, and almost always painlessly. Subversion merging is not as bad, and DVCS merging is not as effortless, as the DVCS boosters make them sound. Subversion bashers don't always acknowledge that modern Subversion tracks merged changes (mergeinfo). Also, Git users can turn their zeal into a self-fulfilling prophecy by using git-svn to do their SVN merging: that tool doesn't handle mergeinfo, subverting Subversion's merge metadata and ensuring future grief. In my experience, the real pain of merging is essential rather than accidental: there are real code conflicts, and a real human needs to reconcile them line by line. I have read a lot of hand-waving, but I have never seen a concrete example of how a DVCS better handles essential merge conflicts.
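To make the mergeinfo point concrete, here is a minimal, self-contained sketch (assuming `svn` and `svnadmin` 1.5 or later are on your PATH; the repository layout, file names, and messages are purely illustrative) showing that a native `svn merge` records what was merged:

```shell
# Build a throwaway repository with a trunk and a feature branch,
# then merge the branch back and inspect the recorded mergeinfo.
set -e
TMP=$(mktemp -d); cd "$TMP"
svnadmin create repo
URL="file://$TMP/repo"
svn mkdir -q -m "layout" "$URL/trunk" "$URL/branches"
svn checkout -q "$URL/trunk" wc && cd wc
echo "v1" > file.txt
svn add -q file.txt && svn commit -q -m "add file"
svn copy -q -m "branch" "$URL/trunk" "$URL/branches/feature"
svn switch -q "$URL/branches/feature"
echo "v2" > file.txt && svn commit -q -m "change on branch"
svn switch -q "$URL/trunk"
svn merge "$URL/branches/feature" .   # records svn:mergeinfo on the target
svn propget svn:mergeinfo .           # lists the merged-in branch revisions
```

The last command is the point: Subversion itself remembers which branch revisions have landed on trunk, which is exactly the metadata git-svn bypasses.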

Back to the future

What if you can get the benefits of better merging and offline commits without distributed versioning? Well, check out Subversion's development roadmap. Future enhancements include:
  • Commit shelving: "shelve" your current work, revert the working copy to do other work, then "unshelve" later to resume where you left off.
  • Checkpointing: a form of offline commit that lets you "commit" a revertible stack of changes while offline.
  • Better merging from a revamped working-copy (WC) metadata library
  • Rename tracking (Finally! Will help merging)
  • Improved tree conflict handling
These changes offer enhanced merging and offline commits while keeping Subversion's identity as a centralized VCS. Shelving and checkpointing are really ways to let you defer (and roll back) your commits, but they do not set up an independent repository. There is no D there.
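None of these features exist yet as I write this, but based on the roadmap descriptions, a shelving workflow might look something like the following (the command names are purely hypothetical):

```
svn shelve bugfix-attempt     # stash the current changes locally; working copy reverts to clean
# ...do and commit other work...
svn unshelve bugfix-attempt   # restore the shelved changes and resume
```

Note what is absent: no second repository, no push, no pull. The shelf lives alongside your working copy, and the one central repository remains the single source of truth.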

If a Subversion user can get the primary benefits of DVCS without switching to a DVCS, why switch? If "DVCS" for you means getting better merging and offline commits, and you are already using Subversion, then you do not need to seek DVCS. DVCS will come to you. There are those who will genuinely need the distributed repository nature of a DVCS, but I suspect that number is far smaller than those who merely want better merging.

If "Future Subversion" is all you need, should you still go with a DVCS? DVCS advocates argue that a DVCS offers a superset of traditional VCS features. The problem with that argument is that a superset you don't need adds conceptual complexity you don't need. If all I need is an "svn commit", I should not have to execute both an "hg commit" and an "hg push". Consider: this is what a Subversion revision number looks like:

  • 7843
This is a convenient, readable number that I get for free from Subversion without tagging or other gymnastics. It gets used in CI builds and bug reports. This is what a Git/Mercurial changeset ID looks like:

  • 1438e82fac1c2191394e67257b7b94e05c7caa2f
Which would you rather use?
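The two-step commit-then-push ritual is easy to demonstrate. Here is a minimal, self-contained sketch using Git for illustration (Mercurial is analogous: `hg commit` followed by `hg push`); the repository and file names are made up:

```shell
# With Subversion, a single `svn commit` both records and publishes a change.
# With a DVCS, those are two distinct steps, even against a central repo.
set -e
TMP=$(mktemp -d); cd "$TMP"
git init -q --bare central.git          # stand-in for the team's shared repository
git clone -q central.git work && cd work
echo "fix" > pager.c
git add pager.c
git -c user.name=dev -c user.email=dev@example.com \
    commit -q -m "Fix off-by-one in pager"
# At this point the commit exists only in the local clone;
# nothing reaches central.git until the separate publish step:
git push -q origin HEAD
```

For a team that keeps one authoritative repository anyway, the second step is pure ceremony; forget it and your "committed" work is invisible to everyone else.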

Reality check

Of course, "Future Subversion" is not here yet. We'll see if the Subversion project can deliver within its promised timeframes. And I will also grant that DVCS offers more practical advantages than just better merging. All I wanted to point out in this post, if you have not already fallen asleep reading this far, is that:
  • The primary motivations driving DVCS adoption have little to do with the concept of DVCS
  • If Subversion improves to the point of eliminating those advantages, would the idea of DVCS still be as compelling?
Update 11/22/10: Pete posted a response below. Here is a proper link. I have blogged a response to his post and other comments.

    8 comments:

    1. Hey Chris,

      I'd argue that:
      a) distributed systems don't have to be hierarchical,
      b) "offline commits" really just simulate the distributed repositories

      A hub-and-spoke network with repositories at the end of the spokes is definitely distributed. It's just not a full mesh network.

      That said, I do think that git's complexity is a bit much. Sure, you can "shelve" (a.k.a. "stash") and branch and merge as much as you like and then push to whomever you like, but 99% of the time, that's not what git users do.

      People either stash or branch locally (but not both) and commit to their local repo. Then, when they have something solid, they push to a central repository.

      This is how github was meant to be used.

      I don't see any mention of local branching in this "Future Subversion". I've yet to actually do that, but I really like the idea. You make a good point about developers "going dark", though, so maybe all branching should be visible to the entire network.

      It seems that for 99% of the use cases, this "Future Subversion" will be just as good as git.

      But still… where is it? And why has it taken them so long to add these features? I've already moved over to git for all of my personal projects. It's not too late for subversion to come back into my life. They just better upgrade it soon. Really soon.

    2. I don't think it would be easy, if possible at all, for Subversion to get merge support on a par with Git's, especially in more complicated situations (file renames, criss-cross merges, etc.). The difficulty of merging in Subversion is caused by its branching model (branches are copies, and branch history is just history path-limited to the folder representing the branch).

      The new `svn:mergeinfo` property (which, IIRC, both server and client have to support) is not about merge tracking (and saying so in the documentation is misleading), but about remembering merged-in revisions.

      See also description in this post (and other in its thread): http://thread.gmane.org/gmane.comp.version-control.git/158940/focus=159438

    3. Let's assume that Subversion somehow gets an equivalent of "git bisect" / "hg bisect" (it can be hard with the branching model Subversion uses), which can be used to find the commit that introduced a bug (something that is probably called "diff debugging" in Subversion), or that Subversion users find such a commit using "svn annotate" / "svn blame".

      Alternatively, assume that you have some kind of commit review in work; that each change or set of changes has to be accepted into 'trunk'.

      In any of those cases you would want your commits (revisions) to be small, self-contained, and not to introduce a "non-compilable" state (a state which perhaps does not contain the bug being searched for, but cannot be tested). But who creates his or her commit (revision) history perfectly on the first try? To be able to work on and re-edit a series of commits, rather than a single large commit, you need to separate the act of committing (for safety) from the act of publishing. And this is possible only in a DVCS.

      Therefore I think the cognitive burden of having to do "push" after a "commit" (or multiple "commit"s) is worth it.

      While it is true that the Git and Mercurial DVCS were created with the heavily distributed development of the Linux kernel in mind, I think that hierarchical repositories can also be of use in corporate development. Think about code promotion steps corresponding to separate repositories, and about not having to give "commit bits" to less trusted developers: consultants, interns, junior developers.

      "Going dark" is, in my opinion, just going too far with what I think is a positive thing: not littering the central public repository with work-in-progress, proof-of-concept, and failed-attempt commits; all of which need to be version controlled, but should not be made visible.

    4. In my opinion, SVN is over. You left out the issues of speed, file handling, the tedious ignores system, the .svn folder litter, and the issues with file copy/move, to name only a few.
      While I would not advocate git, since its usability is best described as hostile, there are good alternatives such as recent Mercurial.
      And the argument that the central repository is enforced by the versioning system only shows the lack of corporate governance in place. Plus, with offline commits you end up with a poor man's DVCS.
      SVN's place now is beside CVS in the museum of deprecated technology, if you ask me.

    5. We switched our project from ClearCase (I know, I know) to Git about a year ago, and we're never going back.

      Your article makes some good points with regard to "going dark" and such, but in practice I don't think this will be a problem in any reasonably managed project. People would notice that this person "never" pushes his or her changes out.

      In my mind, your article doesn't touch on the #1 reason for going for a distributed VCS: Performance.

      We looked at various alternatives, including Subversion, Git and Mercurial. Subversion was discarded for mainly the same reason as ClearCase: It's slow! Everything you do needs to talk to the server.

      In a DVCS, you have the whole project history locally, which means you can do pretty much everything without touching the network. You can show logs, diff versions, check out a different version, and so on and so on. In our measurements, Git was often two to three orders of magnitude faster on these operations than ClearCase, purely by virtue of having all the commits and history on the local harddisk.

      Until you start working with a DVCS daily, you can't really begin to appreciate how much this matters. Previously, whenever doing a commit etc. we would have to wait anywhere from a minute upwards, and what happens then? You start reading mail, blogs and so on, and next thing you know you have completely lost the coding flow.

      In addition, by requiring a server connection you have introduced a single point of failure. We once had three days of downtime on our ClearCase server due to a faulty disk controller, and progress in the project ground to a halt. With Git we could have happily gone on working, and even synced directly with each other if needed. Once the central repo was up again we would just push everything there.

      Regarding your last point against DVCSs: Are you aware that (in Git at least) you don't need to specify the whole SHA1 hash to reference a commit, only as many characters as needed to make it unique? Even in large projects like the Linux kernel, the first 5-6 characters of the SHA1 should always be enough. So in your example it would rather be comparing 7843 to 1438e8. Doesn't seem quite that bad to me.

    6. Regarding the "Distributed" versus "Centralized" argument, I think you're missing the point that Linus' repo is the center of linux kernel development. He may be pulling from various downstream lieutenants, but he's effectively acting as the server hosting the master repository. Anyone who wants to pull pulls from Linus. They check in lower downstream, and code works its way upstream gradually. If you recall the DVCS University presentation, it should sound very familiar. Saying that Linus' repo is more "distributed" than a traditional SVN server is a misnomer.

      Secondly, I think you've glossed over the two biggest benefits that Mercurial (et al) provide over traditional source control.

      The first, as has already been mentioned, is performance. Mercurial is super fast, and committing, logging, and diffing on the local repo are tremendously performant. SVN is slower by orders of magnitude in this regard. There's absolutely no contest here.

      The second is that I get the benefit of revision control regardless of whether I choose to push upstream. I.e. I get versioning on my machine, without anyone else ever seeing my code. This gives me a chance to test, iterate, and roll back before I involve anyone else in my code, which means I don't break stuff.

      With Mercurial, tasks like the second point above (and the scenario with Linus in the first paragraph) are completely revolutionized -- not because Mercurial gives you tools (diffing and merging) that you didn't have with SVN, but because they make a concept that was complicated (upstream and downstream repositories) absolutely effortless.

    7. As of 2011-02-15: SVN 1.7.0 is feature complete.

      Now let's have a look at your list and what was done:

      * WC-NG: new metadata library. While it fixes the .svn directory littering in the workspace, there is no mention of better merging.

      What else? I don't see anything else...
