Summer of Code 2009: Fast Darcs

It’s not so much of news anymore, but my SoC 2009 proposal got accepted. Apart from other things, it means that I’ll be reporting semi-regularly to this blog about it. This first post is sort of an introduction.

About the project

My project revolves around the idea of fast darcs for medium and large repositories. Three are quite a few haskellers who use darcs in their day to day (haskell) work. A fair number of hackage packages is maintained in darcs. Even though many of these repositories are of a relatively modest size, there is a number of relatively large real-world darcs repositories out there. The primary target of the project is to improve scalability of darcs for large working trees. This should help those users with existing large darcs repositories, as well as encourage people to use darcs for larger projects, whenever the development model fits.

Unfortunately, this work alone is not sufficient to make managing a large project with darcs all smooth sailing. There is a number of improvements that needs to be made, not the least in the darcs core. Nevertheless, this project should bring darcs one step from “not recommended for large projects” to “reasonably well suited for large projects” where we would like to be a few years from now.

And excitement? Hard to tell. The project is sort of boring. In fact, it’s designed to make darcs less noticeable in day to day operation. What might get you, as a haskeller, excited is probably that I intend to make the darcs working tree handling comparably fast to git. And then, git is written in C, hand-tuned for a specific operating system. And unlike mercurial, I do not plan to introduce a C library for low level routines. So let’s prove that Haskell is up to the challenge.

Optimising darcs

Darcs is known to have performance issues when repositories extend beyond certain size. There are several “directions” in which repositories grow — out of these, size of the working tree and length of history my primary focus. Common commands involving the working tree should be comfortable to use and should not introduce delays in workflow. Moreover, big repositories should be stored efficiently and it should be possible to efficiently retrieve both full repositories and new patches remotely.

Goals

To formulate some tangible goals, we should establish some context. In darcs, operations that involve working tree usually need to compare this working tree to its pristine cache, to produce a patch, to either show it to user, or to record it, or to revert it, and so on.

Implementation of this pristine cache is crucial for performance. Producing an implementation of pristine cache that would allow high performance comparisons is first and foremost of the goals. However, even though speedy execution is important, robustness is absolutely vital. Pristine corruption can directly lead to producing corrupt patches which might cause serious issues to repository integrity, possibly even leading to serious data loss. The “hashed” pristine format used by darcs 2.x is resilient against this kind of corruption, and the proposed pristine format is required to keep this property.

In addition to using the pristine cache for local operations, the cache is also retrieved when a copy of remote repository is being made (through “darcs get”). This means that it needs to be possible to efficiently obtain the pristine cache through HTTP requests, and this is another requirement for our implementation.

Apart from pristine cache, there are quite similar requirements for patch storage. This partially relates to the “long history” kind of repository, and it would be beneficial to share as much code between pristine cache and patch storage as possible. The second goal is to provide such patch storage.

The third goal is to make the code reusable in a way that would make it possible to integrate it with the Camp project, that is being worked on by Ian Lynagh. Camp makes improvements in core algorithms for patch manipulation, and should help address several other of the “size” problems darcs currently exhibits (number of active branches, conflicting branches, conflict resolutions and such).

Apart from working tree and repository format in general, there is plenty room for performance improvement in other areas of the code. If time’s left after addressing the primary goals, it will go into optimising neighbouring areas of code: patch application comes to mind (the most expensive operation in local “darcs pull”).

Design details

Proof-of-concept implementation is being actively worked on. The darcs repository can be found at http://repos.mornfall.net/hashed-storage (and http://repos.mornfall.net/gorsvet for a darcs-compatible experimental client). The current iteration of the code comes within a factor of two of equivalent git operation (whereas original darcs 2 code for hashed repositories falls between factor of 15 to 20 when compared to git). The first mentioned repository contains implementation of the working tree indexing code. The code is partially haddocked, in near future I’ll be mainly working on design documentation (inline in the code). Please don’t hesitate to ask if you have questions!

Some initial benchmarking and commentary can be found in following darcs-users@ posts:

http://lists.osuosl.org/pipermail/darcs-users/2009-January/017521.html http://lists.osuosl.org/pipermail/darcs-users/2009-February/017822.html http://lists.osuosl.org/pipermail/darcs-users/2009-February/017862.html

Note that gorsvet is primarily a benchmarking and testing platform.

The generated haddock documentation for hashed-storage can be found at http://repos.mornfall.net/hashed-storage/dist/doc/html/hashed-storage/. Please note that this is work in progress.

Deliverables

A debugged and documented hackage library for hashed storage manipulation. A repository format for camp based on such a library, with integrity guarantees of hashed pristine and patch storage (camp lacks those with its current format) and with performance within factor of 2 from git, for working tree operations.

I intend to patch darcs to use the hashed-storage library throughout — almost certainly after the 2.3 release — and the improvements thus become part of the current stable 2.x series of darcs.

Benefits to Haskell Community

Darcs is currently one of the most widespread version control tools in use by Haskell projects. There have been numerous complaints, not least from the GHC team, about performance issues. Alleviating these issues would certainly benefit a number of Haskell projects (and a number of non-Haskell projects).

Moreover, since we intend to compete directly against git’s performance, darcs can become a showcase of high-performance Haskell programming, rivalling C in IO and system-interaction intensive workloads, in very real-world conditions.

Last, but not least, we use bottom-up design. The foundational code for the project will become a standalone hackage library.

Why?

I am a long-time darcs user. I believe that darcs has lots to offer over more traditional history-oriented version control systems. Both of these motivate me to work on darcs. The first one is clear — I want my tools to work well and not get in my way. Even more importantly, in team projects, I hate when darcs gets in the way of other team members’ work. And second — I would hate the darcs-style tools to die out. Darcs really brought innovative features to the table — and while the easy ones were picked up by other tools (take the interactive record for example) — the hard, and more important, core features never made their way there. I understand there are practical issues that prevent such features to find place in the rigid DAG-based models of the traditional DVCS.

Since I already use darcs, to me it makes more sense to improve darcs (and potentially camp), than it makes to try bolt the darcs core features onto a DAG-oriented DVCS — a task that those DVCSes themselves do not dare to tackle. Moreover, Haskell is sexy.