Tuesday, October 21, 2008

Designing a mergeable filesystem structure

In my previous post, I have argued that it would be better if filesystems could be merged without human supervision. This could only happen if conflicts were impossible, but with today's filesystems, merge conflicts are actually inevitable! Let's design a filesystem without this flaw.

My study of conflicts has led me to the study of a family of data-structures which I call commutative types. These structures are abstract data types, in the sense that you can only modify them through their public methods. What distinguishes commutative types is the additional restriction that the internal state of the object must not depend on the sequence of operations which have been performed on it so far, but on the set of operations which have been performed. In other words, the order in which the operations are applied must have no impact.

For example, consider an object whose internal state is a single integer. If the only update operation provided were increment() and decrement(), then this object would be commutative. This is because incrementing and then decrementing yields the same result as decrementing first, then incrementing (we're ignoring overflows for now). Similarly, if double() and triple() were the only operations provided, then the object would again be commutative. On the other hand, if the same object allowed both increment() and double(), then that object would not be commutative.

x + 1 − 1 = x − 1 + 1
x × 2 × 3 = x × 3 × 2
(x + 1) × 2 ≠ (x × 2) + 1

Integers are well studied structures for which commutative operations are known, but in order to represent a filesystem, we will need a commutative object with a much more complicated internal state. Let's design one! What are the requirements?

Well, here's an easy one. Suppose we write() a value in a particular file. If we later read() this same file, then we ought to obtain the value we wrote in it. Now, if we read() the file before we write to it, there's no way we're going to read the same value. Does this mean that any filesystem supporting both read() and write() can't be commutative?


Yes and no! It was a trick question. Remember, we're only concerned about the internal state of the object here. Since read() doesn't change this state, it actually commutes with anything. In particular, read() commutes with write(). Unfortunately, this still doesn't make the filesystem a commutative object, because write() doesn't commute with itself...

If we write("foo") to a particular file, then write("bar") to it, then the final state definitely isn't going to be the same as if "bar" was written first. So our commutative filesystem can't even have a write() method.

That's a pretty drastic restriction! We could, however, have a write_xor() method. Instead of replacing the file's contents with its argument, write_xor() would bitwise exclusive-or the content of the file with it. This looks weird and useless, but it would actually produce an okay filesystem. You could still write any value you wanted to the file, you would just need to zero-out the file first, by xor-ing the file's contents with itself. Writing a shorter value in this fashion would leave a padding of zeroes at the end, but we could introduce an end-of-file character to alleviate the problem. All in all, I'm not too dissatisfied with this minimalist interface. And it commutes! You will never see a merge conflict again!

...in theory. But in practice, the result of a merge will be an unreadable mess of binary characters, which is even worse than a conflict. I'm not giving up yet, as I've got plenty of other ideas, but let's keep this so-so idea for now. Because there's another challenge to solve: directories!

If the two branches we want to merge both define the same filename, but one as a file and the other as a directory, what should we do? Should one take precedence over the other? That would work, and it would probably commute, but I have a better solution. What if we kept both? With today's filesystems, every directory's got two names: /foo/bar and /foo/bar/. The first form is usually favoured, out of habit more than anything else, I think. Well, let's not let the choices of the past prevent us from picking the choice of the future today. I'll allow /foo/ to contain both a file named bar and a directory named bar/! And my trees will commute at last!!!

Wait, I do sound more crazy than usual when I say it like that.


Martijn said...

That's really interesting. :-) Thank you for explaining! And I hope you find a good solution to this problem, because, yes, merge conflicts are from hell.

gelisam said...

Many years later, Robin Houston used Category Theory to find a text representation which allows two write operations to commute: