Sunday, August 19, 2012

File names? Who needs them?

While trying to work out a problem with name collisions in TagFS, I recalled a somewhat radical idea I had near the start of the project: completely removing user-specified file names from the system. When I initially thought of the idea, it was mostly an extrapolation of the ontology I was building for TagFS. It seemed to me that what defined a file was
  1. The tags attached to the file (naturally)
  2. The actual content of the file
and absolutely nothing more. The file name is a kind of metadata which has at times served as primary data, but I didn't see it as something innate to what it named; it was merely a convenient tag for the data. I back-pedaled from the idea that the system was workable without a dedicated name (à la -booru imageboards) because several situations came to mind where the name is important to how a system works with data: Makefiles, Java source code, C include directives (programming in general, really), but also file extensions, URIs, simple data organization schemes, and the list goes on. Generally, we rely on files having names and on having canonical ways of accessing files based on those names, and so I compromised my ideology by making 'name' a tag and storing file names in the 'name' value (possible thanks to an extension to my simple tags that was little used elsewhere). Wherever the file name was required, the name tag was read in behind the scenes. From there, I moved on and let the no-file-name idea drop.

Eventually, the special case of the 'name' tag started to create problems in the organization of my code (it was awful), so that kludge, along with a good deal of equally stinky code, was replaced. File names became first-class metadata again and things worked relatively well. While I was working in this new system, though, I realized that some situations were not covered by simply preserving file names. When I store a file, it gets placed in a sort of bucket for each of its tags. To pick out a file, we have to specify which buckets we want to look in and the name of the file we want to find in those buckets. If the file is missing from any one of the buckets, we return a "not here" value. This generally works by taking a path like "/tag1/tag2/tag3/file_we_want", translating the dirname into a list of buckets to check, ("tag1", "tag2", "tag3"), and checking for "file_we_want" in each bucket. Our problems start when we have files that share a name and some set of tags. There are 3 cases where this happens:
Let Name(File_a) == Name(File_b).
  1. File_a has tag set A, File_b has tag set B, A is a subset of B
  2. the same, but B is a subset of A
  3. File_a has tag set A, File_b has tag set B, and there exists a non-empty tag set C such that C is a subset of A and C is a subset of B
The case where the tag sets of File_a and File_b are equal isn't a problem since that implies that they are the same file.
In all of these cases, there is a set of buckets from which the file we are seeking can't be retrieved by knowing only those buckets. We also need to know the order in which the files were put in the buckets and how that order affects which file we see or, in case 1 or case 2, which file still exists. There are some schemes where most of the user-supplied names would be preserved, and even one I've thought of where files are stored in a stack rather than being overwritten. None of these seems effective or natural, so I don't want to bother with them. After thinking over this problem I recalled the no-file-name concept and decided to give it another chance.
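To make the collision concrete, here is a toy model of the bucket lookup described above. It is only a sketch and not TagFS's actual code: the class name NameKeyedStore, its methods, and the choice to let a new file silently overwrite a same-named entry in a bucket are all assumptions made for illustration.

```python
from collections import defaultdict


class NameKeyedStore:
    """Toy model (hypothetical): one bucket per tag, mapping file name -> content."""

    def __init__(self):
        self.buckets = defaultdict(dict)

    def add(self, name, tags, content):
        # Assumption: adding a file replaces any same-named entry in each bucket.
        for tag in tags:
            self.buckets[tag][name] = content

    def lookup(self, path):
        # "/tag1/tag2/name" -> buckets ["tag1", "tag2"] and the name "name".
        *tags, name = path.strip("/").split("/")
        found = None
        for tag in tags:
            bucket = self.buckets.get(tag, {})
            if name not in bucket:
                return None  # the "not here" value
            found = bucket[name]
        return found


# Case 1: File_a has tags {work}, File_b has tags {work, draft}, same name.
first = NameKeyedStore()
first.add("notes", {"work"}, "content of File_a")
first.add("notes", {"work", "draft"}, "content of File_b")

second = NameKeyedStore()
second.add("notes", {"work", "draft"}, "content of File_b")
second.add("notes", {"work"}, "content of File_a")

# Same files, same path, different answers depending on insertion order.
print(first.lookup("/work/notes"))
print(second.lookup("/work/notes"))
```

In the first store File_a is gone entirely; in the second, "/work/notes" and "/draft/notes" disagree about what "notes" is. Knowing the buckets alone is not enough to say which file you will get.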

Setting my earlier misgivings aside, how does removing file names solve my identity crisis? We still keep identifiers, but we ensure that they are unique. That's easy because every file that's added to TagFS gets a brand-new id number. By identifying files this way we are guaranteed not to have any collisions, so the precondition for my problem cases can't happen (as long as you make fewer files than a 4-byte integer can count). The only considerations worth worrying about, then, are first, how a human user identifies files based only on tags and an id number, and second, and more importantly, how we can do with tags all of the things we did with file names.
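As a rough illustration of the id-based scheme, the sketch below keys every bucket by a freshly allocated id instead of a user-supplied name. The names IdKeyedStore, add, lookup, and list_ids are hypothetical, and the sketch uses Python's unbounded integers, so it does not model the 4-byte limit.

```python
import itertools
from collections import defaultdict


class IdKeyedStore:
    """Toy model (hypothetical): buckets hold file ids, never user-chosen names."""

    def __init__(self):
        self._next_id = itertools.count(1)  # every new file gets a brand-new id
        self.buckets = defaultdict(set)     # tag -> set of file ids
        self.contents = {}                  # file id -> content

    def add(self, tags, content):
        file_id = next(self._next_id)
        self.contents[file_id] = content
        for tag in tags:
            self.buckets[tag].add(file_id)
        return file_id

    def lookup(self, tags, file_id):
        # The file must be present in every requested bucket.
        if all(file_id in self.buckets.get(tag, set()) for tag in tags):
            return self.contents.get(file_id)
        return None  # the "not here" value

    def list_ids(self, tags):
        # Ids of all files carrying every one of the given tags.
        sets = [self.buckets.get(tag, set()) for tag in tags]
        if not sets:
            return sorted(self.contents)
        return sorted(set.intersection(*sets))


store = IdKeyedStore()
a = store.add({"work"}, "content of File_a")
b = store.add({"work", "draft"}, "content of File_b")

# Both files survive, in either insertion order, because their ids never clash.
print(store.lookup(["work"], a))
print(store.lookup(["work"], b))
print(store.list_ids(["work"]))
```

Because ids are never reused, nothing is overwritten and the result of a lookup no longer depends on the order in which files were added. What this does not answer is how a person is supposed to pick the right id out of a bucket.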

I won't answer those questions because I doubt that anything I could come up with would be as worthy a solution as whatever people would come up with while actually using TagFS. Different workflows would have to be designed for some tasks, and file management utilities would have to be redesigned. These aren't minor hurdles to overcome (far from it), but they don't present a fundamental challenge to the idea of removing file names.