How git actually stores files

The popularity of git seems to rise and rise. With that growing popularity, I feel that a lot of discussions about version control systems get more emotional and less technical than ever. The choice of something so central to your development process should not be guided purely by emotions and arguments like “It’s good enough for the Linux kernel, why should it not be good enough for us?”, especially as that argument does not take into account any of your usage patterns or what you expect from a version control system. This is by no means a “git is bad” post. I just want to share a little knowledge about git and follow that up with some blog posts about other version control systems like Mercurial or Subversion, so we can actually have a meaningful conversation about where we should go from where we are.

A few years back the company I worked with migrated to git, and this was the first time I actually had to manage a larger git repository and deal with all the implications of that. Some months later `git fsck` showed some errors about duplicate entries in trees, so that was when I actually had to learn how git works. And I stumbled, and then I stumbled some more. A lot of resources make it sound as if git just stored a bunch of files that are nothing more than compressed and hashed text files or blobs. There are tons of websites and conference talks that make it sound like you can build a working git client within an hour. But let’s talk about pack files, because this is where the magic of git begins in my opinion.

First, let’s acknowledge that the git object model really is that simple. There are blobs, which are just the compressed content of a file you added to your repository. Then there are trees, which are just lists of files and folders (folders are stored as trees again). And finally there are commits, which point to a certain tree and represent the repository content at a certain time, together with some meta information like who made the commit and when. Yes, there are more things, like branches or tags, but let’s keep it as simple as possible for the moment. This structure is the git object model, a logical view on the data, and it is just that simple. Objects stored this way, one file per object, are called loose objects.
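To make the loose object format a bit more concrete, here is a minimal Python sketch of how a blob could be hashed and stored. It assumes a repository that uses the classic SHA-1 object format, where a small header with the type and size is put in front of the content before hashing and compressing. The function names are made up for this illustration and do not come from any git library.

```python
import hashlib
import zlib


def blob_payload(content: bytes) -> bytes:
    # Header is the type, a space, the size in decimal, and a NUL byte.
    return b"blob " + str(len(content)).encode("ascii") + b"\0" + content


def blob_id(content: bytes) -> str:
    # The object id is the SHA-1 over header + uncompressed content
    # (assumption: a SHA-1 repository, not a SHA-256 one).
    return hashlib.sha1(blob_payload(content)).hexdigest()


def loose_object_path(object_id: str) -> str:
    # Loose objects live under objects/<first 2 hex chars>/<remaining chars>.
    return f"objects/{object_id[:2]}/{object_id[2:]}"


def loose_object_bytes(content: bytes) -> bytes:
    # What ends up on disk: header + content, zlib-compressed.
    return zlib.compress(blob_payload(content))


if __name__ == "__main__":
    oid = blob_id(b"hello world\n")
    print(oid, loose_object_path(oid))
```

The id printed for `hello world\n` should match what `git hash-object` reports for a file with the same content.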

But now we have to look at repository size. If we just stored every version of every file as its compressed content, we would waste a lot of space. A deliberately simple example: we have a 10MB text file, which compresses to 3MB. Great, we just “won” 7MB. But the next time we change something, let’s say just a single word, we have to store those 3MB again, over and over, for every change. This would be really inefficient, right? The same goes for trees. Usually not much changes in a tree: maybe a file was added, maybe a file was removed, but most of the time, when working on an existing codebase, the only change to a tree object is that the hashes of some blobs are different.
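The arithmetic above can be tried out with a few lines of Python. This is only a rough sketch: the generated text, the vocabulary and the helper names are invented for the experiment, and plain `zlib` stands in for what git does with a loose object.

```python
import random
import zlib


def stored_size(versions):
    # Total bytes if every version is kept as its own compressed copy.
    return sum(len(zlib.compress(v)) for v in versions)


def make_versions(word_count=200_000, seed=0):
    # Build a large pseudo-random text and a copy with a single word changed.
    rng = random.Random(seed)
    vocabulary = [f"word{i}" for i in range(5000)]
    words = [rng.choice(vocabulary) for _ in range(word_count)]
    original = " ".join(words).encode("ascii")
    words[word_count // 2] = "CHANGED"
    edited = " ".join(words).encode("ascii")
    return original, edited


if __name__ == "__main__":
    original, edited = make_versions()
    print("one version :", stored_size([original]))
    print("two versions:", stored_size([original, edited]))
```

Storing the second version separately roughly doubles the space, even though only one word is different.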

This is where the garbage collector of git kicks in. It takes all those loose objects, which are just the compressed content of either a file, a tree or a commit (commits are plain text, trees are a compact binary list of entries), and packs them into one big file, called a pack file. This magically makes all our space and efficiency problems go away, so garbage collection in git is certainly a very good and very important thing. Sadly, none of the git libraries for the programming language of your choice that I have seen so far implement garbage collection. This means that if you use the git client of your choice, and use it exclusively, there is a real chance that it never triggers garbage collection and lets your local repository grow to an unreasonable size and become unreasonably slow. So if you are one of those who get a fresh clone every now and then to make it small and speedy again, a `git gc` on the console is probably faster and better 😉

Alright, so what do these magical pack files look like? The official documentation is here. It is actually pretty good, maybe a little too short and without real examples, but I managed to implement a pack file reader in a reasonable amount of time just from the documentation. The exception was the extension for very large pack files (larger than 2GB, where the index switches to 8-byte offsets): there I had to read the source code of other implementations, as something was wrong with the offsets in my implementation, and that took some time, especially as I really wanted to implement it with just the documentation, which I ultimately failed at. But the point is: dealing with pack files is much harder than dealing with loose objects, and anybody who claims to be able to teach you how git stores all its files and how to write a client for that in an hour is most likely only talking about loose objects and not about pack files at all.

Each pack-file is accompanied by an index file, which stores the offsets to each object in the pack-file. As the object hashes in the index file are sorted we can do a binary search on it to retrieve the offset in the pack file pretty fast. After we have the offset we can look into the pack file at the given offset and retrieve some data about our object. There is information about its type (like blob, tree, commit), the size of the data to read and if it is a delta or not.
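The lookup described above could be sketched in Python roughly like this. It assumes a version 2 index file and 20-byte SHA-1 object ids, and it reads everything from in-memory `bytes` for simplicity. The function names are my own and not taken from any git implementation, and checksums and error handling are left out.

```python
import struct
from typing import Optional, Tuple

IDX_MAGIC = b"\xfftOc"  # signature of a version 2 index file
TYPE_NAMES = {1: "commit", 2: "tree", 3: "blob", 4: "tag",
              6: "ofs_delta", 7: "ref_delta"}


def find_pack_offset(idx: bytes, object_id: bytes) -> Optional[int]:
    """Binary-search the sorted ids and return the pack offset, or None."""
    magic, version = struct.unpack_from(">4sI", idx, 0)
    if magic != IDX_MAGIC or version != 2:
        raise ValueError("not a version 2 pack index")
    # fanout[b] = number of objects whose first id byte is <= b
    fanout = struct.unpack_from(">256I", idx, 8)
    total = fanout[255]
    names = 8 + 256 * 4            # sorted 20-byte object ids
    crcs = names + total * 20      # one CRC32 per object (unused here)
    offsets = crcs + total * 4     # one 4-byte offset per object
    large = offsets + total * 4    # 8-byte offsets for very large packs
    first = object_id[0]
    lo = fanout[first - 1] if first > 0 else 0
    hi = fanout[first]
    while lo < hi:
        mid = (lo + hi) // 2
        name = idx[names + mid * 20: names + mid * 20 + 20]
        if name < object_id:
            lo = mid + 1
        elif name > object_id:
            hi = mid
        else:
            (off,) = struct.unpack_from(">I", idx, offsets + mid * 4)
            if off & 0x80000000:  # high bit set: index into 8-byte table
                slot = off & 0x7FFFFFFF
                (off,) = struct.unpack_from(">Q", idx, large + slot * 8)
            return off
    return None


def parse_entry_header(pack: bytes, offset: int) -> Tuple[str, int, int]:
    """Return (type name, uncompressed size, position after the header)."""
    byte = pack[offset]
    pos = offset + 1
    kind = (byte >> 4) & 0x07   # three type bits
    size = byte & 0x0F          # lowest four size bits
    shift = 4
    while byte & 0x80:          # continuation bit: more size bytes follow
        byte = pack[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return TYPE_NAMES[kind], size, pos
```

The two delta types in `TYPE_NAMES` are how the entry header tells you that what follows is a delta and not a full object.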

This is important: here git works with deltas! Conceptually we are always only dealing with blobs, trees and commits, but on the file system, in our pack files, we have deltas. You can view these deltas as an implementation detail, as this is the only place where they are relevant. If we look back at our example, the large text file where we only changed a word, git only has to store the delta on disk. Not the 3MB of compressed data, but just the one word we changed and the instructions for restoring the object from its base object, which may itself be another delta, and so on. Git only stores deltas against objects of the same type, so there is no delta from a blob to a tree or anything like that. This makes git great in terms of repository size.
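To illustrate what such “instructions” look like, here is a small Python sketch that applies a delta to its base object. It follows the copy/insert instruction format as I understand it from the pack format documentation. The function names are mine, and the delta is assumed to be already decompressed.

```python
def read_varint(buf: bytes, pos: int):
    """Read a little-endian base-128 number, return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos


def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Rebuild an object from its base and a decompressed delta."""
    base_size, pos = read_varint(delta, 0)
    result_size, pos = read_varint(delta, pos)
    if base_size != len(base):
        raise ValueError("delta was made for a different base object")
    out = bytearray()
    while pos < len(delta):
        op = delta[pos]
        pos += 1
        if op & 0x80:
            # Copy instruction: the low bits say which offset/size bytes follow.
            offset = 0
            size = 0
            for i in range(4):
                if op & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if op & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            if size == 0:
                size = 0x10000
            out += base[offset:offset + size]
        elif op:
            # Insert instruction: the next `op` bytes are new literal data.
            out += delta[pos:pos + op]
            pos += op
        else:
            raise ValueError("instruction byte 0 is reserved")
    if len(out) != result_size:
        raise ValueError("reconstructed object has the wrong size")
    return bytes(out)
```

For a one-word change, the delta is essentially two copy instructions around one tiny insert, which is why it is so much smaller than a second full copy.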

Short recap: when adding an object, git stores it as a loose object, meaning just the compressed content of the object, with the hash of the object (a short header plus the uncompressed content) as file name. When `git gc` runs, git may collect those loose objects and put them into pack files, which store the objects far more efficiently.

But how are those pack files created? This is the most elusive part of the documentation. You have to be on point with your Google search skills to find anything about it, and all you will find is basically a Git IRC log. That just underscores the point that this is by no means trivial. No matter how many times I read it, at some point I get confused about where it is going, but I will try my best to get my point across.

It comes down to some heuristics. To create the pack files with all their deltas, the objects are sorted by things like the type of the object, the name (file name or directory name) of the object when it was first created, and the size of the object in descending order. The sort by type exists because objects should not be diffed against an object of a different type. Sorting by size in descending order is probably meant to produce smaller diffs: we can assume that we then mostly have removals, which are cheaper to store than additions, as additions always need to contain all the added data. The sort by file name is actually not entirely clear to me, maybe someone can clarify this in the comments? Then the base objects are written into the pack file (again, compressed), diffs are generated against the base objects, and so on.
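As a very rough model of the ordering described above, and nothing more than that, consider the following Python sketch. The `Candidate` class, the sort key and the `window` parameter are simplifications I made up for illustration. Real git uses more criteria, and the window of 10 only mirrors what I believe is git's default setting.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    kind: str   # "blob", "tree" or "commit"
    name: str   # name the object was first seen under
    size: int   # uncompressed size in bytes


def delta_order(objects: List[Candidate]) -> List[Candidate]:
    # Group by type, then by name, then put the largest objects first.
    return sorted(objects, key=lambda o: (o.kind, o.name, -o.size))


def possible_bases(ordered: List[Candidate], index: int,
                   window: int = 10) -> List[Candidate]:
    # Only the few objects sorted directly before the target are tried
    # as delta bases, and only if they have the same type.
    target = ordered[index]
    start = max(0, index - window)
    return [o for o in ordered[start:index] if o.kind == target.kind]


if __name__ == "__main__":
    objs = [
        Candidate("blob", "README", 100),
        Candidate("tree", "src", 40),
        Candidate("blob", "README", 250),
        Candidate("blob", "main.c", 900),
    ]
    for obj in delta_order(objs):
        print(obj)
```

With this ordering, the two versions of `README` end up next to each other, with the larger one first, so the smaller one can be expressed as a delta against it.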

These heuristics are pretty good, as most of us see in our daily work. The repository sizes are very small most of the time, network transmissions are pretty fast, and git works very reliably. All in all, a great version control system with interesting design choices. I will describe other version control systems in a similar manner in later blog posts, and after that compare their design choices for storing file history and go into detail on the limitations of each system.

The downside I see with pack files is that it will be very hard with the available documentation to create a fully featured git client that will also create those pack files, at least in the same way git does. Yes, the source code of git is always available, but I often hear people say that git is so great because of the simplicity in terms of file storage and how easy it is to basically write an implementation from scratch, just from the documentation or some YouTube videos. What do you think about this, did you know about pack-files and how they store files differently than the logical git object model (with loose objects)? Am I missing an implementation that actually generates pack-files as well?
