The git object model is actually pretty easy to grasp: every commit and tree is just a small text file containing metadata, which is SHA-1 hashed and then compressed with zlib. The file name will be the previously generated hash. So let’s create a repository and commit something!
mkdir test
cd test
git init
git commit --allow-empty -m "first commit"
In this example our commit got the hash 72b96b7ef956e788ace7b40736a952a8c0149755. We can use git cat-file to see the actual content of the commit file.
git cat-file -p 72b96b7ef956e788ace7b40736a952a8c0149755
tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
author Tim Heinrich <tim.heinrich.de@gmail.com> 1599894489 +0200
committer Tim Heinrich <tim.heinrich.de@gmail.com> 1599894489 +0200

first commit
Let’s try to recreate this commit using C#! First, to make it easy for us, we will just read the content of the commit into a byte array, hash the byte array, compress the content and write it back. That should be easy enough, and we will then verify that it works by using git cat-file again. The commit file is stored in .git/objects/72: git uses the first two characters of the hash as the folder name, so the file itself is just called b96b7ef956e788ace7b40736a952a8c0149755. Rename it to b96b7ef956e788ace7b40736a952a8c0149755.copy and git will no longer be able to see it.
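One detail that is easy to miss: the hash is not computed over the compressed bytes. Git prepends a header of the form "commit <content length>\0" to the content and SHA-1 hashes header plus content; the hex digest then names the object. A minimal sketch of how that could look (HashObject is a hypothetical helper, not part of the code we are about to write):

using System.Linq;
using System.Security.Cryptography;
using System.Text;

// git hashes "<type> <length>\0<content>" - the uncompressed bytes, not the zlib output
static string HashObject(string type, byte[] content)
{
    var header = Encoding.ASCII.GetBytes($"{type} {content.Length}\0");
    var store = header.Concat(content).ToArray();
    using var sha1 = SHA1.Create();
    var hash = sha1.ComputeHash(store);
    return string.Concat(hash.Select(b => b.ToString("x2")));
}

// The first two characters pick the folder, the rest is the file name:
// var hash = HashObject("commit", commitBytes);
// var path = $".git/objects/{hash[..2]}/{hash[2..]}";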
The first challenge we face is decompressing the file. There is a DeflateStream in the .NET framework, but it does not seem to be able to compress or decompress the files git writes. The internet will now tell us to use SharpZipLib, which works like a charm. So we add the NuGet package to our project and start by reading and unpacking the commit object, storing the text in a string, and then writing it back just the way we read it:
using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.Zip.Compression.Streams;

string text;
var pathToObject = "/VCS/gittest/.git/objects/72/b96b7ef956e788ace7b40736a952a8c0149755";
var compressedBytes = File.ReadAllBytes(pathToObject + ".copy");
// Decompress the renamed copy and read its content as text
using (var memoryStream = new MemoryStream(compressedBytes))
using (var inflaterInputStream = new InflaterInputStream(memoryStream))
using (var reader = new StreamReader(inflaterInputStream, Encoding.UTF8))
    text = reader.ReadToEnd();
// Compress the text again and write it back under the original name
using (var fileStream = new FileStream(pathToObject, FileMode.Create))
using (var deflaterStream = new DeflaterOutputStream(fileStream))
{
    var bytes = Encoding.UTF8.GetBytes(text);
    deflaterStream.Write(bytes, 0, bytes.Length);
}
We verify that it works by running our previously used git cat-file again, and everything is good.
Now that we have a working version that can write git objects, the fun begins! When working on GitRewrite, at some point the major performance bottleneck in the application was actually compressing and decompressing these objects. So what can we do? The first thing we saw in the profiler was that the streams create a new Inflater instance on every call, which is very expensive. I tried to cache those instances (roughly along the lines of the sketch below), but then ran into issues because it was not thread-safe anymore. I tried a lot, but really could not get SharpZipLib to perform well. So let’s look for faster alternatives!
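For reference, the caching idea looked roughly like this. A hypothetical sketch, not actual GitRewrite code; it relies on the fact that SharpZipLib’s Inflater has a Reset method and that InflaterInputStream accepts an existing Inflater:

using System.Collections.Concurrent;
using System.IO;
using ICSharpCode.SharpZipLib.Zip.Compression;
using ICSharpCode.SharpZipLib.Zip.Compression.Streams;

// Reuse Inflater instances instead of allocating one per object.
// The pool itself is thread-safe; the trouble starts as soon as
// one Inflater instance is accidentally shared between threads.
static class InflaterPool
{
    private static readonly ConcurrentBag<Inflater> Pool = new ConcurrentBag<Inflater>();

    public static byte[] Decompress(byte[] compressed)
    {
        if (!Pool.TryTake(out var inflater))
            inflater = new Inflater();
        try
        {
            using var input = new InflaterInputStream(new MemoryStream(compressed), inflater);
            using var output = new MemoryStream();
            input.CopyTo(output);
            return output.ToArray();
        }
        finally
        {
            inflater.Reset(); // make the instance reusable
            Pool.Add(inflater);
        }
    }
}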
I tested lots of libraries to improve the performance, but the only one that delivered the performance I wanted was zlibnet, so that became my reference. The library had a few problems for me though, the biggest being that it is basically just a wrapper around native Windows libraries. This means no Linux support, which was a major drawback for me.
Situations like this are always dangerous for me, as I know zlibnet can be faster, so it should be possible to have a similar library in C#. I may get a little obsessed over stuff like this and won’t be able to sleep well until it is solved, one way or another. Soooo, I did a lot of research.
I mentioned DeflateStream earlier, and I put in some work to see what was or wasn’t going to work with it. I compared the files generated by SharpZipLib and DeflateStream, and something was obvious: the file written by DeflateStream was shorter, but all of its content was contained in the file generated by SharpZipLib. There were just a few additional bytes at the beginning and the end of the file. Which means we are not doing something wrong, we are just missing something!
This was when I knew roughly what to google for, and I finally found out that DeflateStream writes a raw deflate stream, while git expects the zlib format (RFC 1950), which appends an Adler-32 checksum of the uncompressed data to the end of the file. There seemed to be lots of examples of how to calculate it.
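The textbook version is only a few lines. Here is a sketch of it, shaped to match the Adler32Computer helper used in the final code below (the actual helper in GitRewrite may look different):

// Adler-32 as defined in RFC 1950: two running sums modulo 65521,
// computed over the uncompressed data.
public static class Adler32Computer
{
    private const uint Base = 65521; // largest prime smaller than 65536

    public static uint Checksum(byte[] data)
    {
        uint a = 1, b = 0;
        foreach (var value in data)
        {
            a = (a + value) % Base;
            b = (b + a) % Base;
        }
        return (b << 16) | a;
    }
}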
Next test, git still cannot read the file… Comparing the files again showed that two bytes were missing at the beginning. I had absolutely forgotten what they are for (they are the two-byte zlib stream header), but adding them finally resulted in a valid git object! The final result looks like this:
using System;
using System.IO;
using System.IO.Compression;

using (var fileStream = new FileStream(filePath, FileMode.CreateNew, FileAccess.Write))
{
    // The two-byte zlib header that DeflateStream does not write
    fileStream.Write(new byte[] {0x78, 0x5E}, 0, 2);
    // leaveOpen: true, because we still need to append the checksum afterwards
    using (var stream = new DeflateStream(fileStream, CompressionMode.Compress, true))
    {
        stream.Write(bytes, 0, bytes.Length);
    }
    // zlib expects the Adler-32 of the uncompressed data, big-endian, at the end
    var checksum = Adler32Computer.Checksum(bytes);
    var checksumBytes = BitConverter.GetBytes(checksum);
    if (BitConverter.IsLittleEndian)
        Array.Reverse(checksumBytes);
    fileStream.Write(checksumBytes, 0, checksumBytes.Length);
}
This is way faster than SharpZipLib and even faster than zlibnet, which already yielded acceptable results! As an added benefit we even have less memory traffic. Now if this wasn’t amazing enough, I noticed something else to improve on. In the performance profiler, most of the time spent compressing a file now went into calculating our little Adler-32 checksum. A few search queries later I stumbled upon a chromium issue describing something similar. The zlib implementation contains some serious improvements to the adler32 algorithm, but nothing like it was to be found for C# developers, so let’s just take the C code and port it to C#! Yes, some of the ugliest, most unreadable lines of code I have ever written, and if that wasn’t enough, unsafe code. But believe me, in this case the tradeoff is worth it.
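To give you an idea, here is a rough sketch of the core trick from zlib’s adler32.c, not the actual port in GitRewrite: instead of reducing modulo 65521 after every byte, zlib processes blocks of up to 5552 bytes (the largest count for which the 32-bit sums cannot overflow), unrolls the inner loop, and defers the modulo to the end of each block:

// Sketch of the zlib-style Adler-32: defer the expensive "% 65521"
// to once per block and unroll the inner loop.
public static class FastAdler32
{
    private const uint Base = 65521; // largest prime smaller than 65536
    private const int NMax = 5552;   // max bytes before the 32-bit sums could overflow

    public static unsafe uint Checksum(byte[] data)
    {
        uint a = 1, b = 0;
        fixed (byte* start = data)
        {
            var p = start;
            var remaining = data.Length;
            while (remaining > 0)
            {
                var n = remaining < NMax ? remaining : NMax;
                remaining -= n;
                // unrolled by 8, in the spirit of zlib's DO8/DO16 macros
                while (n >= 8)
                {
                    a += p[0]; b += a; a += p[1]; b += a;
                    a += p[2]; b += a; a += p[3]; b += a;
                    a += p[4]; b += a; a += p[5]; b += a;
                    a += p[6]; b += a; a += p[7]; b += a;
                    p += 8; n -= 8;
                }
                while (n-- > 0)
                {
                    a += *p++; b += a;
                }
                a %= Base; // one modulo per block instead of per byte
                b %= Base;
            }
        }
        return (b << 16) | a;
    }
}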
These improvements made GitRewrite a lot faster. I don’t really have benchmarks anymore, but for super huge repositories the time to rewrite them went down from something like 30 minutes to seconds (sidenote: these are repositories where git filter-branch would take weeks). Of course, writing git objects was not the only improvement made, but it was certainly the most significant. I hope you liked this story about me obsessing over simply writing out a compressed file! Because that’s what it should be, three lines of code, not this… abomination of an algorithm and some magic numbers in the byte stream 😀 Did you ever obsess over something that should be simple? Let me know in the comments!