This is a story about me trying to write up a somewhat obscure Git feature and falling down a long and winding rabbit hole that ended in the world's tiniest patch to the Git codebase. Strap in.
What is bundle-uri?
This all started with me trying to figure out the state of a feature that I had heard about from GitLab's Gitaly project called bundle-uri, which allows you to, in theory, speed up a clone by downloading a cached file that seeds your project's data before doing a CPU-intensive fetch from the server.
In a nutshell, when you clone a Git repository, the git clone command starts a conversation with the Git server you're connecting to in order to figure out what's available on the server and what the client would like to get.
Most clones of a repository end up transferring very similar data, but every single time, the Git client and server go through this (sometimes quite complex and expensive) negotiation dance.
A few years ago (v2.38 in late 2022), Git gained the ability to provide a URL to a pre-calculated starting point for the repository that can be served from a simple HTTP file server, which means this seed data can be on a fast, distributed CDN.
Once you have that seed data, let's say 95% of your project data, it will then do the more expensive negotiation dance with the server to figure out what the remainder is, possibly saving a lot of time.
What's more, you can even provide a local file as the starting point, say something on a filesystem mount in a VM or a cloud cache, which really speeds things up.
Does it make clones faster?
The simple answer is yes, no, and maybe.
yes?
If you use the local file option, it can make clones much faster. This could be a situation where you're firing up a VM that has a mount point and want to do a fresh clone of a repository each time.
You can point it at some bundle file on the mount as the starting point and then the subsequent fetch can be much smaller and faster while still being completely up to date with the current state of the server.
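To make that concrete, here's a rough sketch of what that setup could look like - the mount path and repository URL are made up for illustration:
❯ git bundle create /mnt/cache/myrepo.bundle --all     # refreshed periodically from a machine that has the repo
❯ git clone --bundle-uri=/mnt/cache/myrepo.bundle https://git.example.com/org/myrepo.git work
The clone seeds itself from the bundle on the mount and then finishes with a (hopefully much smaller) fetch from the real server.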
no?
This is where I went down the rabbit hole. In theory, my understanding was that if I bundle up a repository and stick that file on a fast CDN that's very globally close to me (Berlin) and then use this feature, it would have to be faster than a fresh clone from GitHub or GitLab.
It's the same amount of data, closer to me, served from a static HTTP server rather than a dynamically generated packfile from a forge's Git server process. How can that possibly not be faster?
But it wasn't.
I started by taking a fairly large repo with lots of branches and tags, the GitLab Community Edition (gitlab-foss), and cloning it as a benchmark.
❯ time git clone https://gitlab.com/gitlab-org/gitlab-foss.git g1
Cloning into 'g1'...
remote: Enumerating objects: 3005985, done.
remote: Counting objects: 100% (314617/314617), done.
remote: Compressing objects: 100% (64278/64278), done.
remote: Total 3005985 (delta 244429), reused 311002 (delta 241404),
pack-reused 2691368 (from 1)
Receiving objects: 100% (3005985/3005985), 1.35 GiB | 23.91 MiB/s
Resolving deltas: 100% (2361484/2361484), done.
Updating files: 100% (59972/59972), done.
(*) 162.93s user 37.94s system 128% cpu 2:36.49 total
OK, it's enumerating 3 million objects, building me a 1.35 GiB packfile on the server, and downloading it at 24 MiB/s (thanks, Berlin), taking a total of 2 minutes and 36 seconds, all in.
Let's try to speed this up by putting all those objects into a bundle file, using the newer bundle-uri option to git clone to pull that from a CDN, then doing what should be basically a no-op fetch.
The first step is to bundle up the objects.
Git Bundle
If you're not familiar with bundle files, they come from an incredibly old command introduced in Git 1.5.1 (in 2007 - 18 years ago) called git bundle, the purpose of which, as stated in the release notes, was to make "sneakernetting" easier - putting the repository on a USB stick and walking it around the office.
Basically it produces a repository in a file - sort of a packfile with a list of references as a pre-header. You can read all about it in my old Pro Git book if you want to dig into them a bit more.
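If you want to peek inside one of these files, the same command can verify and list what a bundle contains - a quick sketch, with a placeholder filename:
❯ git bundle verify repo.bundle        # checks the bundle and prints the references (and any prerequisites) it carries
❯ git bundle list-heads repo.bundle    # just lists the references recorded in the bundle's header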
So, let's make a repository in a file of the current state of the GitLab codebase, which we can do with git bundle create [file] --all.
❯ time git bundle create gitlab-base.bundle --all
Enumerating objects: 3005710, done.
Counting objects: 100% (3005710/3005710), done.
Delta compression using up to 8 threads
Compressing objects: 100% (582467/582467), done.
Writing objects: 100% (3005710/3005710), 1.35 GiB | 194.33 MiB/s, done.
Total 3005710 (delta 2361291), reused 3005710 (delta 2361291), pack-reused 0 (from 0)
(*) 17.31s user 3.03s system 84% cpu 24.199 total
OK, now we have a single binary file containing all 1.35 GiB of our 3 million Git objects.
We throw that on a CDN (in this case Bunny CDN, which has a hop in Frankfurt) and clone again with git clone --bundle-uri=[file-url] [canonical-repo].
❯ time git clone --bundle-uri=https://[cdn]/bundle/gitlab-base.bundle https://gitlab.com/gitlab-org/gitlab-foss.git g2
Cloning into 'g2'...
remote: Enumerating objects: 1092703, done.
remote: Counting objects: 100% (973405/973405), done.
remote: Compressing objects: 100% (385827/385827), done.
remote: Total 959773 (delta 710976), reused 766809 (delta 554276),
pack-reused 0 (from 0)
Receiving objects: 100% (959773/959773), 366.94 MiB | 20.87 MiB/s
Resolving deltas: 100% (710976/710976),
completed with 9081 local objects.
Checking objects: 100% (4194304/4194304), done.
Checking connectivity: 959668, done.
Updating files: 100% (59972/59972), done.
(*) 181.98s user 40.23s system 110% cpu 3:20.89 total
The first thing that you'll notice is that this actually took more time - 3:20 vs the fresh clone's 2:36.
The next thing to notice is that it still had to download 959,773 objects - about 32% of the original clone. So, fewer objects in that phase, but since it had already downloaded all of the objects in the bundle first, why did it still need nearly another million? 🤔
maybe?
It took me quite a while of testing and digging, but it turns out that when Git unpacks the bundle file, it only copies the local branch references out. So if you pack with --all, indicating that you want every reachable object, it will create a big packfile with everything.
However, when the bundle is downloaded and unpacked, Git will only actually use the objects that were pointed to by a local branch (which, right after a clone, is probably only master/main) to negotiate with the server for which objects it still needs. Everything else appears unreachable.
This means that the 1,000,000 objects we fetched were all re-downloaded - we actually already had nearly all of them, but Git wasn't aware they were there.
So, I dug into the code and found where this is happening.
It turns out, Git only copies the refs/heads references (branches) into the refs/bundle space, ignoring all other references in the bundle file.
Why is this important? Because the stuff in refs/ is what Git uses to negotiate with the server about what it has and doesn't have. If we only have 1 out of 1,000 of the references we used to create the bundle file, Git will tell the server that this is the only thing we have.
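You can actually see this for yourself by comparing what the bundle advertises with what the un-patched clone copied into its local ref space - a quick sanity check using the clone from above (grepping for "bundle" rather than guessing the exact ref namespace):
❯ git bundle list-heads gitlab-base.bundle | wc -l     # references recorded in the bundle (branches, tags, etc.)
❯ git -C g2 for-each-ref | grep -c bundle              # references Git actually copied out of the bundle during the clone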
If I change this to copy over everything in refs/ (tags and remote references too), recompile Git, and run this clone command again, I get:
❯ time ./git clone --bundle-uri=https://[cdn]/bundle/gitlab-base.bundle https://gitlab.com/gitlab-org/gitlab-foss.git g3
Cloning into 'g3'...
remote: Enumerating objects: 65538, done.
remote: Counting objects: 100% (56054/56054), done.
remote: Compressing objects: 100% (28950/28950), done.
remote: Total 43877 (delta 27401), reused 25170 (delta 13546), pack-reused 0 (from 0)
Receiving objects: 100% (43877/43877), 40.42 MiB | 22.27 MiB/s, done.
Resolving deltas: 100% (27401/27401), completed with 8564 local objects.
Updating files: 100% (59972/59972), done.
(*) 143.45s user 29.33s system 124% cpu 2:19.27 total
Awesome, now not only is it faster than the original clone (2:19 vs 2:36), it's also only downloading the delta between the data I bundled and what has been pushed to the server in the meantime. In this case, we downloaded an extra 43,877 objects in our fetch, or a mere 1% of the repository's total object count.
This has resulted in a contender for the world's smallest open source patch:

This patch is in flight on the Git mailing list, though it's still an open question whether this gets addressed on the bundle-creation side or the clone side. Either way, hopefully a future version of Git will fix this in some way.
Should I use this?
So, there are a couple of answers to this as well - "possibly" and "you won't have a choice".
The real beneficiaries of this feature are the forges - GitHub, GitLab, etc. - because it has the promise to massively reduce server CPU load. If their Git server processes aren't having to compute massive packfiles every time someone clones, but can instead offload huge amounts of that work to a cheap, fast, globally distributed CDN, that can save them a lot of money and server resources.
But can it be helpful to you?
possibly
I can see two use cases that can be helpful for normies.
If you're running your own git server internally, perhaps Gitosis or GitLab, this can help reduce server load, especially if you're doing a lot of clones. This is why GitLab has an experimental feature for this.
If you're running some sort of automated setup that needs a full clone often, maybe a CI or automated testing system where you don't want to run a shallow clone for some reason, the local file version of this (reading the seed bundle from a mount or NFS point) could be really helpful.
For example, if I do the same clone from a local bundle file:
❯ time ./git clone --bundle-uri=/tmp/gitlab-base.bundle https://gitlab.com/gitlab-org/gitlab-foss.git g4
Cloning into 'g4'...
remote: Enumerating objects: 69832, done.
remote: Counting objects: 100% (52229/52229), done.
remote: Compressing objects: 100% (21467/21467), done.
remote: Total 33625 (delta 23410), reused 17620 (delta 11061), pack-reused 0 (from 0)
Receiving objects: 100% (33625/33625), 28.39 MiB | 20.50 MiB/s, done.
Resolving deltas: 100% (23410/23410), completed with 9321 local objects.
Updating files: 100% (60218/60218), done.
(*) 93.29s user 16.84s system 189% cpu 58.103 total
Now it takes less than 1 minute for the full clone and I'm totally up to date with the upstream server. (I actually don't know why this takes so long - copying the bundle file should be very fast, but it takes 30s or so - I might want to dig into this next...)
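If I do dig in, Git's trace2 instrumentation is probably where I'd start - something like this (the trace file path is arbitrary) records per-region timings for the whole clone:
❯ GIT_TRACE2_PERF=/tmp/clone-trace.txt git clone --bundle-uri=/tmp/gitlab-base.bundle https://gitlab.com/gitlab-org/gitlab-foss.git g5
❯ grep region /tmp/clone-trace.txt | less              # see which regions account for the mystery ~30 seconds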
you won't have a choice
The more probable answer is "you won't have a choice", because your Git client will do it automatically.
While I spent most of this article talking about manually specifying a bundle URL in git clone, that is not really how this is meant to be used in practice (other than the CI-type use case described above).
In newer versions of the Git server protocols, the server itself can advertise bundle URLs. This means that the server can tell the Git client "go get this seed file first, then come back to me".
Unless Git has been configured specifically to not use them, it will attempt to download bundle seeds if the server tells it where they are. So most likely, if GitHub implements this feature someday, your Git client will start downloading bundle files from an Azure CDN before taking up GitHub server resources on packfile creation.
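For the curious, the advertisement is driven by configuration on the server side, with a client-side knob controlling whether to follow it. Here's a rough sketch based on my reading of Git's bundle-uri documentation - treat the exact key names as something to verify against your Git version, and the URL is made up:
❯ git config uploadpack.advertiseBundleURIs true       # server: advertise bundles to cloning clients
❯ git config bundle.version 1                          # server: the bundle "list" this repo advertises
❯ git config bundle.mode all
❯ git config bundle.base.uri https://cdn.example.com/bundles/repo-base.bundle
❯ git config --global transfer.bundleURI true          # client: whether to follow advertised bundle URIs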
Hopefully we can get some version of this fix landed (or something written by someone smarter than me), so your client isn't then downloading more data than it really needs to and wasting Azure's bandwidth. 😉