A few months ago I read about Anthropic's experiment of releasing a swarm of agents to write a functioning C compiler, and it made me wonder about the feasibility of using that same approach to accomplish something I've been dreaming about for 15 years now, since not long after we started GitHub: rewriting Git from scratch to be library based.
Git is of course a very complex piece of software. There are lots of "plumbing" commands, lots of higher level commands - it was put together incrementally by thousands of people over the last 20 years for projects large and small. It was never based on a linkable and reentrant library, but instead on a "Unix" philosophy of chaining together simpler commands, which means that it's difficult to use it in long running processes without fork/exec overhead for everything.
However, interestingly, there is a very comprehensive test suite of over 42,000 tests in more than 1,400 scripts in the Git project that define pretty solidly how everything should and should not work.
What if we used the same basic idea that Anthropic used on their from-scratch C compiler? Start a brand new implementation, design it as a Rust library, then throw a swarm of agents at the problem and just keep pounding away at it until all the tests pass?
Well, I did that for the last few months, on and off, and the result is Grit, a from-scratch, library-based, memory-safe, idiomatic Rust reimplementation of Git that passes over 99% of the entire Git test suite.

Achtung! While Grit passes the tests, it's not tested. Nobody has used this for anything real yet. If you play with it, be warned that there is a high probability currently that it will do the wrong thing and may even corrupt stuff. Use at your own risk. But if you find something, let us know so we can fix it.
Why would anyone do this?
Unlike the Anthropic experiment, this wasn't just to see if it could be done. When we started, I figured that if it worked, we might have something actually pretty useful in ways that C Git is problematic. But we'll get to that in a minute.
What was I trying to get out of this?
What I didn't want was a pure port of C Git in Rust. In fact, the more I dug into this, I'm not even sure I should have replicated every decision that has ever been made in Git, but that's something we can work on now that the original goal is accomplished.
What I did want was a pure-Rust core library that can faithfully interact with Git repositories, canonically. Reentrant, linkable, modular and comprehensive. Then as a way to ensure the comprehensiveness, an independent crate implementing a CLI surface that uses that library in order to pass as much of the Git test suite as possible.
This is what we've done.
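To make the "linkable library" idea a bit more concrete, here is a minimal sketch of what doing Git work in-process looks like: computing the object ID of a blob (the value git hash-object prints for a SHA-1 repository) without any fork/exec. The names sha1 and blob_oid are hypothetical and are not Grit's API. SHA-1 is written out by hand only so the example has no dependencies; a real library would presumably use a vetted hashing crate.

```rust
/// Minimal SHA-1 (FIPS 180-4), hand-written so the sketch is dependency-free.
/// Hypothetical helper, not part of Grit.
fn sha1(data: &[u8]) -> [u8; 20] {
    let mut h: [u32; 5] = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0];

    // Padding: a 0x80 byte, zeros up to 56 mod 64, then the bit length as a big-endian u64.
    let mut msg = data.to_vec();
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&((data.len() as u64) * 8).to_be_bytes());

    for block in msg.chunks(64) {
        // Message schedule: 16 words from the block, expanded to 80.
        let mut w = [0u32; 80];
        for i in 0..16 {
            w[i] = u32::from_be_bytes([
                block[4 * i],
                block[4 * i + 1],
                block[4 * i + 2],
                block[4 * i + 3],
            ]);
        }
        for i in 16..80 {
            w[i] = (w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]).rotate_left(1);
        }

        let [mut a, mut b, mut c, mut d, mut e] = h;
        for i in 0..80 {
            let (f, k) = match i {
                0..=19 => ((b & c) | (!b & d), 0x5A827999u32),
                20..=39 => (b ^ c ^ d, 0x6ED9EBA1),
                40..=59 => ((b & c) | (b & d) | (c & d), 0x8F1BBCDC),
                _ => (b ^ c ^ d, 0xCA62C1D6),
            };
            let t = a
                .rotate_left(5)
                .wrapping_add(f)
                .wrapping_add(e)
                .wrapping_add(k)
                .wrapping_add(w[i]);
            e = d;
            d = c;
            c = b.rotate_left(30);
            b = a;
            a = t;
        }

        h[0] = h[0].wrapping_add(a);
        h[1] = h[1].wrapping_add(b);
        h[2] = h[2].wrapping_add(c);
        h[3] = h[3].wrapping_add(d);
        h[4] = h[4].wrapping_add(e);
    }

    let mut out = [0u8; 20];
    for (i, word) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}

/// Object ID of a blob in a SHA-1 repository: the hash of the header
/// "blob <len>\0" followed by the content, rendered as lowercase hex.
fn blob_oid(content: &[u8]) -> String {
    let mut buf = format!("blob {}\0", content.len()).into_bytes();
    buf.extend_from_slice(content);
    sha1(&buf).iter().map(|b| format!("{:02x}", b)).collect()
}
```

A long-running process can call something like blob_oid(b"hello world\n") as often as it wants without spawning a child process each time, which is the kind of overhead the chained-commands design makes hard to avoid.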
Is it perfect?
Well, no, but it is interesting and arguably already useful.
First, some caveats.
It's not actually passing every single test, though that is on purpose. I did mark some parts of the testing suite as "skipped" because I don't think it's worth recreating them in a library like this - email related stuff, i18n, perforce/svn importers, some of the midx/bitmap stuff - things of that nature. However, for everything that I'm sure is relevant to nearly anyone reading this, the Grit library/CLI can now fully pass the Git test suite.
Does this mean that it's perfect? Nah. It's still pretty slow (in some cases exponentially), there are some untested things that it can't do, the API isn't super clean, there is no Windows build, etc. This is the first milestone of a first pass.
However, it's a pretty interesting starting point for a few months and a few billion tokens worth of work.
Show me the money!!
What could we do with this? It was an interesting project, but we didn't do it just to see if it could be done. I think that Grit can easily be developed into something pretty useful.
One of the main things I would like to use it for is bundling complex push/fetch functionality into GitButler and other standalone Git tools that need network functionality (such as Jujutsu).
Currently, the networking functionality in both Gitoxide and libgit2 is partial, slow or non-existent. Both GitButler and Jujutsu rely on forking out to Git in order to push or pull data. A big reason for this is the incredibly complicated credential logic involved, but all of this is (theoretically) currently covered in Grit.
Another possible use case is a WASM build that could be used to do a super wide range of interesting things. Run nearly any Git command in an edge Vercel function for example. Or maybe you could build things like Cloudflare Artifacts without relying on partial implementations like isomorphic-git but instead a fully compliant WASM build of Grit.
Having parts of Git as discrete, embeddable slices of library also enables things like building custom Git servers or client functionality in Rust.
Embed all of Git (or perhaps just the Git you need) at a known version, natively, into things like agent desktop builds or editors like Zed.
The full build of all Git functionality in Rust is currently around 27M, but since a large part of it is a library, it could clearly be easily split up into domains of functionality - subcrates that do specific things. Perhaps you could simply use the subset you need.
Not all of this is possible with the Grit of today, but I believe this milestone proves that it's definitely within reach with a little more work.
You had me at WASM...
Safety
Before we dig into the details, it's interesting to note that nearly all of the code is memory safe.
There is essentially one module (date/time) that must talk to C via FFI glue, plus one TTY check. (Apparently, there's no pure-Rust equivalent for localtime_r / strftime / mktime honoring the TZ environment, so it seems that FFI for that is unavoidable.)
Everything else in Grit is safe Rust.
How did we build it?
It's actually been a bit of an agentic little journey. I thought at first that I would be able to define some sort of agent file and run a bunch of agents in loops that would get everything done in a few days.
Turns out it does not work that way, at least on a project of this complexity. Or perhaps it's that I'm not very good at it, but it's a pretty expensive lesson to learn and pretty frustratingly non-deterministic, clearly.
Instead of a rundown of every step of this, it might be more interesting to just do a TLDR of some things I learned when working on this.
Agents love to cheat
If you're telling an agent "make these Git tests pass", it's very tempting for the agent to write a simple function that just passes through to Git to do it. Took me a couple times of seeing way too many passing tests too quickly to realize it was cheating and refine the AGENTS file to really get it to stop doing that.
It's like giving wishes as a genie. You gotta be super explicit with the ground rules. No wishing for more wishes, dammit.
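One way to back up those ground rules mechanically, rather than relying only on the AGENTS file, is a crude check that flags source lines which appear to shell out to Git. This is only a sketch under stated assumptions: the function name find_git_passthrough and the two patterns are hypothetical, it is not something the project is known to use, and a plain text match like this is easy to evade.

```rust
/// Returns (1-based line number, trimmed line) for every source line that
/// appears to spawn the real `git` binary. Hypothetical helper for illustration.
fn find_git_passthrough(source: &str) -> Vec<(usize, String)> {
    // Assumed patterns; a determined agent could hide the call in other ways.
    const FORBIDDEN: [&str; 2] = [
        "Command::new(\"git\")",
        "Command::new(\"/usr/bin/git\")",
    ];

    source
        .lines()
        .enumerate()
        .filter(|(_, line)| FORBIDDEN.iter().any(|pat| line.contains(*pat)))
        .map(|(idx, line)| (idx + 1, line.trim().to_string()))
        .collect()
}
```

Run over every file in a change before it is merged, a non-empty result would be a reason to look closer at a suspiciously fast jump in passing tests.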
One good example is sha256 support. There are a few places in the test suite that exercise it, but the LLMs realized that if they implemented just enough for those tests to run, they could pass without actually implementing sha256 support.
When I realized Grit actually didn't work at all in a sha256 initialized repository and asked Claude about it, this is what it reported:
A handful of tests genuinely run git init --object-format=sha256 (t0001-init, t1900-repo-info, t0610-reftable-basics). These do run and do pass - but look at what they assert (t0001:519-587): every one only checks rev-parse --show-object-format reports sha256, i.e. that init wrote extensions.objectformat=sha256 to the config. grit's init.rs handles that metadata correctly (is_known_object_format accepts "sha256", lines 869-870). None of them ever add, commit, or log in that repo.
So it looks at what is tested and makes that work while actually just doing normal sha1 stuff. It doesn't actually think "I should probably implement actual sha256 support".
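The gap is easy to see in code. Recording the format in the config is one thing; producing object IDs of that format is another. Below is a rough sketch of the kind of check those tests never made. The type ObjectFormat and the function oid_matches_format are hypothetical names, not Grit's actual code; the only facts relied on are that SHA-1 IDs are 40 hex characters and SHA-256 IDs are 64.

```rust
/// The two object formats Git can record in extensions.objectformat.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ObjectFormat {
    Sha1,
    Sha256,
}

impl ObjectFormat {
    /// Parse the config value written at init time.
    fn from_config(value: &str) -> Option<Self> {
        match value {
            "sha1" => Some(ObjectFormat::Sha1),
            "sha256" => Some(ObjectFormat::Sha256),
            _ => None,
        }
    }

    /// Length of a hex object ID in this format.
    fn hex_len(self) -> usize {
        match self {
            ObjectFormat::Sha1 => 40,
            ObjectFormat::Sha256 => 64,
        }
    }
}

/// True if `oid` has the shape of an object ID in the given format.
fn oid_matches_format(oid: &str, format: ObjectFormat) -> bool {
    oid.len() == format.hex_len() && oid.bytes().all(|b| b.is_ascii_hexdigit())
}
```

A repository whose config says sha256 but whose commits come out as 40-character IDs would fail a check like this immediately, which is exactly the situation the init-only tests could not detect.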
Agents don't know when they break things
The reason this project took more than a few weeks is that I nearly gave up halfway through because one of a group of parallel agents broke a fundamental part of the testing harness and it looked like a massive regression.
I thought that too much parallel work was causing more harm than good and maybe the project was just impossibly expensive to accomplish, so I gave up on it almost entirely for a time.
Here is the rough timeline from April 1st, when I started, to today with the percentage of the test suite passing over that time.
The purple bars are the numbers of commits per day, so you can see the two tranches of effort - a week-ish in early April and a week-ish in early June.

The dotted line is the reported percentage passing. You can see a huge drop in mid-April, which caused a drop in my interest in the project.
In early June I picked it back up to try to salvage the work into something simpler. While working on it, one of the agents found the mistake and fixed the testing harness, which jumped the passing percentage back up to around 80% and prompted me to try to finish it.
It's surprisingly difficult to long term multitask
I've done stuff with OpenClaw and Ralph before. I've also run parallel agents on lots of occasions. What is more difficult than I anticipated is the combination of long running and parallel.
Coordination
It's probably not a common problem, since it's pretty expensive to try to do, but having some shared task list that several or dozens of long running agents can hack away at is very difficult, I found.
Especially if you want to drive it a bit too. In this case, you need to pause it, merge everything, change direction and then spawn the team again.
I mostly tried using a plan file with checkboxes that is shared, but it's pretty messy. Probably something like Linear or GitHub issues would be a better way to do coordination, but it's slower, requires network access, authentication and tooling on every client.
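For illustration, the checkbox approach boils down to something like the following sketch. The plan format ("- [ ]" for open, "- [x]" for done, "- [~]" for claimed) and the function name claim_next are assumptions made for this example, not the format actually used on the project.

```rust
/// Claim the first open task in a checkbox plan for `agent`.
/// Returns the task name and the updated plan text, or None if nothing is open.
/// Hypothetical format: "- [ ] task", "- [x] task", "- [~] task (claimed by NAME)".
fn claim_next(plan: &str, agent: &str) -> Option<(String, String)> {
    let mut claimed: Option<String> = None;
    let mut out = Vec::new();

    for line in plan.lines() {
        match line.strip_prefix("- [ ] ") {
            // Only the first open task is claimed; later ones are left untouched.
            Some(task) if claimed.is_none() => {
                claimed = Some(task.to_string());
                out.push(format!("- [~] {} (claimed by {})", task, agent));
            }
            _ => out.push(line.to_string()),
        }
    }

    claimed.map(|task| (task, out.join("\n")))
}
```

The messy part is everything this sketch leaves out: two agents on different machines can read the same plan and claim the same task before either writes it back, so the file still needs some form of locking or merge discipline around it.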
At the end of the project, I started using my Ticgit local ticketing system project so the task list can be easily modified locally and moved around with Git, but that's a different blog post.
Resource management
I probably should have run this on some beefy server somewhere but instead I did it in lots of different places - my laptop, my Mac studio, a Hostinger slice, Cursor cloud agents, etc. The first three each had resource issues at various loads of parallelism. Turns out compiling Rust can get a tad hungrier than I anticipated when you're trying to do many at a time.
The agents themselves tend to be pretty good at debugging and fixing the problem (I ran into swap thrashing, CPU thrashing, etc.) but the problem changed from time to time and was more difficult to manage than I anticipated. Anthropic did their compiler experiment in containers, so maybe some systems planning beforehand would have been a better idea than my yolo approach.
Handoff
Handing off work in progress was constantly an issue. Since I was doing things on multiple systems, some of which was on my laptop so I could actually run and test it, it would have been nice if I could have bundled up where I was and taken it over somewhere else easily.
While some agent harnesses do equivalents of this to various degrees, because I was using several providers (gotta use those subscriptions), there was still a lot of friction here. This is something we're working on providing with GitButler at the VCS layer rather than the harness lock-in layer, so stay tuned for that.
The expense and token usage can add up. Quickly.
You really have to be careful.
I have no idea exactly how much I spent between Cursor and Anthropic (the main providers I used) but it was probably somewhere around $10-15k.
My approach changed several times, partly as I saw how difficult it was to make things run in parallel and run long, and partly as I noticed the cost-to-passing-test ratio change.
I did a lot using OpenClaw with Claude Code via API usage, which was pretty expensive. A huge amount of the work was then done using a bunch of short-lived Cursor cloud agents running composer-2 on single test files. I sort of wrapped up the project using subscription tokens on all systems (Codex, Claude and Cursor in parallel, minding limits).
Token wise, a rough estimate of what this project took would be:
- Claude Code: 14B tokens
- Cursor (GPT/Codex): 12B tokens
- Cursor (composer-2): 16B tokens

The data is all over the place and intermingled with other projects, so it's a little difficult to be precise, but let's say roughly 45B tokens in total. Interestingly, almost half of the project was completed with Cursor's composer-2 model via a ton of short-lived, focused cloud agents.
Some fun approaches
As I said, I took a lot of different approaches to tackling this problem over time. Here are some of the more interesting stabs at the problem and my experience with each.
OpenClaw + Claude Code
I did a lot with OpenClaw running Claude Code subagents at first because I was Ubering around the Bay Area for several hours a day when I first started the project and this was a pretty good way to run it remotely.
However, since I had to use the more expensive per-token API, I ended up spending basically the majority of the cost of this project in just a few days.

In addition to the cost, it was pretty difficult to keep running. I had memory and CPU issues on machines, the Hostinger one just died at some point and I really struggled to get it back online, etc. It was very brittle.
Cursor cloud agents
When I realized that the project would not be done without spending many tens of thousands of dollars more, I changed strategies to utilize some subscription tokens and less expensive models.
The one strategy that probably ended up doing a majority of the work on the project was to spawn a Cursor cloud agent for each file I wanted to work on, then merge them as each was completed.
The problem I ran into here was that it was a very manual process. Turns out when you're building an alternative to Git, some of the tests mess up the environment and then try to use your binary instead of Git, or mess up the credential store that the container's Git access depends on. Which means that the Cursor agent can't use Git to push the generated code out of the container. This is a problem very specific to this project and a huge shout-out to vmg and the Cursor crew for helping me figure out what might be happening.
I never quite figured out how my tests were overriding the env in a way that messed up pushes, but it meant that for a lot of them, I had to get a terminal on the container, add a remote manually and push the commit out. I spent a lot of time manually clicking, copying/pasting - sometimes for a 3 line Rust change.
It wasn't ideal, but it did end up getting a lot of work done in parallel (because I was running so many of them).
Cursor cloud grind mode
This is actually my preferred way to do this, after everything I tried.
I didn't know about this, but our awesome partner Matt at A16Z told me about it during my podcast recording there.
You can start a Cursor cloud agent and put it in "Grind mode" by selecting "Long-running" in the model selector. Then it will write up a plan and that thing will just keep going and going, grinding away at that prompt until it thinks it's done.

A lot of work on this project was done by me just saying "make all the t1 test family pass" and waiting a day for it to fill up a PR with 100 commits. Super awesome.
/goal mode
The /goal modes in Codex and Claude Code also do this sort of thing, though I found them to be much slower, and Claude's was somewhat confusing and seemed to get stuck a lot.
Codex running in /goal mode seemed to keep working and getting stuff done. Claude seemed to often just hang there and not do much until I intervened. Never quite figured out why, but I started avoiding it.
Claude dynamic workflows
This last week, to finish off the project, I did try out the brand new Claude dynamic workflows in "Ultracode" effort mode.

I do rather like this format for splitting up a big task like "get all the t1 family of tests to pass", though you have to be careful about resource management here too - it will happily thrash your CPU/mem with rustc parallel builds and slow to a halt until you ask it to figure out why it's so slow.
Since I ran out of subscription tokens in the last few days on both Codex and Cursor, I ended up finishing the last few percent of tests over the weekend with a series of dynamic workflows in Claude. If set up properly, it can workhorse through a big, complex list for many, many hours pretty intelligently.
Directed approach is better
One last little hint of advice to wrap things up.
There was a part of me that just wanted to tell a swarm of lightly coordinating agents to just pick a test file and make it work, but I feel like I got the best, fastest gains when I directed the agents to work the way I would have.
Instead of just saying "grab the next test and keep going", it was better to write out how I would approach the problem if I were personally rewriting the project this way - start with the basic plumbing commands, then the next most important commands that depend on them, work from the bottom up. For instance, you don't need to do things like diff formatting output until the very end because it's not really used by anything else. Once you have the basics working, refactoring on top is relatively easier.
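The bottom-up ordering described here is essentially a topological sort over command dependencies. As a rough sketch: the function name bottom_up_order and any dependency edges you feed it are hypothetical and do not describe Grit's real dependency graph. Every dependency is assumed to also appear as an entry of its own, and commands that become ready in the same round are emitted in alphabetical order.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Order commands so each one comes after everything it depends on.
/// `deps` lists (command, commands it builds on). Returns None if the
/// dependencies contain a cycle or mention a command with no entry of its own.
fn bottom_up_order(deps: &[(&str, &[&str])]) -> Option<Vec<String>> {
    // For each command, the set of dependencies not yet scheduled.
    let mut remaining: BTreeMap<String, BTreeSet<String>> = deps
        .iter()
        .map(|(cmd, ds)| (cmd.to_string(), ds.iter().map(|d| d.to_string()).collect()))
        .collect();

    let mut order = Vec::new();
    while !remaining.is_empty() {
        // Commands whose dependencies have all been scheduled already.
        let ready: Vec<String> = remaining
            .iter()
            .filter(|(_, ds)| ds.is_empty())
            .map(|(cmd, _)| cmd.clone())
            .collect();
        if ready.is_empty() {
            return None; // nothing can make progress
        }
        for cmd in &ready {
            remaining.remove(cmd);
        }
        for ds in remaining.values_mut() {
            for cmd in &ready {
                ds.remove(cmd);
            }
        }
        order.extend(ready);
    }
    Some(order)
}
```

Each round of ready commands is also a natural batch to hand to parallel agents, since nothing inside a batch depends on anything else in it.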
Work out how you would approach the problem in detail and then give it over in steps. Every time I deviated from that to try to massively parallelize and not have to think things through, I ran into issues and got bogged down.
License
This is an interesting point that probably also deserves its own blog post.
The Git source code is GPL licensed. The libgit2 code is GPL with a linking exception, since linking is the whole point and people wanted it to be used.
I probably should have pushed harder for libgit2 to be more permissively licensed 20 years ago when nobody cared that much and it was sort of up for debate, and I think most contributors to that project since that time would have preferred it (licensing continues to be an issue to this day), but alas I didn't want to put the energy in at the time.
In looking at the code that the LLMs have produced for the project, especially given the pretty massive and widespread architectural changes needed to make the implementation libified and memory safe, we decided that the codebase is not a derivative work that would require carrying forward the GPL license and have decided to release the code under the MIT license instead.
This might be a little controversial, but ultimately I think it's defensible and more importantly, the best thing for the wider Git community.
Finally
I spent a few weeks in April on this, then paused the project, then finished it off in the first week of June. In total, I probably spent a few hours a day for a total of two or three weeks on this - most of the time stuff was just running in the background and I could do other things, so it was mostly steering, integrating or figuring out what was going wrong.
At the end of the whole experiment, we ended up with:
- 360,000+ LOC
  - 100k in grit-lib, 260k in grit-cli (roughly similar to the LOC in C Git if you exclude headers)
- 500+ pull requests
- 7000+ commits
- 41,715 / 42,001 tests passing (99.3%)
A pretty fun experiment and I think we can shape this into something truly useful to the whole community.
If you want to try out Grit, check out the details on the project's homepage: https://grit-scm.com, and if you're interested in future progress, keep an eye out here for further developments!

Written by Scott Chacon
Scott Chacon is a co-founder of GitHub and GitButler, where he builds innovative tools for modern version control. He has authored Pro Git and spoken globally on Git and software collaboration.



