2013-01-20

I work in Microsoft’s Developer Division (devdiv). We are the ones who make, among other things, Visual Studio. Our source control lives in a TFS repository (we make TFS too), and it is a very large repository, likely the largest in existence. I really don’t know exactly how big. I do know that the portion I work with is about 13GB spread over dozens of workspace mappings. Before moving to this part of Microsoft, I had been a Git and Mercurial user for three years and had learned to love them. My move to DVCS was more than an upgrade in tooling; it was a paradigm shift that transformed the way I work. I knew I would be using TFS in devdiv but figured I would just use a bridge like git-tfs. After moving to my new TFS home and spending some time intentionally becoming familiar with TFS, I decided to break out git-tfs and get back my branching and merging goodness. It didn’t work. Then I became sad, and then desperate. As I progressed through the various Kübler-Ross stages of grieving, I never did reach the final stage of acceptance. This is the story of the journey that led me to create BigGit.

My hope for the reader

My hope for you, dear reader, is that this will be just a story and that you will have no personal need for BigGit. Unless you are a coworker, chances are you won’t. But I do know there are others out there working in unwieldy TFS repos who want to use git but can’t. Chances are, if you are dealing with a repository this large, you are experiencing friction points that transcend source control tooling. You may come home each night feeling like your nipples have been rubbing against heavy-grade sandpaper all day long. Unless you are one who likes that kind of thing, and I mean no disrespect if you do, hopefully you can apply some BigGit and soothe that chafing.

If you are interested in reading more about my personal experiences and impressions as I migrated from traditional centralized version control to DVCS, please read on. If you just want to quickly learn of a way you can possibly get your large TFS repo under git, then go to http://BigGit.codeplex.com and you will find lots of documentation on how it works and how to use it. I hope it helps.

My Introduction to DVCS

About three years ago, I reluctantly entered the world of distributed version control. I had been using SVN and was very happy. It had only been a couple of years since I had moved off of Visual SourceSafe (VSS), so I was easily pleased. I was finally feeling competent with the basics of branching and merging, which were just bastardized and deformed children in VSS but first-class concepts in SVN. When my organization adopted Mercurial in 2011, it was clear to me that it was the right decision. It was not difficult to read the tea leaves even then and see that DVCS was the future.

The first month of using Mercurial was painful. I not only had to learn new commands but had to internalize a different perspective on source control. This was not SVN: everyone had their own complete repo, not just a “working folder,” and you had to deal with three-way merges, which took a bit to grok. There were times when I wondered if it was worth it and wished I could have my familiar SVN back. However, I got the sense that I was working with something very powerful, and if I could master it, I’d have a lot to gain.

I was not alone. In a team room with four developers, I probably ranked #2 in satisfaction. During our first month of ramp-up, mistakes were made. The kind of mistakes most newcomers make and that this post cannot keep you from making. You will try to do something fancy because you can, something like a partial rollback or cherry-picking commits, and you will make a mistake that either deletes history or puts your repository into some weird and esoteric state that takes hours to dig out of. You will want to blame the new stupid source control tool, and perhaps you will, to make yourself feel better.

Switching to the Command Line

As a SourceSafe, SVN, and early Mercurial user, I primarily used GUI tools to manage my source code. As DVCS systems go, Mercurial has a decent GUI. However, as I learned more about Mercurial and delved into troubleshooting adventures, I grew uncomfortable with the way the GUI tools hid the underlying operations from me. When I click the “sync” button, what really happens? After all, there is no Mercurial “sync” command. I roughly knew about the concepts of pull/fetch/update, but the GUI hid the details from me. This was all well and good until something went wrong, and as I delved into the problem I found that you really do need to understand these concepts if you are going to work in the system day in and day out. Perhaps the more casual user can cope with a GUI, but working in an active team of developers on various branches, I really felt a craving to understand how the system worked.

So I opened up the command line, put the hg directory in my path, and never looked back. I personally believe this was not only a practice that improved my Mercurial competence, but the gateway to command line competence in general, leading me to PowerShell and making me a better and more productive developer. And when we become better and more productive, we become happier.

When people new to DVCS would come to me with problems, they were almost always using a GUI exclusively and had bumped into something they could possibly solve with the GUI but could not understand through the artificial, dumbed-down abstractions the GUI enforces. In order to properly articulate the problem and its solution, one needs to understand the commands and switches the GUI is calling. I always tell people to put down the GUI, and then they look at me with crazy eyes. I love the crazy eyes. After the next couple of blunders that we all make, I see that command line on their screen.

It’s not so bad, the command line. Quickly you find that most days you use no more than a half-dozen commands anyway, and that the GUI does not save much time. As you become more competent, you use the GUI where it shines and stick to the command line for the rest.

DVCS Bliss: Branching and merging and time traveling

So what made me become such a firm advocate of Mercurial and eventually Git? There are so many things and I’ll quickly list some here:

Most operations are faster because they are carried out locally. Commits, branch creation and switching, examining diffs and history – these are all local operations that do not need to make a server call.

Getting a repo up and running is trivial. When I worked in the MSDN org, the MSDN/TechNet forums, profile pages, search code, gallery platform (Visual Studio Gallery, MSDN Samples, etc.), and more were using Mercurial. The “central” repository that everyone pushed to and the build system pulled from was maintained by the development team. As one of the primary administrators, I can safely say that I spent about 5 minutes a month on administrative tasks, and it never went down or suffered performance issues. This is due to its simplicity. It’s just files on disk.

The local repository is portable. I can move it around or delete it and there is no “server” component that cares about or tracks its location.

These are nice benefits, but the biggest value of DVCS to me is the ease and speed of creating branches. This is another area where git outshines Mercurial: while branches are just as fast and easy to create in both, you can easily delete them in git, while they can only be “archived” in Mercurial. Both git and Mercurial use excellent compression and merge-tracking strategies, so creating branches in either is incredibly lightweight from a disk space perspective. Since all branches are local, I can switch from branch to branch without an expensive server call to reassemble the branch. Using Mercurial or git, two 1GB branches that differ by only a few files will probably take up not much more than 1GB total, while TFS requires me to keep a full copy of each.

The merging algorithms of git and Mercurial are also better than either SVN’s or TFS’s. One of the things that delighted me almost immediately when I made the move from SVN to Mercurial was that merge conflicts dropped dramatically. I have had similar experiences moving between TFS and git. In systems like TFS and SVN, you might think twice about branching because of the overhead and friction it can produce, while git and Mercurial make it so easy that branching is often central to the workflow. It is very empowering to know that at any moment I can create a branch and do a bunch of experimentation. Later I can discard the branch or merge the changes back to my original branch. It is also easy to share branches: in a single command I can “push” a branch to a central repository where my fellow contributors can pull it and view my changes. Heck, you don’t even need a central repo; you can push/pull peer to peer.
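For instance, a typical throwaway experiment in git is just a handful of local commands (the branch name here is arbitrary):

git checkout -b experiment      # create and switch to a new local branch
# ...hack away, committing as often as you like...
git checkout master
git merge experiment            # keep the changes...
git branch -D experiment        # ...or throw the branch away without merging
git push origin experiment      # or push the branch so teammates can pull it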

Git and TFS interoperability

Clearly git has become widely popular among the greater developer community. Many developers find themselves using TFS during the day and git in their off-hours work. For those who work in organizations using TFS, there are tools like git-tfs or git-tf that allow one to work in a git repository and bridge changes back and forth between git and TFS. They let you target a TFS server path and clone it to a git repository. You can do your day-to-day work in the git repo and then “checkin” to TFS periodically. These bridging tools typically work great and make the transitions between git and TFS fairly transparent. I work on a couple of small, isolated pockets of our TFS repository at work, and I use git-tf to clone those areas of TFS to git. Once the directory is cloned, I am in a normal git repo. There are a lot of developers who work like this without a problem.
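For example, cloning one TFS folder into a git repo with git-tf looks roughly like this (the collection URL and server path are placeholders):

git tf clone http://myserver:8080/tfs/DefaultCollection $/MyProject/ComponentA --deep
cd ComponentA
# ...normal git work: branch, commit, merge...
git tf checkin       # push the accumulated commits back to TFS as a changeset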

Large or dispersed repositories

There are a couple of scenarios where this interoperability does not work so nicely. One is when your TFS workspace has a lot of mappings. Both git-tfs and git-tf can only clone a single TFS server folder. This is fine if all of your mappings fall under a single root folder of manageable size. However, these multi-mapping workspaces often have so many mappings precisely because the root is too large to map on its own. In my case, my 500GB hard drive would run out of space before I could finish syncing my root folder.

This touches on the second scenario preventing git from playing nicely with large repositories. Now, large is a relative and subjective term; let’s qualify large as greater than 5GB. The raw size is not the key issue but rather the number of files. Many of git’s key operations perform a file-by-file scan of the repository to determine the status of each file. If you have hundreds of thousands of files, this can be a real bummer. Keep in mind that git was built to provide source control for the Linux kernel, which is roughly a couple hundred megabytes. That provides great performance for the vast majority of source repositories in the world today, but if you work in a repository like mine, the perf can be abysmal.

Repository Organization

If you look at some large enterprise TFS or P4 repositories, it is not uncommon to find everything placed under a single repository. That works just fine for these systems, and it can be easier and simpler to administer them this way. Because these are centralized version control systems, the server acts as the single authoritative keeper of the assets. This server copy is not a file system but a database. In a large enterprise model where teams of DBAs administer the corporate databases, it makes sense to keep everything in a single data store. 500GB is not a large database, and putting everything in one repo simplifies a lot of the maintenance tasks. Most centralized systems also provide a protocol where the user specifies the set of files to be acted upon. By doing this, there is no need to do a file scan because you are telling the system which files will be affected.

Of course, many git operations can have individual files specified, and many TFS commands can act recursively on an entire repository. However, key operations like git status, commit, and blame need to determine the status of every file in the repo. And on the TFS side, users become trained NOT to do a tf get . /r on the root of a 50GB folder. Do it once and, trust me, you won’t want to do it again.

Having worked under both models, I think there is a lot more to be said for keeping a repository small than just improved source control performance. It limits the intellectual territory of concern. For someone like myself with severely limited intellectual resources, this really matters. Honestly, it has a tangible impact on our ability to grok source. Here are some scenarios to consider:

You need to find code related to a bug. I often find myself looking for a needle in a haystack unless I am very familiar with the nature of the bug and the related source. It is just too easy to end up with a less-than-tidy code structure in a single tree.

In this “shared code” folder, what do I care about? If a single repo is maintained by a large division, there might be one or a small number of TFS folders for shared assets like runtimes, dependent libraries, and build tools. Eventually these folders become huge and often consume the bulk of the space, and no single person or team can keep track of everything in them. One might map to such a folder but have no idea what they actually need and what they don’t. In TFS one can create “cloaks” to hide particularly large subdirectories, but that still leaves a long tail of perhaps hundreds of small unused directories that consume gigabytes of space. Who wants to create hundreds of cloaks, or a hundred individual mappings for just the directories they need?

Of course one can organize a TFS repository into logical subtrees, and that’s great if you can. However, it is just too easy to cross intellectual boundaries, and having several repositories is a nice forcing function to keep the assets organized. Even if this means some duplication, I think that is a perfectly acceptable compromise for being able to quickly pull the code I care about. Disk space is much cheaper than developer hours spent looking for code.

Some possible ways to work with large git repos that did not work for me

So, months ago, when I began my investigation into how to get my huge TFS workspace under git control, I came across a lot of advice given to individuals in similar circumstances. This advice was most commonly one, or a combination, of the following.

Garbage Collect the Repo

Think of this as a defrag tool for git. Over the course of working in a git repository, objects become orphaned as they are dereferenced. For example, let’s say you discard the last several commits with a git reset --hard, or you delete a branch. Those objects do not disappear immediately; it may be weeks before git reclaims unreferenced objects. So it is possible that your repository holds a bunch of objects that will never be used again and can be discarded. If you want to invoke the git garbage collector immediately and trim away all unused objects, call:

git gc --aggressive

This may take several minutes or even hours on a very large repository. It is possible that this can improve performance but if you find performance is poor right after a clone, it is unlikely that a GC is going to yield much of an improvement.

Sparse Checkout

This is relatively new as of git 1.7. It allows you to clone a repository and then specify the part of the tree you want to keep; git leaves the rest out of your working tree. This may solve some scenarios, but it did not address my problem because I needed all 13GB of source.
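For reference, a sparse checkout is set up with a config flag and a path list (the paths below are illustrative):

git config core.sparseCheckout true
# list the subtrees you want to keep, one per line
Add-Content .git/info/sparse-checkout "src/ComponentA/"
Add-Content .git/info/sparse-checkout "docs/"
git read-tree -mu HEAD     # prune the working tree down to just those paths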

Submodules

Submodules are separate git repositories that you can include in your repository. The intent is to include libraries in your project that have their own git repository. You may be working on project Foo, which uses the Bar library. Bar has its own git repository that you want to contribute to and get changes from within your Foo repository, so you clone Bar into a subdirectory of Foo. This can work fine, especially for the scenario it was intended for.
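In the Foo/Bar example, that looks something like this (the URL and path are placeholders):

# from the root of the Foo repository
git submodule add https://example.com/git/bar.git lib/Bar
git commit -m "Add the Bar library as a submodule"
# teammates who clone Foo then populate the submodule with:
git submodule update --init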

But I didn’t want to have to maintain each second-level directory as a separate repository. This can be especially awkward when branching: each submodule has to be branched separately. I can’t just branch the top level and expect everything below to be branched. Submodules are really intended for working with logically separate libraries. Now, if my large repository were structured in such a way that I did my own work in one collection of subdirectories and another team worked exclusively in their own dedicated subdirectory, submodules might make more sense, but that was not the case for me.

Splitting the repository

Now we come to the solution that I finally settled on. After spending a couple of months trying different things with git-tf, git-tfs, and an internal Microsoft tool, experimenting with various combinations of the above, and not being satisfied with anything, I pretty much gave up trying to work in git. I found I was wasting too much time “fighting” the system and needed to move on and actually get some work done. Then, about two months ago, I had a flash of insight into an idea I thought might work, and it did! This solution has one major assumption that must hold in order to work: a significant chunk of your TFS repository is content that you do not regularly or directly contribute to and that does not change frequently.

Well over 80% of my repository contains external libraries, build tools, and native runtime bits. This is content that rarely changes and is not source that I contribute to or need to pull regularly. Once I have version X of the C runtime, I don’t need to sync it every day. Part of this discovery emerged as I learned more and more about the nature and content of our repository. As a new member of the org, staring at a repo with millions of files, not to mention trying to understand a new product, I really had no idea what I did and did not need. I was pointed to a set of workspace mappings that I was told was what I needed and worked with that. Over time I became more familiar with various parts, but large swaths were still a mystery to me. So I took a weekend and, with the help of some Sysinternals tools like Process Monitor, watched what happened as I ran several builds. I learned what I needed to know: which bits my build actually uses and how it uses them.

Roughly, this is what I did (a command-line sketch of the whole sequence follows these steps):

I created one TFS workspace and added all the mappings from my original workspace that contain files my team does not contribute to and that rarely change. I call this the Nonvolatile workspace.

I created another TFS workspace for the remainder of the mappings, the ones that I and the larger team actually churn on and need to sync regularly. I call this the Volatile workspace.

Ideally, I’d like a git repository that tracks all the files in the Volatile workspace, but I have two problems: 1) git-tf and other bridging tools do not map to the mappings in a workspace; they need to map to a single server path (like $/root/branch/path), and 2) I need the contents of both workspaces in one directory tree so that they can reference each other correctly. The following steps take care of this.

I created a TFS “partial” branch in my own feature area from my Volatile mappings. This is a way to create a branch that branches just a selective set of folders from the parent folder. It is documented by my coworker Chandru Ramakrishnan, and you can find more examples of creating partial branches in the BigGit source, which creates them for you. This solves the first problem above and allows me to map a git repository to a single TFS folder, $/root/branches/mybranch, and get just my Volatile content.

I used git-tf to clone my partial branch folder to a new git repository. However, I can’t build from this yet since it is missing all the dependencies in my Nonvolatile workspace.

Next, for each top-level directory in my Nonvolatile workspace, I created a symbolic link in my git repository and added each of these directories to my .git\info\exclude file. This is the equivalent of a .gitignore file, but it is not versioned and stays private to my own repo. This solves the second problem by putting all my content in one tree while git never needs to see the nonvolatile content.

Lastly, I want a git remote where I can push/pull all my git branches; I don’t feel safe keeping it all on my local machine, and git-tf only syncs the master branch with the TFS server. So I clone my git repository to a file server (//someserver/share) and make this my origin remote.
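Pulled together, the whole sequence looks roughly like this at the command line. Every server URL, server path, local path, and workspace name below is purely illustrative; BigGit exists to automate exactly this:

# steps 1 and 2: one workspace for the rarely changing content, one for the active content
tf workspace /new NonVolatile /collection:http://myserver:8080/tfs/DefaultCollection /noprompt
tf workfold /map $/Root/ExternalLibs C:\src\nonvolatile\ExternalLibs /workspace:NonVolatile
tf workspace /new Volatile /collection:http://myserver:8080/tfs/DefaultCollection /noprompt
tf workfold /map $/Root/MyTeam C:\src\volatile\MyTeam /workspace:Volatile

# step 3: a partial branch containing just the volatile folders
tf branch $/Root/MyTeam $/Root/branches/mybranch/MyTeam /checkin

# step 4: clone the partial branch into a new git repository
cd C:\src
git tf clone http://myserver:8080/tfs/DefaultCollection $/Root/branches/mybranch bigrepo --deep

# step 5: link each nonvolatile top-level directory into the tree and hide it from git
cmd /c mklink /d C:\src\bigrepo\ExternalLibs C:\src\nonvolatile\ExternalLibs
Add-Content C:\src\bigrepo\.git\info\exclude "ExternalLibs/"

# step 6: a bare clone on a file share becomes the origin for all git branches
git clone --bare C:\src\bigrepo \\someserver\share\bigrepo.git
cd C:\src\bigrepo
git remote add origin \\someserver\share\bigrepo.git
git push origin --all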

Here is a diagram to illustrate this configuration:



My volatile content that is now tracked by git is about 1.5GB. Yes, a big repository, but still very manageable in git. This is great! Look at meeee! I am working in git!! But things are still a bit awkward.

That was a lot of steps to set up. What if I want to do the same on another machine or help others with the same setup? I want to be able to repeat this without errors and without having to look up how I did it.

Syncing changes between my git repo and the parent TFS repository is a lot of work. I have to sync my partial branch using git-tf and then do a TFS merge from the partial branch to the parent folder. If I want to get the latest changes from TFS, I need to do a get in the parent workspace, merge that to my partial branch, check the merge in, and then use git-tf to pull my partial branch changes into git. Wow. Now I need a nap.
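Spelled out with the same illustrative paths as the sketch above, one full round trip looks something like this:

# push local git commits out to the TFS partial branch
git tf checkin
# merge the partial branch up into its parent and check that in
tf merge $/Root/branches/mybranch/MyTeam $/Root/MyTeam /recursive
tf checkin

# going the other way: pull the latest TFS changes down into git
tf get $/Root/MyTeam /recursive
tf merge $/Root/MyTeam $/Root/branches/mybranch/MyTeam /recursive
tf checkin
git tf pull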

Automating away complexity with BigGit

BigGit is a PowerShell module I wrote to solve the setup and syncing of this style of repository layout. By importing this module into your PowerShell session, you can either use several of BigGit’s functions to craft your repository or, more likely, use the convenience Install-BigGit function to set up your repo in one step. You just need to know which mappings to direct to your volatile and nonvolatile workspaces, and BigGit will create everything for you. Even if your parent workspace is relatively small but requires a bunch of mappings, you can omit the nonvolatile mappings and BigGit will just set up all mappings as volatile, solving the multi-mapping problem. Here is an example of using BigGit to set up your repository:
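(The server paths and parameter names in this sketch are illustrative rather than the module’s exact signature; see the codeplex documentation for the real syntax.)

Import-Module BigGit

# mappings my team actively churns on (variable and parameter names are illustrative)
$volatile = '$/Root/MyTeam', '$/Root/Shared/Contracts'
# mappings for external libraries, build tools, and runtimes that rarely change
$nonVolatile = '$/Root/ExternalLibs', '$/Root/BuildTools'

Install-BigGit -TfsUrl "http://myserver:8080/tfs/DefaultCollection" `
               -LocalRoot C:\src\big `
               -VolatileMappings $volatile `
               -NonVolatileMappings $nonVolatile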

Here we compose our mappings into separate PowerShell variables and then use the Install-BigGit function to set everything up. Depending on the size of your original workspace, this could take hours to run, but it should be a one-time cost. Once this is done, you can sync your changes to TFS using one command that automates all the git-tf syncing and TFS merging discussed above:
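(The function name below is a placeholder; the codeplex documentation lists the actual command.)

# placeholder function name; consult the BigGit docs for the real one
Sync-BigGit -CheckinComment "Sync my git commits out to the parent TFS branch"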

This DOES NOT do the final checkin to TFS. It stops at merging the changes into your parent TFS workspace. I thought it would be more desirable to let the user do the final TFS checkin and sanity check the files being checked in.

To pull in the latest changes from TFS, use:
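(Again, the function name is a placeholder; the codeplex documentation lists the real one.)

# placeholder function name; gets latest in the parent workspace, merges into the partial branch, then pulls into git
Update-BigGit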

It is easy to install BigGit. While you are welcome to download or clone the source from codeplex (and contribute too!!), you may find it easier to install it via Chocolatey:
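Assuming the package id matches the project name:

cinst BigGit     # Chocolatey install; package id assumed to be BigGit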

This will download just the module and modify your profile so that BigGit functions are always accessible.

If you want to find out more about BigGit, view the source code, or discover the functions it exposes and read their command-line documentation, you can find all of that at the codeplex project site. I hope this helps others who find themselves in the same predicament I was in. If you think that BigGit could stand some enhancements, or that any analysis in this post is incorrect or lacking, I welcome both comments and pull requests.

Related reading

Why Perforce is more scalable than Git by Steve Hanov

Facebook’s benchmark tests using Git

Is Git recommended for large (>250GB) content repositories on StackOverflow
