2015-06-08

Edit: unlike some similar questions, such as Moving a multi-GB SVN repo to Git or http://stackoverflow.com/questions/540535/managing-large-binary-files-with-git, my scenario doesn't involve several subprojects that can easily be converted into git submodules, nor a few very large binary files that are well suited for git-annex. It is a single repository where the binaries are a test suite that is tightly coupled to the main source code of the same revision, much as if they were compile-time assets such as graphics.

I'm investigating switching an old, medium/large-sized code repository (50 users, 60k revisions, 80 GB history, 2 GB working copy) from svn. As the number of users has grown, there is a lot of churn in trunk, and features are often spread across multiple commits, which makes code review hard. Also, without branching there is no way to "gate" bad code out; reviews can only happen after the code is committed to trunk. So I'm looking at alternatives. I was hoping we could move to git, but I'm running into some problems.

The problem with the current repo, as far as git is concerned, is size. There is a lot of old cruft in there, and cleaning it with --filter-branch when converting to git can cut it down by an order of magnitude, to around 5-10 GB. This is still too big. The biggest reason for the large repository size is the many binary documents that are inputs to tests. These files vary between 0.5 MB and 30 MB, and there are hundreds of them. They also change quite a lot. I have looked at submodules, git-annex, etc., but having the tests in a submodule feels wrong, as does using annex for many files for which you want full history.
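For reference, the kind of pruning I've experimented with during conversion looks roughly like this, run on a throwaway clone (the path is just a placeholder for the old cruft being dropped):

```
# Rewrites all branches; only run on a disposable clone.
# 'legacy/old-cruft' is a placeholder path.
git filter-branch --index-filter \
    'git rm -r --cached --ignore-unmatch legacy/old-cruft' \
    --prune-empty --tag-name-filter cat -- --all

# Expire reflogs and repack so the space is actually reclaimed:
git reflog expire --expire=now --all
git gc --prune=now --aggressive
```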

So the distributed nature of git is really what's blocking me from adopting it. I don't really care about distributed; I just want the cheap branching and powerful merging. Like (I assume) 99.9% of git users, we would use a blessed, bare central repository.

I'm not sure I understand why each user has to have a full local history when using git. If the workflow isn't decentralized, what is that data doing on the users' disks? I know that in recent versions of git you can use a shallow clone with only recent history. My question is: is it viable to do this as the standard mode of operation for an entire team? Can git be configured so that clones are always shallow, keeping the full history only centrally while users by default have only, say, 1000 revisions of history? The alternative, of course, would be to convert only 1000 revisions to git and keep the svn repo for archeology. In that scenario, however, we'd encounter the same problem again after the next several thousand revisions to the test documents.
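To be concrete, by "standard mode of operation" I mean something like the following on every developer's machine (the depth of 1000 is just an example, and as far as I know there is no server-side switch that forces clones to be shallow, which is part of what I'm asking):

```
# Clone only the most recent 1000 revisions of history
git clone --depth 1000 git@server:repo.git

# Fetches into a shallow clone stay shallow by default;
# the full history can still be pulled in later if ever needed:
git fetch --unshallow
```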

What is a good practice for using git with large repos containing many binary files that you do want history for? Most best practices and tutorials seem to avoid this case; they either solve the problem of a few huge binaries or propose dropping the binaries entirely.

Is shallow cloning usable as a normal mode of operation, or is it a "hack"?

Could submodules be used for code where there is a tight dependency between the main source revision and the submodule revision (as with compile-time binary dependencies, or a unit test suite)?
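From what I understand, a submodule is pinned to an exact commit in the superproject, so in principle the test-data revision would always follow the source revision, roughly as sketched below (the repository name and path are hypothetical). My concern is more the day-to-day workflow friction than the mechanics:

```
# Record the test data as a submodule pinned to a specific commit
# ('git@server:test-data.git' and 'tests/data' are hypothetical names):
git submodule add git@server:test-data.git tests/data
git commit -m "Pin test data at the revision matching this source tree"

# Every fresh checkout then needs an extra step to get the matching test data:
git submodule update --init
```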

How big is "too big" for a git repository (on premises)? Should we avoid switching if we can get it down to 4 GB? 2 GB?
